/*-------------------------------------------------------------------------
 *
 * execnodes.h
 *	  definitions for executor state nodes
 *
 *
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/nodes/execnodes.h
 *
 *-------------------------------------------------------------------------
 */

#ifndef EXECNODES_H
#define EXECNODES_H

Implement table partitioning.
Table partitioning is like table inheritance and reuses much of the
existing infrastructure, but there are some important differences.
The parent is called a partitioned table and is always empty; it may
not have indexes or non-inherited constraints, since those make no
sense for a relation with no data of its own. The children are called
partitions and contain all of the actual data. Each partition has an
implicit partitioning constraint. Multiple inheritance is not
allowed, and partitioning and inheritance can't be mixed. Partitions
can't have extra columns and may not allow nulls unless the parent
does. Tuples inserted into the parent are automatically routed to the
correct partition, so tuple-routing ON INSERT triggers are not needed.
Tuple routing isn't yet supported for partitions which are foreign
tables, and it doesn't handle updates that cross partition boundaries.
Currently, tables can be range-partitioned or list-partitioned. List
partitioning is limited to a single column, but range partitioning can
involve multiple columns. A partitioning "column" can be an
expression.
Because table partitioning is less general than table inheritance, it
is hoped that it will be easier to reason about properties of
partitions, and therefore that this will serve as a better foundation
for a variety of possible optimizations, including query planner
optimizations. The tuple routing which this patch does based on
the implicit partitioning constraints is an example of this, but it
seems likely that many other useful optimizations are also possible.
Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat,
Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova,
Rushabh Lathia, Erik Rijkers, among others. Minor revisions by me.
2016-12-07 19:17:43 +01:00
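The automatic routing described above can be illustrated with a toy sketch. This is not PostgreSQL's implementation: the name `route_by_range` and the flat array of bounds are hypothetical, and the real code evaluates implicit partition constraints over arbitrary expressions. For single-column range partitions with ascending exclusive upper bounds:

```c
/*
 * Toy sketch of tuple routing for single-column range partitions.
 * upper_bounds[] holds each partition's exclusive upper bound in
 * ascending order, so the implicit constraint of partition i is
 * upper_bounds[i-1] <= key < upper_bounds[i] (partition 0 has an
 * open lower bound here).  Returns the partition index, or -1 if
 * no partition accepts the key.
 */
static int
route_by_range(const int *upper_bounds, int npartitions, int key)
{
	for (int i = 0; i < npartitions; i++)
	{
		if (key < upper_bounds[i])
			return i;
	}
	return -1;					/* would raise "no partition found" */
}
```

A key outside every bound yields -1, mirroring the error the server raises when no partition matches.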
#include "access/tupconvert.h"
#include "executor/instrument.h"
#include "fmgr.h"
#include "lib/pairingheap.h"
#include "nodes/params.h"
#include "nodes/plannodes.h"

Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.
We still require AccessExclusiveLock on the partition itself, because
otherwise an insert that violates the newly-imposed partition
constraint could be in progress at the same time that we're changing
that constraint; only the lock level on the parent relation is
weakened.
To make this safe, we have to cope with (at least) three separate
problems. First, relevant DDL might commit while we're in the process
of building a PartitionDesc. If so, find_inheritance_children() might
see a new partition while the RELOID system cache still has the old
partition bound cached, and even before invalidation messages have
been queued. To fix that, if we see that the pg_class tuple seems to
be missing or to have a null relpartbound, refetch the value directly
from the table. We can't get the wrong value, because DETACH PARTITION
still requires AccessExclusiveLock throughout; if we ever want to
change that, this will need more thought. In testing, I found it quite
difficult to hit even the null-relpartbound case; the race condition
is extremely tight, but the theoretical risk is there.
Second, successive calls to RelationGetPartitionDesc might not return
the same answer. The query planner will get confused if looking up the
PartitionDesc for a particular relation does not return a consistent
answer for the entire duration of query planning. Likewise, query
execution will get confused if the same relation seems to have a
different PartitionDesc at different times. Invent a new
PartitionDirectory concept and use it to ensure consistency. This
ensures that a single invocation of either the planner or the executor
sees the same view of the PartitionDesc from beginning to end, but it
does not guarantee that the planner and the executor see the same
view. Since this allows pointers to old PartitionDesc entries to
survive even after a relcache rebuild, also postpone removing the old
PartitionDesc entry until we're certain no one is using it.
For the most part, it seems to be OK for the planner and executor to
have different views of the PartitionDesc, because the executor will
just ignore any concurrently added partitions which were unknown at
plan time; those partitions won't be part of the inheritance
expansion, but invalidation messages will trigger replanning at some
point. Normally, this happens by the time the very next command is
executed, but if the next command acquires no locks and executes a
prepared query, it can manage not to notice until a new transaction is
started. We might want to tighten that up, but it's material for a
separate patch. There would still be a small window where a query
that started just after an ATTACH PARTITION command committed might
fail to notice its results -- but only if the command starts before
the commit has been acknowledged to the user. All in all, the warts
here around serializability seem small enough to be worth accepting
for the considerable advantage of being able to add partitions without
a full table lock.
Although in general the consequences of new partitions showing up
between planning and execution are limited to the query not noticing
the new partitions, run-time partition pruning will get confused in
that case, so that's the third problem that this patch fixes.
Run-time partition pruning assumes that indexes into the PartitionDesc
are stable between planning and execution. So, add code so that if
new partitions are added between plan time and execution time, the
indexes stored in the subplan_map[] and subpart_map[] arrays within
the plan's PartitionedRelPruneInfo get adjusted accordingly. There
does not seem to be a simple way to generalize this scheme to cope
with partitions that are removed, mostly because they could then get
added back again with different bounds, but it works OK for added
partitions.
This code does not try to ensure that every backend participating in
a parallel query sees the same view of the PartitionDesc. That
currently doesn't matter, because we never pass PartitionDesc
indexes between backends. Each backend will ignore the concurrently
added partitions which it notices, and it doesn't matter if different
backends are ignoring different sets of concurrently added partitions.
If in the future that matters, for example because we allow writes in
parallel query and want all participants to do tuple routing to the same
set of partitions, the PartitionDirectory concept could be improved to
share PartitionDescs across backends. There is a draft patch to
serialize and restore PartitionDescs on the thread where this patch
was discussed, which may be a useful place to start.
Patch by me. Thanks to Alvaro Herrera, David Rowley, Simon Riggs,
Amit Langote, and Michael Paquier for discussion, and to Alvaro
Herrera for some review.
Discussion: http://postgr.es/m/CA+Tgmobt2upbSocvvDej3yzokd7AkiT+PvgFH+a9-5VV1oJNSQ@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com
2019-03-07 17:13:12 +01:00
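A minimal sketch of the subplan_map[] adjustment described above. The names and the use of plain ints for partition identities are hypothetical, not the actual PartitionedRelPruneInfo code; the point is that the execution-time PartitionDesc contains the plan-time partitions in the same relative order, possibly with new ones interleaved, and concurrently added partitions simply get no subplan:

```c
/*
 * Toy sketch of remapping plan-time subplan indexes onto an
 * execution-time PartitionDesc that may contain concurrently added
 * partitions.  old_ids/new_ids identify partitions (stand-ins for
 * OIDs); new_ids preserves the relative order of old_ids.  Added
 * partitions get -1, i.e. no subplan: the executor ignores them.
 */
static void
remap_subplan_map(const int *old_ids, const int *old_map, int n_old,
				  const int *new_ids, int *new_map, int n_new)
{
	int			j = 0;

	for (int i = 0; i < n_new; i++)
	{
		if (j < n_old && new_ids[i] == old_ids[j])
			new_map[i] = old_map[j++];
		else
			new_map[i] = -1;	/* unknown at plan time */
	}
}
```

This one-pass merge works precisely because partitions can only be added, never removed, under the weakened lock.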
#include "partitioning/partdefs.h"
#include "utils/hsearch.h"
#include "utils/queryenvironment.h"
#include "utils/reltrigger.h"

Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
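The batch split mentioned above relies on every participant computing the same batch for a given tuple without communicating. A toy sketch of that idea with hypothetical names (the real hash join code derives both numbers from the tuple's hash value in a similar spirit): assuming the bucket and batch counts are powers of two, disjoint bit ranges of the hash select the bucket and the batch:

```c
/*
 * Toy sketch: derive bucket and batch numbers from a tuple's hash
 * value using disjoint bit ranges, so all cooperating processes agree
 * on each tuple's batch without any coordination.  nbuckets and nbatch
 * are assumed to be powers of two; log2_nbuckets == log2(nbuckets).
 */
static void
toy_bucket_and_batch(unsigned int hashvalue, int log2_nbuckets,
					 int nbuckets, int nbatch,
					 int *bucketno, int *batchno)
{
	*bucketno = (int) (hashvalue & (unsigned int) (nbuckets - 1));
	*batchno = (int) ((hashvalue >> log2_nbuckets) &
					  (unsigned int) (nbatch - 1));
}
```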
#include "utils/sharedtuplestore.h"
#include "utils/snapshot.h"
#include "utils/sortsupport.h"
#include "utils/tuplestore.h"

Support GROUPING SETS, CUBE and ROLLUP.
This SQL standard functionality allows aggregating data by different
GROUP BY clauses at once. For each grouping set, the columns that it
does not group by are returned as NULL.
This could previously be achieved by doing each grouping as a separate
query, conjoined by UNION ALLs. Besides being considerably more concise,
grouping sets will in many cases be faster, requiring only one scan over
the underlying data.
The current implementation of grouping sets only supports using sorting
for input. Individual sets that share a sort order are computed in one
pass. If there are sets that don't share a sort order, additional sort &
aggregation steps are performed. These additional passes are sourced by
the previous sort step; thus avoiding repeated scans of the source data.
The code is structured in a way that adding support for purely using
hash aggregation or a mix of hashing and sorting is possible. Sorting
was chosen to be supported first, as it is the most generic method of
implementation.
Instead of, as in earlier versions of the patch, representing the
chain of sort and aggregation steps as full blown planner and executor
nodes, all but the first sort are performed inside the aggregation node
itself. This avoids the need to do some unusual gymnastics to handle
having to return aggregated and non-aggregated tuples from underlying
nodes, as well as having to shut down underlying nodes early to limit
memory usage. The optimizer still builds Sort/Agg nodes to describe each
phase, but they're not part of the plan tree; instead they are additional
data for the aggregation node. They're a convenient and preexisting way
to describe aggregation and sorting. The first (and possibly only) sort
step is still performed as a separate execution step. That retains
similarity with existing group by plans, makes rescans fairly simple,
avoids very deep plans (leading to slow explains) and easily allows
skipping the sorting step if the underlying data is sorted by other means.
A somewhat ugly side of this patch is having to deal with a grammar
ambiguity between the new CUBE keyword and the cube extension/functions
named cube (and rollup). To avoid breaking existing deployments of the
cube extension it has not been renamed, nor has cube been made a
reserved keyword. Instead precedence hacking is used to make GROUP BY
cube(..) refer to the CUBE grouping sets feature, and not the function
cube(). To actually group by a function cube(), unlikely as that might
be, the function name has to be quoted.
Needs a catversion bump because stored rules may change.
Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund
Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas
Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule
Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com
2015-05-16 03:40:59 +02:00
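A small sketch of the pass-sharing rule described above (a hypothetical helper, not the planner's actual code, and it simplifies by treating a grouping set as an ordered column list): with sorted input, a grouping set can be computed in a pass sorted by a given column order whenever its columns form a prefix of that order, since every group is then contiguous in the sorted stream:

```c
/*
 * Toy sketch: can this grouping set be computed in one pass over input
 * sorted by sortcols[]?  Yes iff the set's columns are a prefix of the
 * sort order, so group boundaries fall on sort boundaries.  Columns
 * are identified by small integers here.
 */
static int
set_fits_sort_order(const int *setcols, int nsetcols,
					const int *sortcols, int nsortcols)
{
	if (nsetcols > nsortcols)
		return 0;
	for (int i = 0; i < nsetcols; i++)
	{
		if (setcols[i] != sortcols[i])
			return 0;
	}
	return 1;
}
```

Sets that fail this test are the ones requiring the additional sort & aggregation passes mentioned above.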
#include "utils/tuplesort.h"

Support parallel bitmap heap scans.
The index is scanned by a single process, but then all cooperating
processes can iterate jointly over the resulting set of heap blocks.
In the future, we might also want to support using a parallel bitmap
index scan to set up for a parallel bitmap heap scan, but that's a
job for another day.
Dilip Kumar, with some corrections and cosmetic changes by me. The
larger patch set of which this is a part has been reviewed and tested
by (at least) Andres Freund, Amit Khandekar, Tushar Ahuja, Rafia
Sabih, Haribabu Kommi, Thomas Munro, and me.
Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 18:05:43 +01:00
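The joint iteration described above can be sketched with an atomic counter. This is a simplification with hypothetical names; the real shared iterator hands out the bitmap's heap blocks through shared memory, but the invariant is the same: each block is claimed by exactly one worker:

```c
#include <stdatomic.h>

/*
 * Toy sketch of joint iteration over a bitmap's heap blocks: each
 * cooperating worker atomically claims the next unprocessed index,
 * so every block is handed out exactly once across all workers.
 */
typedef struct ToySharedIterator
{
	_Atomic int next;			/* next unclaimed block index */
	int			nblocks;		/* total blocks in the bitmap result */
} ToySharedIterator;

static int
toy_iterate(ToySharedIterator *it)
{
	int			idx = atomic_fetch_add(&it->next, 1);

	return (idx < it->nblocks) ? idx : -1;	/* -1 means iteration done */
}
```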
#include "nodes/tidbitmap.h"
#include "storage/condition_variable.h"

Rearrange execution of PARAM_EXTERN Params for plpgsql's benefit.
This patch does three interrelated things:
* Create a new expression execution step type EEOP_PARAM_CALLBACK
and add the infrastructure needed for add-on modules to generate that.
As discussed, the best control mechanism for that seems to be to add
another hook function to ParamListInfo, which will be called by
ExecInitExpr if it's supplied and a PARAM_EXTERN Param is found.
For stand-alone expressions, we add a new entry point to allow the
ParamListInfo to be specified directly, since it can't be retrieved
from the parent plan node's EState.
* Redesign the API for the ParamListInfo paramFetch hook so that the
ParamExternData array can be entirely virtual. This also lets us get rid
of ParamListInfo.paramMask, instead leaving it to the paramFetch hook to
decide which param IDs should be accessible or not. plpgsql_param_fetch
was already doing the identical masking check, so having callers do it too
seemed redundant. While I was at it, I added a "speculative" flag to
paramFetch that the planner can specify as TRUE to avoid unwanted failures.
This solves an ancient problem for plpgsql that it couldn't provide values
of non-DTYPE_VAR variables to the planner for fear of triggering premature
"record not assigned yet" or "field not found" errors during planning.
* Rework plpgsql to get rid of the need for "unshared" parameter lists,
by dint of turning the single ParamListInfo per estate into a nearly
read-only data structure that doesn't instantiate any per-variable data.
Instead, the paramFetch hook controls access to per-variable data and can
make the right decisions on the fly, replacing the cases that we used to
need multiple ParamListInfos for. This might perhaps have been a
performance loss on its own, but by using a paramCompile hook we can
bypass plpgsql_param_fetch entirely during normal query execution.
(It's now only called when, eg, we copy the ParamListInfo into a cursor
portal. copyParamList() or SerializeParamList() effectively instantiate
the virtual parameter array as a simple physical array without a
paramFetch hook, which is what we want in those cases.) This allows
reverting most of commit 6c82d8d1f, though I kept the cosmetic
code-consolidation aspects of that (eg the assign_simple_var function).
Performance testing shows this to be at worst a break-even change,
and it can provide wins ranging up to 20% in test cases involving
accesses to fields of "record" variables. The fact that values of
such variables can now be exposed to the planner might produce wins
in some situations, too, but I've not pursued that angle.
In passing, remove the "parent" pointer from the arguments to
ExecInitExprRec and related functions, instead storing that pointer in a
transient field in ExprState. The ParamListInfo pointer for a stand-alone
expression is handled the same way; we'd otherwise have had to add
yet another recursively-passed-down argument in expression compilation.
Discussion: https://postgr.es/m/32589.1513706441@sss.pgh.pa.us
2017-12-21 18:57:41 +01:00
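A toy sketch of the "entirely virtual" parameter array idea from the commit above. The names here are hypothetical; the real API involves ParamListInfo, its paramFetch hook, and a workspace ParamExternData slot. The consumer calls a hook instead of indexing a materialized array, so values can be produced on demand and inaccessible param IDs can be masked by the hook itself:

```c
/*
 * Toy sketch of a virtual parameter list: values are produced on
 * demand by a fetch hook rather than stored in a physical array, and
 * the hook decides which param IDs are accessible.  Returns 0 on
 * success, -1 for a masked or unknown param.
 */
typedef struct ToyParamList
{
	int			(*fetch) (struct ToyParamList *params, int paramid,
						  int *value);
	const int  *backing;		/* per-variable data the hook draws on */
	int			nparams;
} ToyParamList;

static int
toy_fetch(ToyParamList *params, int paramid, int *value)
{
	if (paramid < 0 || paramid >= params->nparams)
		return -1;				/* hook masks out-of-range IDs itself */
	*value = params->backing[paramid];
	return 0;
}
```

Copying such a list into a long-lived structure would instantiate the values physically, which parallels what copyParamList()/SerializeParamList() do in the commit message above.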
struct PlanState;				/* forward references in this file */
struct PartitionRoutingInfo;
struct ParallelHashJoinState;
struct ExecRowMark;
struct ExprState;
struct ExprContext;
struct RangeTblEntry;			/* avoid including parsenodes.h here */
struct ExprEvalStep;			/* avoid including execExpr.h everywhere */
tableam: Add table_multi_insert() and revamp/speed-up COPY FROM buffering.
This adds table_multi_insert(), and converts COPY FROM, the only user
of heap_multi_insert, to it.
A simple conversion of COPY FROM to use slots would have yielded a
slowdown when inserting into a partitioned table for some
workloads. Different partitions might need different slots (both slot
types and their descriptors), and dropping / creating slots when
there are constant partition changes is measurable.
Thus instead revamp the COPY FROM buffering for partitioned tables to
allow buffering inserts into multiple tables, flushing only when
limits are reached across all partition buffers. By only dropping
slots when there've been inserts into too many different partitions,
the aforementioned overhead is gone. By allowing larger batches, even
when there are frequent partition changes, we actually speed such cases
up significantly.
By using slots, COPY of very narrow rows into unlogged / temporary
tables might slow down very slightly (due to the indirect function calls).
Author: David Rowley, Andres Freund, Haribabu Kommi
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20190327054923.t3epfuewxfqdt22e@alap3.anarazel.de
2019-04-05 00:47:19 +02:00
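A minimal sketch of the buffering strategy above, with hypothetical structures rather than the actual CopyMultiInsertBuffer code: rows accumulate per target table, and a flush is triggered only when the combined count across all buffers reaches a limit, instead of on every partition switch:

```c
#define TOY_NTABLES		4		/* one buffer per open partition */
#define TOY_FLUSH_LIMIT	8		/* flush when total buffered rows hit this */

/*
 * Toy sketch of COPY-style multi-table buffering: bump the target
 * table's row count; once the total across all tables reaches the
 * limit, "flush" (here: zero) every buffer together.  Returns 1 if a
 * flush happened, 0 otherwise.
 */
static int
toy_buffer_row(int *counts, int table)
{
	int			total = 0;

	counts[table]++;
	for (int i = 0; i < TOY_NTABLES; i++)
		total += counts[i];
	if (total >= TOY_FLUSH_LIMIT)
	{
		for (int i = 0; i < TOY_NTABLES; i++)
			counts[i] = 0;		/* flush all partition buffers at once */
		return 1;
	}
	return 0;
}
```

Because the limit applies across all buffers, frequent switches between partitions no longer force tiny, per-partition flushes.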
struct CopyMultiInsertBuffer;
Faster expression evaluation and targetlist projection.
This replaces the old, recursive tree-walk based evaluation, with
non-recursive, opcode dispatch based, expression evaluation.
Projection is now implemented as part of expression evaluation.
This both leads to significant performance improvements, and makes
future just-in-time compilation of expressions easier.
The speed gains primarily come from:
- non-recursive implementation reduces stack usage / overhead
- simple sub-expressions are implemented with a single jump, without
function calls
- sharing some state between different sub-expressions
- reduced amount of indirect/hard to predict memory accesses by laying
out operation metadata sequentially; including the avoidance of
nearly all of the previously used linked lists
- more code has been moved to expression initialization, avoiding
constant re-checks at evaluation time
Future just-in-time compilation (JIT) has become easier, as
demonstrated by released patches intended to be merged in a later
release, for primarily two reasons: Firstly, due to a stricter split
between expression initialization and evaluation, less code has to be
handled by the JIT. Secondly, due to the non-recursive nature of the
generated "instructions", less performance-critical code-paths can
easily be shared between interpreted and compiled evaluation.
The new framework allows for significant future optimizations. E.g.:
- basic infrastructure to later reduce the per executor-startup
overhead of expression evaluation, by caching state in prepared
statements. That'd be helpful in OLTPish scenarios where
initialization overhead is measurable.
- optimizing the generated "code". A number of proposals for potential
work has already been made.
- optimizing the interpreter. Similarly a number of proposals have
been made here too.
The move of logic into the expression initialization step leads to some
backward-incompatible changes:
- Function permission checks are now done during expression
initialization, whereas previously they were done during
execution. In edge cases this can lead to errors being raised that
previously wouldn't have been, e.g. a NULL array being coerced to a
different array type previously didn't perform checks.
- The set of domain constraints to be checked, is now evaluated once
during expression initialization, previously it was re-built
every time a domain check was evaluated. For normal queries this
doesn't change much, but e.g. for plpgsql functions, which cache
ExprStates, the old set could stick around longer. The behavior in
this area might still change.
Author: Andres Freund, with significant changes by Tom Lane,
changes by Heikki Linnakangas
Reviewed-By: Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/20161206034955.bh33paeralxbtluv@alap3.anarazel.de
2017-03-14 23:45:36 +01:00
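The non-recursive, opcode-dispatch evaluation described above can be sketched with a toy interpreter. Names here (`DemoOpcode`, `demo_interp`) are hypothetical; the real executor uses `ExprEvalStep` arrays interpreted by `ExecInterpExpr`. The point is the shape: steps laid out sequentially in one array, evaluated by a single dispatch loop rather than a recursive tree walk.

```c
#include <assert.h>

/* Hypothetical step opcodes for a single-accumulator expression. */
typedef enum DemoOpcode
{
	DEMO_OP_CONST,				/* load a constant into the accumulator */
	DEMO_OP_ADD,				/* add operand to the accumulator */
	DEMO_OP_MUL,				/* multiply the accumulator by operand */
	DEMO_OP_DONE				/* stop and return the accumulator */
} DemoOpcode;

typedef struct DemoStep
{
	DemoOpcode	op;
	long		operand;
} DemoStep;

/*
 * Non-recursive interpreter: one dispatch per step, no per-node function
 * calls, and sequential memory access over the step array.
 */
static long
demo_interp(const DemoStep *steps)
{
	long		acc = 0;

	for (int i = 0;; i++)
	{
		switch (steps[i].op)
		{
			case DEMO_OP_CONST:
				acc = steps[i].operand;
				break;
			case DEMO_OP_ADD:
				acc += steps[i].operand;
				break;
			case DEMO_OP_MUL:
				acc *= steps[i].operand;
				break;
			case DEMO_OP_DONE:
				return acc;
		}
	}
}
```

Because the "program" is just an array of tagged structs, the same step stream can in principle be handed to either an interpreter like this or a JIT compiler, which is the split the commit message emphasizes.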
/* ----------------
 *		ExprState node
 *
 * ExprState is the top-level node for expression evaluation.
 * It contains instructions (in ->steps) to evaluate the expression.
 * ----------------
 */
typedef Datum (*ExprStateEvalFunc) (struct ExprState *expression,
									struct ExprContext *econtext,
									bool *isNull);
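The evalfunc indirection this typedef enables can be sketched as below. The names (`DemoExprState`, `demo_eval_const`) are hypothetical stand-ins, not the real executor API: the idea is that callers always invoke the expression through one function pointer, so initialization can install either a generic interpreter or a specialized fast path without changing any call site.

```c
#include <assert.h>
#include <stdbool.h>

struct DemoExprState;

/* Hypothetical analogue of ExprStateEvalFunc. */
typedef long (*DemoEvalFunc) (struct DemoExprState *state, bool *isNull);

typedef struct DemoExprState
{
	DemoEvalFunc evalfunc;		/* chosen once at init, called many times */
	long		constval;		/* state used by the fast path below */
} DemoExprState;

/* Fast path for a constant expression: no interpretation needed. */
static long
demo_eval_const(DemoExprState *state, bool *isNull)
{
	*isNull = false;
	return state->constval;
}

/* Callers only ever go through the function pointer. */
static long
demo_eval(DemoExprState *state, bool *isNull)
{
	return state->evalfunc(state, isNull);
}
```

Swapping in a JIT-compiled routine is then just a matter of storing a different pointer in `evalfunc`.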
/* Bits in ExprState->flags (see also execExpr.h for private flag bits): */
/* expression is for use with ExecQual() */
#define EEO_FLAG_IS_QUAL					(1 << 0)

typedef struct ExprState
{
	NodeTag		tag;

	uint8		flags;			/* bitmask of EEO_FLAG_* bits, see above */

	/*
	 * Storage for result value of a scalar expression, or for individual
	 * column results within expressions built by ExecBuildProjectionInfo().
	 */
#define FIELDNO_EXPRSTATE_RESNULL 2
	bool		resnull;

#define FIELDNO_EXPRSTATE_RESVALUE 3
	Datum		resvalue;

	/*
	 * If projecting a tuple result, this slot holds the result; else NULL.
	 */
#define FIELDNO_EXPRSTATE_RESULTSLOT 4
	TupleTableSlot *resultslot;

	/*
	 * Instructions to compute expression's return value.
	 */
	struct ExprEvalStep *steps;

	/*
	 * Function that actually evaluates the expression.  This can be set to
	 * different values depending on the complexity of the expression.
	 */
	ExprStateEvalFunc evalfunc;

	/* original expression tree, for debugging only */
	Expr	   *expr;

	/* private state for an evalfunc */
	void	   *evalfunc_private;
	/*
	 * XXX: following fields only needed during "compilation" (ExecInitExpr);
	 * could be thrown away afterwards.
	 */

	int			steps_len;		/* number of steps currently */
	int			steps_alloc;	/* allocated length of steps array */
	struct PlanState *parent;	/* parent PlanState node, if any */
	ParamListInfo ext_params;	/* for compiling PARAM_EXTERN nodes */
	Datum	   *innermost_caseval;
	bool	   *innermost_casenull;

	Datum	   *innermost_domainval;
	bool	   *innermost_domainnull;
} ExprState;

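The steps_len / steps_alloc pair above follows the usual grow-by-doubling append pattern used while "compiling" an expression. A minimal sketch, with a hypothetical `DemoBuf` standing in for the real structure (which appends ExprEvalStep entries rather than ints):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable step buffer mirroring steps_len / steps_alloc. */
typedef struct DemoBuf
{
	int		   *steps;
	int			steps_len;		/* number of steps currently */
	int			steps_alloc;	/* allocated length of steps array */
} DemoBuf;

/*
 * Append one step, doubling the allocation whenever it fills up.
 * (Error handling for realloc failure is omitted in this sketch.)
 */
static void
demo_append_step(DemoBuf *buf, int step)
{
	if (buf->steps_len >= buf->steps_alloc)
	{
		buf->steps_alloc = (buf->steps_alloc == 0) ? 4 : buf->steps_alloc * 2;
		buf->steps = realloc(buf->steps, buf->steps_alloc * sizeof(int));
	}
	buf->steps[buf->steps_len++] = step;
}
```

Doubling keeps the amortized cost of an append constant, which matters because an expression's step count isn't known until compilation finishes.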
/* ----------------
 *	  IndexInfo information
 *
 *		this struct holds the information needed to construct new index
 *		entries for a particular index.  Used for both index_build and
 *		retail creation of index entries.
 *
 *		NumIndexAttrs		total number of columns in this index
 *		NumIndexKeyAttrs	number of key columns in index
 *		IndexAttrNumbers	underlying-rel attribute numbers used as keys
 *							(zeroes indicate expressions). It also contains
 *							info about included columns.
 *		Expressions			expr trees for expression entries, or NIL if none
 *		ExpressionsState	exec state for expressions, or NIL if none
 *		Predicate			partial-index predicate, or NIL if none
 *		PredicateState		exec state for predicate, or NIL if none
 *		ExclusionOps		Per-column exclusion operators, or NULL if none
 *		ExclusionProcs		Underlying function OIDs for ExclusionOps
 *		ExclusionStrats		Opclass strategy numbers for ExclusionOps
 *		UniqueOps			These are like Exclusion*, but for unique indexes
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows specifying an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using an
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple, the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO can now alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
* UniqueProcs
|
|
|
|
* UniqueStrats
|
2000-07-15 00:18:02 +02:00
|
|
|
* Unique is it a unique index?
|
2007-09-20 19:56:33 +02:00
|
|
|
* ReadyForInserts is it valid for inserts?
|
2006-08-25 06:06:58 +02:00
|
|
|
* Concurrent are we doing a concurrent index build?
|
2007-09-20 19:56:33 +02:00
|
|
|
* BrokenHotChain did we detect any broken HOT chains?
|
Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
2018-02-02 19:25:55 +01:00
|
|
|
* ParallelWorkers # of workers requested (excludes leader)
|
2018-08-30 08:08:33 +02:00
|
|
|
* Am Oid of index AM
|
Allow index AMs to cache data across aminsert calls within a SQL command.
It's always been possible for index AMs to cache data across successive
amgettuple calls within a single SQL command: the IndexScanDesc.opaque
field is meant for precisely that. However, no comparable facility
exists for amortizing setup work across successive aminsert calls.
This patch adds such a feature and teaches GIN, GIST, and BRIN to use it
to amortize catalog lookups they'd previously been doing on every call.
(The other standard index AMs keep everything they need in the relcache,
so there's little to improve there.)
For GIN, the overall improvement in a statement that inserts many rows
can be as much as 10%, though it seems a bit less for the other two.
In addition, this makes a really significant difference in runtime
for CLOBBER_CACHE_ALWAYS tests, since in those builds the repeated
catalog lookups are vastly more expensive.
The reason this has been hard up to now is that the aminsert function is
not passed any useful place to cache per-statement data. What I chose to
do is to add suitable fields to struct IndexInfo and pass that to aminsert.
That's not widening the index AM API very much because IndexInfo is already
within the ken of ambuild; in fact, by passing the same info to aminsert
as to ambuild, this is really removing an inconsistency in the AM API.
Discussion: https://postgr.es/m/27568.1486508680@sss.pgh.pa.us
2017-02-09 17:52:12 +01:00
|
|
|
* AmCache private cache area for index AM
|
|
|
|
* Context memory context holding this IndexInfo
|
2007-09-20 19:56:33 +02:00
|
|
|
*
|
2018-02-02 19:25:55 +01:00
|
|
|
* ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
|
|
|
|
* during index build; they're conventionally zeroed otherwise.
|
1996-08-28 03:59:28 +02:00
|
|
|
* ----------------
|
|
|
|
*/
|
1997-09-07 07:04:48 +02:00
|
|
|
typedef struct IndexInfo
|
|
|
|
{
|
1997-09-08 04:41:22 +02:00
|
|
|
NodeTag type;
|
2018-04-07 22:00:39 +02:00
|
|
|
int ii_NumIndexAttrs; /* total number of columns in index */
|
|
|
|
int ii_NumIndexKeyAttrs; /* number of key columns in index */
|
2018-04-12 12:02:45 +02:00
|
|
|
AttrNumber ii_IndexAttrNumbers[INDEX_MAX_KEYS];
|
2003-08-04 02:43:34 +02:00
|
|
|
List *ii_Expressions; /* list of Expr */
|
|
|
|
List *ii_ExpressionsState; /* list of ExprState */
|
2002-12-13 20:46:01 +01:00
|
|
|
List *ii_Predicate; /* list of Expr */
|
Faster expression evaluation and targetlist projection.
This replaces the old recursive tree-walk based evaluation with
non-recursive, opcode-dispatch-based expression evaluation.
Projection is now implemented as part of expression evaluation.
This both leads to significant performance improvements, and makes
future just-in-time compilation of expressions easier.
The speed gains primarily come from:
- non-recursive implementation reduces stack usage / overhead
- simple sub-expressions are implemented with a single jump, without
function calls
- sharing some state between different sub-expressions
- reduced amount of indirect/hard to predict memory accesses by laying
out operation metadata sequentially; including the avoidance of
nearly all of the previously used linked lists
- more code has been moved to expression initialization, avoiding
constant re-checks at evaluation time
Future just-in-time compilation (JIT) has become easier, as
demonstrated by released patches intended to be merged in a later
release, for primarily two reasons: Firstly, due to a stricter split
between expression initialization and evaluation, less code has to be
handled by the JIT. Secondly, due to the non-recursive nature of the
generated "instructions", less performance-critical code-paths can
easily be shared between interpreted and compiled evaluation.
The new framework allows for significant future optimizations. E.g.:
- basic infrastructure to later reduce the per-executor-startup
overhead of expression evaluation by caching state in prepared
statements. That'd be helpful in OLTPish scenarios where
initialization overhead is measurable.
- optimizing the generated "code". A number of proposals for potential
work have already been made.
- optimizing the interpreter. Similarly a number of proposals have
been made here too.
The move of logic into the expression initialization step leads to some
backward-incompatible changes:
- Function permission checks are now done during expression
initialization, whereas previously they were done during
execution. In edge cases this can lead to errors being raised that
previously wouldn't have been; e.g., coercing a NULL array to a
different array type previously did not perform permission checks.
- The set of domain constraints to be checked is now evaluated once
during expression initialization; previously it was re-built
every time a domain check was evaluated. For normal queries this
doesn't change much, but e.g. for plpgsql functions, which cache
ExprStates, the old set could stick around longer. The behavior
in this area might still change.
Author: Andres Freund, with significant changes by Tom Lane,
changes by Heikki Linnakangas
Reviewed-By: Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/20161206034955.bh33paeralxbtluv@alap3.anarazel.de
2017-03-14 23:45:36 +01:00
|
|
|
ExprState *ii_PredicateState;
|
2010-02-26 03:01:40 +01:00
|
|
|
Oid *ii_ExclusionOps; /* array with one entry per column */
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
Oid *ii_ExclusionProcs; /* array with one entry per column */
|
|
|
|
uint16 *ii_ExclusionStrats; /* array with one entry per column */
|
2015-05-08 05:31:36 +02:00
|
|
|
Oid *ii_UniqueOps; /* array with one entry per column */
|
2015-05-24 03:35:49 +02:00
|
|
|
Oid *ii_UniqueProcs; /* array with one entry per column */
|
|
|
|
uint16 *ii_UniqueStrats; /* array with one entry per column */
|
2000-07-15 00:18:02 +02:00
|
|
|
bool ii_Unique;
|
2007-09-20 19:56:33 +02:00
|
|
|
bool ii_ReadyForInserts;
|
2006-08-25 06:06:58 +02:00
|
|
|
bool ii_Concurrent;
|
2007-09-20 19:56:33 +02:00
|
|
|
bool ii_BrokenHotChain;
|
2018-02-02 19:25:55 +01:00
|
|
|
int ii_ParallelWorkers;
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped in isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
Oid ii_Am;
|
2017-02-09 17:52:12 +01:00
|
|
|
void *ii_AmCache;
|
|
|
|
MemoryContext ii_Context;
|
1997-09-08 23:56:23 +02:00
|
|
|
} IndexInfo;
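As the comments above note, ii_IndexAttrNumbers holds underlying-rel attribute numbers where zero marks an expression column, and the tail of the array beyond ii_NumIndexKeyAttrs holds included columns. A standalone sketch of reading those conventions, with a reduced hypothetical struct (`index_info_sketch` and the helper names are invented here):

```c
#define INDEX_MAX_KEYS 32		/* stand-in for the real constant */
typedef short AttrNumber;		/* matches the Postgres typedef */

/* Mirror just the fields discussed above; hypothetical reduced struct. */
typedef struct index_info_sketch
{
	int			ii_NumIndexAttrs;	/* keys + included columns */
	int			ii_NumIndexKeyAttrs;	/* key columns only */
	AttrNumber	ii_IndexAttrNumbers[INDEX_MAX_KEYS];
} index_info_sketch;

/* Count expression key columns: a zero attribute number means the key
 * is computed from ii_Expressions rather than stored directly. */
static int
count_expression_columns(const index_info_sketch *ii)
{
	int			count = 0;

	for (int i = 0; i < ii->ii_NumIndexKeyAttrs; i++)
		if (ii->ii_IndexAttrNumbers[i] == 0)
			count++;
	return count;
}

/* Included (non-key) columns occupy the tail of the array. */
static int
count_included_columns(const index_info_sketch *ii)
{
	return ii->ii_NumIndexAttrs - ii->ii_NumIndexKeyAttrs;
}
```

For example, an index on (a, lower(b)) INCLUDE (c) would have NumIndexAttrs = 3, NumIndexKeyAttrs = 2, and attnums {a, 0, c}.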
|
1996-08-28 03:59:28 +02:00
|
|
|
|
2002-05-12 22:10:05 +02:00
|
|
|
/* ----------------
|
|
|
|
* ExprContext_CB
|
|
|
|
*
|
|
|
|
* List of callbacks to be called at ExprContext shutdown.
|
|
|
|
* ----------------
|
|
|
|
*/
|
|
|
|
typedef void (*ExprContextCallbackFunction) (Datum arg);
|
|
|
|
|
|
|
|
typedef struct ExprContext_CB
|
|
|
|
{
|
|
|
|
struct ExprContext_CB *next;
|
|
|
|
ExprContextCallbackFunction function;
|
|
|
|
Datum arg;
|
|
|
|
} ExprContext_CB;
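The ExprContext_CB list above is a classic intrusive singly linked callback list: registrations push onto the head, so shutdown runs them LIFO. A standalone sketch of that pattern, with `Datum` mocked as an integer and hypothetical `register_cb`/`run_callbacks` helpers standing in for the real RegisterExprContextCallback machinery:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

typedef uintptr_t Datum;		/* mocked; not the real Datum */
typedef void (*cb_function) (Datum arg);

typedef struct cb_node
{
	struct cb_node *next;
	cb_function function;
	Datum		arg;
} cb_node;

/* Push onto the head of the list, so callbacks run LIFO at shutdown. */
static void
register_cb(cb_node **list, cb_function fn, Datum arg)
{
	cb_node    *node = malloc(sizeof(cb_node));

	node->next = *list;
	node->function = fn;
	node->arg = arg;
	*list = node;
}

/* At context shutdown, call each callback once and free the list. */
static void
run_callbacks(cb_node **list)
{
	while (*list)
	{
		cb_node    *node = *list;

		*list = node->next;		/* unlink first, in case the callback
								 * registers new entries */
		node->function(node->arg);
		free(node);
	}
}

/* Tiny recorder used to observe callback order. */
static Datum cb_log[8];
static int	cb_log_len = 0;

static void
log_cb(Datum arg)
{
	cb_log[cb_log_len++] = arg;
}
```

Unlinking each node before invoking it keeps the walk safe if a callback registers further callbacks during shutdown.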
|
|
|
|
|
1996-08-28 03:59:28 +02:00
|
|
|
/* ----------------
|
1997-09-07 07:04:48 +02:00
|
|
|
* ExprContext
|
|
|
|
*
|
|
|
|
* This class holds the "current context" information
|
|
|
|
* needed to evaluate expressions for doing tuple qualifications
|
2014-05-06 18:12:18 +02:00
|
|
|
* and tuple projections. For example, if an expression refers
|
1997-09-07 07:04:48 +02:00
|
|
|
* to an attribute in the current inner tuple then we need to know
|
|
|
|
* what the current inner tuple is and so we look at the expression
|
|
|
|
* context.
|
2000-07-12 04:37:39 +02:00
|
|
|
*
|
|
|
|
* There are two memory contexts associated with an ExprContext:
|
2002-12-15 17:17:59 +01:00
|
|
|
* * ecxt_per_query_memory is a query-lifespan context, typically the same
|
2014-05-06 18:12:18 +02:00
|
|
|
* context the ExprContext node itself is allocated in. This context
|
2002-12-15 17:17:59 +01:00
|
|
|
* can be used for purposes such as storing function call cache info.
|
2000-07-12 04:37:39 +02:00
|
|
|
* * ecxt_per_tuple_memory is a short-term context for expression results.
|
|
|
|
* As the name suggests, it will typically be reset once per tuple,
|
|
|
|
* before we begin to evaluate expressions for that tuple. Each
|
|
|
|
* ExprContext normally has its very own per-tuple memory context.
|
2002-12-15 17:17:59 +01:00
|
|
|
*
|
2000-07-12 04:37:39 +02:00
|
|
|
* CurrentMemoryContext should be set to ecxt_per_tuple_memory before
|
|
|
|
* calling ExecEvalExpr() --- see ExecEvalExprSwitchContext().
|
1996-08-28 03:59:28 +02:00
|
|
|
* ----------------
|
|
|
|
*/
|
1997-09-07 07:04:48 +02:00
|
|
|
typedef struct ExprContext
|
|
|
|
{
|
2002-09-04 22:31:48 +02:00
|
|
|
NodeTag type;
|
2002-05-12 22:10:05 +02:00
|
|
|
|
2000-07-12 04:37:39 +02:00
|
|
|
/* Tuples that Var nodes in expression may refer to */
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_EXPRCONTEXT_SCANTUPLE 1
|
1998-02-26 05:46:47 +01:00
|
|
|
TupleTableSlot *ecxt_scantuple;
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_EXPRCONTEXT_INNERTUPLE 2
|
1998-02-26 05:46:47 +01:00
|
|
|
TupleTableSlot *ecxt_innertuple;
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_EXPRCONTEXT_OUTERTUPLE 3
|
1998-02-26 05:46:47 +01:00
|
|
|
TupleTableSlot *ecxt_outertuple;
|
2002-05-12 22:10:05 +02:00
|
|
|
|
2000-07-12 04:37:39 +02:00
|
|
|
/* Memory contexts for expression evaluation --- see notes above */
|
2002-09-04 22:31:48 +02:00
|
|
|
MemoryContext ecxt_per_query_memory;
|
|
|
|
MemoryContext ecxt_per_tuple_memory;
|
2002-05-12 22:10:05 +02:00
|
|
|
|
2000-07-12 04:37:39 +02:00
|
|
|
/* Values to substitute for Param nodes in expression */
|
2017-06-21 21:18:54 +02:00
|
|
|
ParamExecData *ecxt_param_exec_vals; /* for PARAM_EXEC params */
|
2002-09-04 22:31:48 +02:00
|
|
|
ParamListInfo ecxt_param_list_info; /* for other param types */
|
2002-05-12 22:10:05 +02:00
|
|
|
|
2008-12-28 19:54:01 +01:00
|
|
|
/*
|
2009-06-11 16:49:15 +02:00
|
|
|
* Values to substitute for Aggref nodes in the expressions of an Agg
|
|
|
|
* node, or for WindowFunc nodes within a WindowAgg node.
|
2008-12-28 19:54:01 +01:00
|
|
|
*/
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_EXPRCONTEXT_AGGVALUES 8
|
2008-12-28 19:54:01 +01:00
|
|
|
Datum *ecxt_aggvalues; /* precomputed values for aggs/windowfuncs */
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_EXPRCONTEXT_AGGNULLS 9
|
2008-12-28 19:54:01 +01:00
|
|
|
bool *ecxt_aggnulls; /* null flags for aggs/windowfuncs */
|
2002-05-12 22:10:05 +02:00
|
|
|
|
2004-03-17 21:48:43 +01:00
|
|
|
/* Value to substitute for CaseTestExpr nodes in expression */
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_EXPRCONTEXT_CASEDATUM 10
|
2004-03-17 21:48:43 +01:00
|
|
|
Datum caseValue_datum;
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_EXPRCONTEXT_CASENULL 11
|
2004-03-17 21:48:43 +01:00
|
|
|
bool caseValue_isNull;
|
|
|
|
|
2003-02-03 22:15:45 +01:00
|
|
|
/* Value to substitute for CoerceToDomainValue nodes in expression */
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 12
|
2002-11-15 03:50:21 +01:00
|
|
|
Datum domainValue_datum;
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_EXPRCONTEXT_DOMAINNULL 13
|
2002-11-15 03:50:21 +01:00
|
|
|
bool domainValue_isNull;
|
|
|
|
|
2006-08-04 23:33:36 +02:00
|
|
|
/* Link to containing EState (NULL if a standalone ExprContext) */
|
2002-12-15 17:17:59 +01:00
|
|
|
struct EState *ecxt_estate;
|
|
|
|
|
Support ordered-set (WITHIN GROUP) aggregates.
This patch introduces generic support for ordered-set and hypothetical-set
aggregate functions, as well as implementations of the instances defined in
SQL:2008 (percentile_cont(), percentile_disc(), rank(), dense_rank(),
percent_rank(), cume_dist()). We also added mode() though it is not in the
spec, as well as versions of percentile_cont() and percentile_disc() that
can compute multiple percentile values in one pass over the data.
Unlike the original submission, this patch puts full control of the sorting
process in the hands of the aggregate's support functions. To allow the
support functions to find out how they're supposed to sort, a new API
function AggGetAggref() is added to nodeAgg.c. This allows retrieval of
the aggregate call's Aggref node, which may have other uses beyond the
immediate need. There is also support for ordered-set aggregates to
install cleanup callback functions, so that they can be sure that
infrastructure such as tuplesort objects gets cleaned up.
In passing, make some fixes in the recently-added support for variadic
aggregates, and make some editorial adjustments in the recent FILTER
additions for aggregates. Also, simplify use of IsBinaryCoercible() by
allowing it to succeed whenever the target type is ANY or ANYELEMENT.
It was inconsistent that it dealt with other polymorphic target types
but not these.
Atri Sharma and Andrew Gierth; reviewed by Pavel Stehule and Vik Fearing,
and rather heavily editorialized upon by Tom Lane
2013-12-23 22:11:35 +01:00
|
|
|
/* Functions to call back when ExprContext is shut down or rescanned */
|
2002-05-12 22:10:05 +02:00
|
|
|
ExprContext_CB *ecxt_callbacks;
|
1997-09-08 23:56:23 +02:00
|
|
|
} ExprContext;
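The two-context discipline documented above (long-lived ecxt_per_query_memory, per-tuple-reset ecxt_per_tuple_memory) can be sketched standalone. Here a toy bump allocator stands in for MemoryContext and `toy_reset` for MemoryContextReset; all names are invented for illustration, the real API being palloc and friends.

```c
#include <stddef.h>

typedef struct toy_context
{
	char		buf[1024];
	size_t		used;
} toy_context;

static void *
toy_alloc(toy_context *cxt, size_t size)
{
	void	   *p = cxt->buf + cxt->used;

	cxt->used += (size + 7) & ~(size_t) 7;	/* keep 8-byte alignment */
	return p;
}

/* Resetting reclaims everything allocated since the last reset. */
static void
toy_reset(toy_context *cxt)
{
	cxt->used = 0;
}

/* Per-tuple loop: transient expression results go in per_tuple, which
 * is reset before each tuple (like ResetExprContext); long-lived cache
 * data would instead go in a per-query context. */
static size_t
process_tuples(toy_context *per_tuple, int ntuples)
{
	size_t		high_water = 0;

	for (int i = 0; i < ntuples; i++)
	{
		toy_reset(per_tuple);		/* discard previous tuple's scratch */
		toy_alloc(per_tuple, 64);	/* scratch for this tuple only */
		if (per_tuple->used > high_water)
			high_water = per_tuple->used;
	}
	return high_water;
}
```

The point of the pattern: memory use stays flat no matter how many tuples flow through, because per-tuple allocations never outlive the reset.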
|
1996-08-28 03:59:28 +02:00
|
|
|
|
2000-08-24 05:29:15 +02:00
|
|
|
/*
|
2017-01-19 23:12:38 +01:00
|
|
|
* Set-result status used when evaluating functions potentially returning a
|
|
|
|
* set.
|
2000-08-24 05:29:15 +02:00
|
|
|
*/
|
|
|
|
typedef enum
|
|
|
|
{
|
2001-10-28 07:26:15 +01:00
|
|
|
ExprSingleResult, /* expression does not return a set */
|
|
|
|
ExprMultipleResult, /* this result is an element of a set */
|
|
|
|
ExprEndResult /* there are no more elements in the set */
|
2000-08-24 05:29:15 +02:00
|
|
|
} ExprDoneCond;
|
|
|
|
|
2002-08-30 02:28:41 +02:00
|
|
|
/*
|
|
|
|
* Return modes for functions returning sets. Note values must be chosen
|
|
|
|
* as separate bits so that a bitmask can be formed to indicate supported
|
2008-10-31 20:37:56 +01:00
|
|
|
* modes. SFRM_Materialize_Random and SFRM_Materialize_Preferred are
|
|
|
|
* auxiliary flags about SFRM_Materialize mode, rather than separate modes.
|
2002-08-30 02:28:41 +02:00
|
|
|
*/
|
|
|
|
typedef enum
|
|
|
|
{
|
|
|
|
SFRM_ValuePerCall = 0x01, /* one value returned per call */
|
2008-10-29 01:00:39 +01:00
|
|
|
SFRM_Materialize = 0x02, /* result set instantiated in Tuplestore */
|
2017-06-21 21:18:54 +02:00
|
|
|
SFRM_Materialize_Random = 0x04, /* Tuplestore needs randomAccess */
|
2008-10-31 20:37:56 +01:00
|
|
|
SFRM_Materialize_Preferred = 0x08 /* caller prefers Tuplestore */
|
2002-08-30 02:28:41 +02:00
|
|
|
} SetFunctionReturnMode;
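Because the modes are distinct bits, a caller can OR several of them into allowedModes and a set-returning function can test which ones it may use. A standalone sketch (the enum values are re-declared locally so the example compiles alone; `choose_return_mode` is a hypothetical helper, not a real executor function):

```c
/* Re-declare the mode bits locally so the sketch stands alone. */
typedef enum
{
	SFRM_ValuePerCall = 0x01,
	SFRM_Materialize = 0x02,
	SFRM_Materialize_Random = 0x04,
	SFRM_Materialize_Preferred = 0x08
} SetFunctionReturnMode;

/* Pick a return mode the way an SRF might: honor the caller's
 * materialize preference, else fall back to whatever bit is allowed. */
static SetFunctionReturnMode
choose_return_mode(int allowedModes)
{
	if ((allowedModes & SFRM_Materialize) &&
		(allowedModes & SFRM_Materialize_Preferred))
		return SFRM_Materialize;
	if (allowedModes & SFRM_ValuePerCall)
		return SFRM_ValuePerCall;
	if (allowedModes & SFRM_Materialize)
		return SFRM_Materialize;
	/* no supported mode: real code would raise an error here */
	return (SetFunctionReturnMode) 0;
}
```

Note that SFRM_Materialize_Random and SFRM_Materialize_Preferred only qualify SFRM_Materialize; they are never a mode on their own, which is why the sketch checks them only in combination.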
|
|
|
|
|
2000-08-24 05:29:15 +02:00
|
|
|
/*
|
|
|
|
* When calling a function that might return a set (multiple rows),
|
|
|
|
* a node of this type is passed as fcinfo->resultinfo to allow
|
|
|
|
* return status to be passed back. A function returning set should
|
2002-08-30 02:28:41 +02:00
|
|
|
* raise an error if no such resultinfo is provided.
|
2000-08-24 05:29:15 +02:00
|
|
|
*/
|
|
|
|
typedef struct ReturnSetInfo
|
|
|
|
{
|
|
|
|
NodeTag type;
|
2002-08-31 01:59:46 +02:00
|
|
|
/* values set by caller: */
|
2002-08-30 02:28:41 +02:00
|
|
|
ExprContext *econtext; /* context function is being called in */
|
2002-08-31 01:59:46 +02:00
|
|
|
TupleDesc expectedDesc; /* tuple descriptor expected by caller */
|
2002-08-30 02:28:41 +02:00
|
|
|
int allowedModes; /* bitmask: return modes caller can handle */
|
2002-08-31 01:59:46 +02:00
|
|
|
/* result status from function (but pre-initialized by caller): */
|
|
|
|
SetFunctionReturnMode returnMode; /* actual return mode */
|
2002-08-30 02:28:41 +02:00
|
|
|
ExprDoneCond isDone; /* status for ValuePerCall mode */
|
2002-08-31 01:59:46 +02:00
|
|
|
/* fields filled by function in Materialize return mode: */
|
2002-09-04 22:31:48 +02:00
|
|
|
Tuplestorestate *setResult; /* holds the complete returned tuple set */
|
2002-08-31 01:59:46 +02:00
|
|
|
TupleDesc setDesc; /* actual descriptor for returned tuples */
|
2000-08-24 05:29:15 +02:00
|
|
|
} ReturnSetInfo;
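In ValuePerCall mode the function is called repeatedly, returning one element per call and signaling progress through an ExprDoneCond, until it reports ExprEndResult. A standalone sketch of that caller/callee handshake; `Datum` is mocked as an integer and `srf_state`/`next_element`/`consume_set` are invented names, not the real fmgr interface.

```c
#include <stdint.h>

typedef uintptr_t Datum;		/* mocked; not the real Datum */

typedef enum
{
	ExprSingleResult,			/* expression does not return a set */
	ExprMultipleResult,			/* this result is an element of a set */
	ExprEndResult				/* there are no more elements */
} ExprDoneCond;

/* A toy set-returning function: yields values[0..nvalues-1], then
 * reports ExprEndResult. */
typedef struct srf_state
{
	const Datum *values;
	int			nvalues;
	int			pos;
} srf_state;

static Datum
next_element(srf_state *state, ExprDoneCond *isDone)
{
	if (state->pos >= state->nvalues)
	{
		*isDone = ExprEndResult;
		return 0;
	}
	*isDone = ExprMultipleResult;
	return state->values[state->pos++];
}

/* The caller's loop: keep calling until ExprEndResult, summing results. */
static Datum
consume_set(srf_state *state)
{
	Datum		total = 0;
	ExprDoneCond isDone;

	for (;;)
	{
		Datum		d = next_element(state, &isDone);

		if (isDone == ExprEndResult)
			break;
		total += d;
	}
	return total;
}
```

In Materialize mode this loop disappears: the function instead fills setResult/setDesc in one call and the caller reads the Tuplestore.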
|
|
|
|
|
1996-08-28 03:59:28 +02:00
|
|
|
/* ----------------
 *		ProjectionInfo node information
 *
 *		This is all the information needed to perform projections ---
 *		that is, form new tuples by evaluation of targetlist expressions.
 *		Nodes which need to do projections create one of these.
 *
 *		The target tuple slot is kept in ProjectionInfo->pi_state.resultslot.
 *		ExecProject() evaluates the tlist, forms a tuple, and stores it
 *		in the given slot.  Note that the result will be a "virtual" tuple
 *		unless ExecMaterializeSlot() is then called to force it to be
 *		converted to a physical tuple.  The slot must have a tupledesc
 *		that matches the output of the tlist!
 * ----------------
 */
typedef struct ProjectionInfo
{
	NodeTag		type;
	/* instructions to evaluate projection */
	ExprState	pi_state;
	/* expression context in which to evaluate expression */
	ExprContext *pi_exprContext;
} ProjectionInfo;
/* ----------------
 *	  JunkFilter
 *
 *	  This class is used to store information regarding junk attributes.
 *	  A junk attribute is an attribute in a tuple that is needed only for
 *	  storing intermediate information in the executor, and does not belong
 *	  in emitted tuples.  For example, when we do an UPDATE query,
 *	  the planner adds a "junk" entry to the targetlist so that the tuples
 *	  returned to ExecutePlan() contain an extra attribute: the ctid of
 *	  the tuple to be updated.  This is needed to do the update, but we
 *	  don't want the ctid to be part of the stored new tuple!  So, we
 *	  apply a "junk filter" to remove the junk attributes and form the
 *	  real output tuple.  The junkfilter code also provides routines to
 *	  extract the values of the junk attribute(s) from the input tuple.
 *
 *	  targetList:		the original target list (including junk attributes).
 *	  cleanTupType:		the tuple descriptor for the "clean" tuple (with
 *						junk attributes removed).
 *	  cleanMap:			A map with the correspondence between the non-junk
 *						attribute numbers of the "original" tuple and the
 *						attribute numbers of the "clean" tuple.
 *	  resultSlot:		tuple slot used to hold cleaned tuple.
 *	  junkAttNo:		not used by junkfilter code.  Can be used by caller
 *						to remember the attno of a specific junk attribute
 *						(nodeModifyTable.c keeps the "ctid" or "wholerow"
 *						attno here).
 * ----------------
 */
typedef struct JunkFilter
{
	NodeTag		type;
	List	   *jf_targetList;
	TupleDesc	jf_cleanTupType;
	AttrNumber *jf_cleanMap;
	TupleTableSlot *jf_resultSlot;
	AttrNumber	jf_junkAttNo;
} JunkFilter;
/*
 * OnConflictSetState
 *
 * Executor state of an ON CONFLICT DO UPDATE operation.
 */
typedef struct OnConflictSetState
{
	NodeTag		type;

	TupleTableSlot *oc_Existing;	/* slot to store existing target tuple in */
	TupleTableSlot *oc_ProjSlot;	/* CONFLICT ... SET ... projection target */
	ProjectionInfo *oc_ProjInfo;	/* for ON CONFLICT DO UPDATE SET */
	ExprState  *oc_WhereClause; /* state for the WHERE clause */
} OnConflictSetState;
/*
 * ResultRelInfo
 *
 * Whenever we update an existing relation, we have to update indexes on the
 * relation, and perhaps also fire triggers.  ResultRelInfo holds all the
 * information needed about a result relation, including indexes.
 *
 * Normally, a ResultRelInfo refers to a table that is in the query's
 * range table; then ri_RangeTableIndex is the RT index and ri_RelationDesc
 * is just a copy of the relevant es_relations[] entry.  But sometimes,
 * in ResultRelInfos used only for triggers, ri_RangeTableIndex is zero
 * and ri_RelationDesc is a separately-opened relcache pointer that needs
 * to be separately closed.  See ExecGetTriggerResultRel.
 */
typedef struct ResultRelInfo
{
	NodeTag		type;

	/* result relation's range table index, or 0 if not in range table */
	Index		ri_RangeTableIndex;

	/* relation descriptor for result relation */
	Relation	ri_RelationDesc;

	/* # of indices existing on result relation */
	int			ri_NumIndices;

	/* array of relation descriptors for indices */
	RelationPtr ri_IndexRelationDescs;

	/* array of key/attr info for indices */
	IndexInfo **ri_IndexRelationInfo;

	/* triggers to be fired, if any */
	TriggerDesc *ri_TrigDesc;

	/* cached lookup info for trigger functions */
	FmgrInfo   *ri_TrigFunctions;

	/* array of trigger WHEN expr states */
	ExprState **ri_TrigWhenExprs;

	/* optional runtime measurements for triggers */
	Instrumentation *ri_TrigInstrument;

	/* On-demand created slots for triggers / returning processing */
	TupleTableSlot *ri_ReturningSlot;	/* for trigger output tuples */
	TupleTableSlot *ri_TrigOldSlot; /* for a trigger's old tuple */
	TupleTableSlot *ri_TrigNewSlot; /* for a trigger's new tuple */

	/* FDW callback functions, if foreign table */
	struct FdwRoutine *ri_FdwRoutine;

	/* available to save private state of FDW */
	void	   *ri_FdwState;

	/* true when modifying foreign table directly */
	bool		ri_usesFdwDirectModify;

	/* list of WithCheckOption's to be checked */
	List	   *ri_WithCheckOptions;

	/* list of WithCheckOption expr states */
	List	   *ri_WithCheckOptionExprs;

	/* array of constraint-checking expr states */
	ExprState **ri_ConstraintExprs;

	/* array of stored generated columns expr states */
	ExprState **ri_GeneratedExprs;

	/* for removing junk attributes from tuples */
	JunkFilter *ri_junkFilter;

	/* list of RETURNING expressions */
	List	   *ri_returningList;

	/* for computing a RETURNING list */
	ProjectionInfo *ri_projectReturning;

	/* list of arbiter indexes to use to check conflicts */
	List	   *ri_onConflictArbiterIndexes;

	/* ON CONFLICT evaluation state */
	OnConflictSetState *ri_onConflict;

	/* partition check expression */
	List	   *ri_PartitionCheck;

	/* partition check expression state */
	ExprState  *ri_PartitionCheckExpr;

	/* relation descriptor for root partitioned table */
	Relation	ri_PartitionRoot;

	/* Additional information specific to partition tuple routing */
	struct PartitionRoutingInfo *ri_PartitionInfo;

	/* For use by copy.c when performing multi-inserts */
	struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
/* ----------------
|
1997-09-07 07:04:48 +02:00
|
|
|
* EState information
|
1996-08-28 03:59:28 +02:00
|
|
|
*
|
2002-12-15 17:17:59 +01:00
|
|
|
* Master working state for an Executor invocation
|
1997-09-07 07:04:48 +02:00
|
|
|
* ----------------
|
1996-08-28 03:59:28 +02:00
|
|
|
*/
|
1997-09-07 07:04:48 +02:00
|
|
|
typedef struct EState
|
|
|
|
{
|
1999-05-25 18:15:34 +02:00
|
|
|
NodeTag type;
|
2002-12-15 17:17:59 +01:00
|
|
|
|
|
|
|
/* Basic state for all query types: */
|
2003-08-04 02:43:34 +02:00
|
|
|
ScanDirection es_direction; /* current scan direction */
|
2002-12-15 17:17:59 +01:00
|
|
|
Snapshot es_snapshot; /* time qual to use */
|
2004-08-29 07:07:03 +02:00
|
|
|
Snapshot es_crosscheck_snapshot; /* crosscheck time qual for RI */
|
2007-02-22 23:00:26 +01:00
|
|
|
List *es_range_table; /* List of RangeTblEntry */
|
2018-10-04 21:48:17 +02:00
|
|
|
Index es_range_table_size; /* size of the range table arrays */
|
|
|
|
Relation *es_relations; /* Array of per-range-table-entry Relation
|
2018-10-04 20:03:37 +02:00
|
|
|
* pointers, or NULL if not yet opened */
|
2018-10-08 16:41:34 +02:00
|
|
|
struct ExecRowMark **es_rowmarks; /* Array of per-range-table-entry
|
|
|
|
* ExecRowMarks, or NULL if none */
|
Re-implement EvalPlanQual processing to improve its performance and eliminate
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
2009-10-26 03:26:45 +01:00
|
|
|
PlannedStmt *es_plannedstmt; /* link to top of plan tree */
|
2017-02-22 07:45:17 +01:00
|
|
|
const char *es_sourceText; /* Source text from QueryDesc */
|
2002-12-15 17:17:59 +01:00
|
|
|
|
2009-10-12 20:10:51 +02:00
|
|
|
JunkFilter *es_junkFilter; /* top-level junk filter, if any */
|
|
|
|
|
2007-11-30 22:22:54 +01:00
|
|
|
/* If query can insert/delete tuples, the command ID to mark them with */
|
|
|
|
CommandId es_output_cid;
|
|
|
|
|
2011-02-26 00:56:23 +01:00
|
|
|
/* Info about target table(s) for insert/update/delete queries: */
|
2001-03-22 05:01:46 +01:00
|
|
|
ResultRelInfo *es_result_relations; /* array of ResultRelInfos */
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
int es_num_result_relations;	/* length of array */
ResultRelInfo *es_result_relation_info; /* currently active array elt */

/*
 * Info about the partition root table(s) for insert/update/delete queries
 * targeting partitioned tables.  Only leaf partitions are mentioned in
 * es_result_relations, but we need access to the roots for firing
 * triggers and for runtime tuple routing.
 */
ResultRelInfo *es_root_result_relations;	/* array of ResultRelInfos */
int es_num_root_result_relations;	/* length of the array */
Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.
We still require AccessExclusiveLock on the partition itself, because
otherwise an insert that violates the newly-imposed partition
constraint could be in progress at the same time that we're changing
that constraint; only the lock level on the parent relation is
weakened.
To make this safe, we have to cope with (at least) three separate
problems. First, relevant DDL might commit while we're in the process
of building a PartitionDesc. If so, find_inheritance_children() might
see a new partition while the RELOID system cache still has the old
partition bound cached, and even before invalidation messages have
been queued. To fix that, if we see that the pg_class tuple seems to
be missing or to have a null relpartbound, refetch the value directly
from the table. We can't get the wrong value, because DETACH PARTITION
still requires AccessExclusiveLock throughout; if we ever want to
change that, this will need more thought. In testing, I found it quite
difficult to hit even the null-relpartbound case; the race condition
is extremely tight, but the theoretical risk is there.
Second, successive calls to RelationGetPartitionDesc might not return
the same answer. The query planner will get confused if looking up the
PartitionDesc for a particular relation does not return a consistent
answer for the entire duration of query planning. Likewise, query
execution will get confused if the same relation seems to have a
different PartitionDesc at different times. Invent a new
PartitionDirectory concept and use it to ensure consistency. This
ensures that a single invocation of either the planner or the executor
sees the same view of the PartitionDesc from beginning to end, but it
does not guarantee that the planner and the executor see the same
view. Since this allows pointers to old PartitionDesc entries to
survive even after a relcache rebuild, also postpone removing the old
PartitionDesc entry until we're certain no one is using it.
For the most part, it seems to be OK for the planner and executor to
have different views of the PartitionDesc, because the executor will
just ignore any concurrently added partitions which were unknown at
plan time; those partitions won't be part of the inheritance
expansion, but invalidation messages will trigger replanning at some
point. Normally, this happens by the time the very next command is
executed, but if the next command acquires no locks and executes a
prepared query, it can manage not to notice until a new transaction is
started. We might want to tighten that up, but it's material for a
separate patch. There would still be a small window where a query
that started just after an ATTACH PARTITION command committed might
fail to notice its results -- but only if the command starts before
the commit has been acknowledged to the user. All in all, the warts
here around serializability seem small enough to be worth accepting
for the considerable advantage of being able to add partitions without
a full table lock.
Although in general the consequences of new partitions showing up
between planning and execution are limited to the query not noticing
the new partitions, run-time partition pruning will get confused in
that case, so that's the third problem that this patch fixes.
Run-time partition pruning assumes that indexes into the PartitionDesc
are stable between planning and execution. So, add code so that if
new partitions are added between plan time and execution time, the
indexes stored in the subplan_map[] and subpart_map[] arrays within
the plan's PartitionedRelPruneInfo get adjusted accordingly. There
does not seem to be a simple way to generalize this scheme to cope
with partitions that are removed, mostly because they could then get
added back again with different bounds, but it works OK for added
partitions.
This code does not try to ensure that every backend participating in
a parallel query sees the same view of the PartitionDesc. That
currently doesn't matter, because we never pass PartitionDesc
indexes between backends. Each backend will ignore the concurrently
added partitions which it notices, and it doesn't matter if different
backends are ignoring different sets of concurrently added partitions.
If in the future that matters, for example because we allow writes in
parallel query and want all participants to do tuple routing to the same
set of partitions, the PartitionDirectory concept could be improved to
share PartitionDescs across backends. There is a draft patch to
serialize and restore PartitionDescs on the thread where this patch
was discussed, which may be a useful place to start.
Patch by me. Thanks to Alvaro Herrera, David Rowley, Simon Riggs,
Amit Langote, and Michael Paquier for discussion, and to Alvaro
Herrera for some review.
Discussion: http://postgr.es/m/CA+Tgmobt2upbSocvvDej3yzokd7AkiT+PvgFH+a9-5VV1oJNSQ@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com
2019-03-07 17:13:12 +01:00
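The subplan_map[]/subpart_map[] adjustment described above can be sketched as follows. This is a simplified illustration under stated assumptions (remap_subplans and the OID arrays are hypothetical names; the real adjustment operates on the plan's PartitionedRelPruneInfo): since partitions can only be added concurrently, never removed, the execution-time PartitionDesc contains the plan-time partitions in order, possibly with new entries spliced in, and those new entries simply get no subplan.

```c
#include <assert.h>

/* Hypothetical sketch: plan time knew nplan partitions with OIDs
 * plan_oids[]; at execution time the PartitionDesc has nexec >= nplan
 * entries in exec_oids[] (same relative order, possibly with new OIDs
 * spliced in, since concurrently added partitions only add entries).
 * Build new_map[] so new_map[i] is the plan-time subplan index for
 * execution partition i, or -1 for a concurrently added partition,
 * which the executor simply ignores. */
static void
remap_subplans(const unsigned *plan_oids, int nplan,
               const unsigned *exec_oids, int nexec,
               int *new_map)
{
    int pi = 0;

    for (int i = 0; i < nexec; i++)
    {
        if (pi < nplan && exec_oids[i] == plan_oids[pi])
            new_map[i] = pi++;      /* partition known at plan time */
        else
            new_map[i] = -1;        /* added after planning: no subplan */
    }
}
```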
PartitionDirectory es_partition_directory;	/* for PartitionDesc lookup */

/*
 * The following list contains ResultRelInfos created by the tuple routing
 * code for partitions that don't already have one.
 */
List *es_tuple_routing_result_relations;

/* Stuff used for firing triggers: */
List *es_trig_target_relations; /* trigger-only ResultRelInfos */
/* Parameter info: */
ParamListInfo es_param_list_info;	/* values of external params */
ParamExecData *es_param_exec_vals;	/* values of internal params */
QueryEnvironment *es_queryEnv; /* query environment */
/* Other working state: */
MemoryContext es_query_cxt; /* per-query context in which EState lives */

List *es_tupleTable;	/* List of TupleTableSlots */
Widen query numbers-of-tuples-processed counters to uint64.
This patch widens SPI_processed, EState's es_processed field, PortalData's
portalPos field, FuncCallContext's call_cntr and max_calls fields,
ExecutorRun's count argument, PortalRunFetch's result, and the max number
of rows in a SPITupleTable to uint64, and deals with (I hope) all the
ensuing fallout. Some of these values were declared uint32 before, and
others "long".
I also removed PortalData's posOverflow field, since that logic seems
pretty useless given that portalPos is now always 64 bits.
The user-visible results are that command tags for SELECT etc will
correctly report tuple counts larger than 4G, as will plpgsql's
GET DIAGNOSTICS ... ROW_COUNT command. Queries processing more tuples
than that are still not exactly the norm, but they're becoming more
common.
Most values associated with FETCH/MOVE distances, such as PortalRun's count
argument and the count argument of most SPI functions that have one, remain
declared as "long". It's not clear whether it would be worth promoting
those to int64; but it would definitely be a large dollop of additional
API churn on top of this, and it would only help 32-bit platforms which
seem relatively less likely to see any benefit.
Andreas Scherbaum, reviewed by Christian Ullrich, additional hacking by me
2016-03-12 22:05:10 +01:00
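The overflow this widening fixes is plain arithmetic and can be demonstrated directly. A minimal sketch (not executor code): a tuple count above 2^32 - 1 silently wraps in a 32-bit counter, while a uint64 counter reports it exactly, which is what the widened command-tag reporting relies on.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustration of the overflow the commit fixes: a count of five
 * billion tuples wraps in 32 bits but is exact in 64 bits.  The
 * report() helper here is hypothetical; it just stands in for
 * command-tag formatting of a 64-bit processed-tuples counter. */
static void
report(uint64_t ntuples, char *buf, size_t bufsize)
{
    /* with a 64-bit counter the reported count is simply exact */
    snprintf(buf, bufsize, "SELECT %llu", (unsigned long long) ntuples);
}
```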
uint64 es_processed;	/* # of tuples processed */

int es_top_eflags;	/* eflags passed to ExecutorStart */
int es_instrument;	/* OR of InstrumentOption flags */
bool es_finished;	/* true when ExecutorFinish is done */

List *es_exprcontexts;	/* List of ExprContexts within EState */
List *es_subplanstates; /* List of PlanState for SubPlans */
List *es_auxmodifytables; /* List of secondary ModifyTableStates */
/*
 * this ExprContext is for per-output-tuple operations, such as constraint
 * checks and index-value computations.  It will be reset for each output
 * tuple.  Note that it will be created only if needed.
 */
ExprContext *es_per_tuple_exprcontext;
/*
 * These fields are for re-evaluating plan quals when an updated tuple is
Store tuples for EvalPlanQual in slots, rather than as HeapTuples.
For the upcoming pluggable table access methods it's quite
inconvenient to store tuples as HeapTuples, as that'd require
converting tuples from their native format into HeapTuples. Instead
use slots to manage epq tuples.
To fit into that scheme, change the foreign data wrapper callback
RefetchForeignRow, to store the tuple in a slot. Insist on using the
caller provided slot, so it conveniently can be stored in the
corresponding EPQ slot. As there is no in core user of
RefetchForeignRow, that change was done blindly, but we plan to test
that soon.
To avoid duplicating that work for row locks, move row locks to just
directly use the EPQ slots - it previously temporarily stored tuples
in LockRowsState.lr_curtuples, but that doesn't seem beneficial, given
we'd possibly end up with a significant number of additional slots.
What es_epqTupleSet[rti - 1] used to record is now checked via
es_epqTupleSlot[rti - 1] != NULL, as that is distinguishable from a
slot containing an empty tuple.
Author: Andres Freund, Haribabu Kommi, Ashutosh Bapat
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-01 19:37:57 +01:00
 * substituted in READ COMMITTED mode.  es_epqTupleSlot[] contains test
 * tuples that scan plan nodes should return instead of whatever they'd
 * normally return, or an empty slot if there is nothing to return;
 * es_epqTupleSlot[] is non-NULL if a particular array entry is valid; and
 * es_epqScanDone[] is state to remember if the tuple has been returned
 * already.  Arrays are of size es_range_table_size and are indexed by
 * scan node scanrelid - 1.
*/
TupleTableSlot **es_epqTupleSlot; /* array of EPQ substitute tuples */
bool *es_epqScanDone; /* true if EPQ tuple has been fetched */
Allow bitmap scans to operate as index-only scans when possible.
If we don't have to return any columns from heap tuples, and there's
no need to recheck qual conditions, and the heap page is all-visible,
then we can skip fetching the heap page altogether.
Skip prefetching pages too, when possible, on the assumption that the
recheck flag will remain the same from one page to the next. While that
assumption is hardly bulletproof, it seems like a good bet most of the
time, and better than prefetching pages we don't need.
This commit installs the executor infrastructure, but doesn't change
any planner cost estimates, thus possibly causing bitmap scans to
not be chosen in cases where this change renders them the best choice.
I (tgl) am not entirely convinced that we need to account for this
behavior in the planner, because I think typically the bitmap scan would
get chosen anyway if it's the best bet. In any case the submitted patch
took way too many shortcuts, resulting in too many clearly-bad choices,
to be committable.
Alexander Kuzmenkov, reviewed by Alexey Chernyshov, and whacked around
rather heavily by me.
Discussion: https://postgr.es/m/239a8955-c0fc-f506-026d-c837e86c827b@postgrespro.ru
2017-11-01 22:38:12 +01:00
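The skip decision described above condenses to a three-way conjunction. A hedged sketch (can_skip_heap_fetch is an illustrative name, not the actual executor function): the bitmap heap scan may skip fetching a heap page only when it needs no heap columns, needs no qual recheck, and the page is all-visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical condensation of the decision described above: a bitmap
 * heap scan may skip fetching the heap page only when it needs no heap
 * columns, no qual recheck, and the page is all-visible. */
static bool
can_skip_heap_fetch(bool need_columns, bool need_recheck, bool page_all_visible)
{
    return !need_columns && !need_recheck && page_all_visible;
}
```

If any of the three conditions fails the heap fetch (and hence the prefetch) must proceed as before, which is why the commit only skips pages opportunistically.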
bool es_use_parallel_mode; /* can we use parallel workers? */
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
/*
 * JIT information.  es_jit_flags indicates whether JIT should be performed
 * and with which options.  es_jit is created on-demand when JITing is
 * performed.
 *
 * es_jit_worker_instr is the combined, on-demand allocated,
 * instrumentation from all workers.  The leader's instrumentation is kept
 * separate, and is combined on demand by ExplainPrintJITSummary().
 */
int es_jit_flags;
struct JitContext *es_jit;
struct JitInstrumentation *es_jit_worker_instr;
} EState;
/*
* ExecRowMark -
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many records as there
exist updated copies of the tuple.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
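The compatibility claim at the heart of this commit — FOR KEY SHARE and FOR NO KEY UPDATE do not block each other — can be written down as the documented row-level lock conflict table. A sketch (enum and function names are illustrative, not the server's internal representation):

```c
#include <assert.h>
#include <stdbool.h>

/* The four tuple lock modes after this commit, weakest to strongest. */
typedef enum
{
    KEY_SHARE,      /* SELECT FOR KEY SHARE */
    SHARE,          /* SELECT FOR SHARE */
    NO_KEY_UPDATE,  /* SELECT FOR NO KEY UPDATE */
    UPDATE_         /* SELECT FOR UPDATE */
} TupleLockMode;

static bool
modes_conflict(TupleLockMode a, TupleLockMode b)
{
    /* Symmetric conflict table, per the documented row-level lock rules:
     * KEY SHARE conflicts only with UPDATE; SHARE with NO KEY UPDATE and
     * UPDATE; NO KEY UPDATE with SHARE, itself, and UPDATE; UPDATE with
     * everything. */
    static const bool conflict[4][4] = {
        /*                 KS     S      NKU    U   */
        /* KEY_SHARE  */  {false, false, false, true},
        /* SHARE      */  {false, false, true,  true},
        /* NO_KEY_UPD */  {false, true,  true,  true},
        /* UPDATE     */  {true,  true,  true,  true},
    };

    return conflict[a][b];
}
```

The (KEY_SHARE, NO_KEY_UPDATE) = false entry is the concurrency improvement: foreign key triggers taking FOR KEY SHARE no longer block non-key UPDATEs.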
* runtime representation of FOR [KEY] UPDATE/SHARE clauses
*
|
Allow foreign tables to participate in inheritance.
Foreign tables can now be inheritance children, or parents. Much of the
system was already ready for this, but we had to fix a few things of
course, mostly in the area of planner and executor handling of row locks.
As side effects of this, allow foreign tables to have NOT VALID CHECK
constraints (and hence to accept ALTER ... VALIDATE CONSTRAINT), and to
accept ALTER SET STORAGE and ALTER SET WITH/WITHOUT OIDS. Continuing to
disallow these things would've required bizarre and inconsistent special
cases in inheritance behavior. Since foreign tables don't enforce CHECK
constraints anyway, a NOT VALID one is a complete no-op, but that doesn't
mean we shouldn't allow it. And it's possible that some FDWs might have
use for SET STORAGE or SET WITH OIDS, though doubtless they will be no-ops
for most.
An additional change in support of this is that when a ModifyTable node
has multiple target tables, they will all now be explicitly identified
in EXPLAIN output, for example:
Update on pt1 (cost=0.00..321.05 rows=3541 width=46)
Update on pt1
Foreign Update on ft1
Foreign Update on ft2
Update on child3
-> Seq Scan on pt1 (cost=0.00..0.00 rows=1 width=46)
-> Foreign Scan on ft1 (cost=100.00..148.03 rows=1170 width=46)
-> Foreign Scan on ft2 (cost=100.00..148.03 rows=1170 width=46)
-> Seq Scan on child3 (cost=0.00..25.00 rows=1200 width=46)
This was done mainly to provide an unambiguous place to attach "Remote SQL"
fields, but it is useful for inherited updates even when no foreign tables
are involved.
Shigeru Hanada and Etsuro Fujita, reviewed by Ashutosh Bapat and Kyotaro
Horiguchi, some additional hacking by me
2015-03-22 18:53:11 +01:00
|
|
|
* When doing UPDATE, DELETE, or SELECT FOR [KEY] UPDATE/SHARE, we will have an
|
Re-implement EvalPlanQual processing to improve its performance and eliminate
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
2009-10-26 03:26:45 +01:00
|
|
|
* ExecRowMark for each non-target relation in the query (except inheritance
|
Allow foreign tables to participate in inheritance.
Foreign tables can now be inheritance children, or parents. Much of the
system was already ready for this, but we had to fix a few things of
course, mostly in the area of planner and executor handling of row locks.
As side effects of this, allow foreign tables to have NOT VALID CHECK
constraints (and hence to accept ALTER ... VALIDATE CONSTRAINT), and to
accept ALTER SET STORAGE and ALTER SET WITH/WITHOUT OIDS. Continuing to
disallow these things would've required bizarre and inconsistent special
cases in inheritance behavior. Since foreign tables don't enforce CHECK
constraints anyway, a NOT VALID one is a complete no-op, but that doesn't
mean we shouldn't allow it. And it's possible that some FDWs might have
use for SET STORAGE or SET WITH OIDS, though doubtless they will be no-ops
for most.
An additional change in support of this is that when a ModifyTable node
has multiple target tables, they will all now be explicitly identified
in EXPLAIN output, for example:
Update on pt1 (cost=0.00..321.05 rows=3541 width=46)
Update on pt1
Foreign Update on ft1
Foreign Update on ft2
Update on child3
-> Seq Scan on pt1 (cost=0.00..0.00 rows=1 width=46)
-> Foreign Scan on ft1 (cost=100.00..148.03 rows=1170 width=46)
-> Foreign Scan on ft2 (cost=100.00..148.03 rows=1170 width=46)
-> Seq Scan on child3 (cost=0.00..25.00 rows=1200 width=46)
This was done mainly to provide an unambiguous place to attach "Remote SQL"
fields, but it is useful for inherited updates even when no foreign tables
are involved.
Shigeru Hanada and Etsuro Fujita, reviewed by Ashutosh Bapat and Kyotaro
Horiguchi, some additional hacking by me
2015-03-22 18:53:11 +01:00
|
|
|
* parent RTEs, which can be ignored at runtime). Virtual relations such as
|
|
|
|
* subqueries-in-FROM will have an ExecRowMark with relation == NULL. See
|
|
|
|
* PlanRowMark for details about most of the fields. In addition to fields
|
Add support for doing late row locking in FDWs.
Previously, FDWs could only do "early row locking", that is lock a row as
soon as it's fetched, even though local restriction/join conditions might
discard the row later. This patch adds callbacks that allow FDWs to do
late locking in the same way that it's done for regular tables.
To make use of this feature, an FDW must support the "ctid" column as a
unique row identifier. Currently, since ctid has to be of type TID,
the feature is of limited use, though in principle it could be used by
postgres_fdw. We may eventually allow FDWs to specify another data type
for ctid, which would make it possible for more FDWs to use this feature.
This commit does not modify postgres_fdw to use late locking. We've
tested some prototype code for that, but it's not in committable shape,
and besides it's quite unclear whether it actually makes sense to do late
locking against a remote server. The extra round trips required are likely
to outweigh any benefit from improved concurrency.
Etsuro Fujita, reviewed by Ashutosh Bapat, and hacked up a lot by me
2015-05-12 20:10:10 +02:00
|
|
|
* directly derived from PlanRowMark, we store an activity flag (to denote
|
|
|
|
* inactive children of inheritance trees), curCtid, which is used by the
|
|
|
|
* WHERE CURRENT OF code, and ermExtra, which is available for use by the plan
|
|
|
|
* node that sources the relation (e.g., for a foreign table the FDW can use
|
|
|
|
* ermExtra to hold information).
|
Re-implement EvalPlanQual processing to improve its performance and eliminate
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
2009-10-26 03:26:45 +01:00
|
|
|
*
|
2018-10-08 16:41:34 +02:00
|
|
|
* EState->es_rowmarks is an array of these structs, indexed by RT index,
|
|
|
|
* with NULLs for irrelevant RT indexes. es_rowmarks itself is NULL if
|
|
|
|
* there are no rowmarks.
|
2008-11-15 20:43:47 +01:00
|
|
|
*/
|
2005-12-02 21:03:42 +01:00
|
|
|
typedef struct ExecRowMark
|
|
|
|
{
|
Re-implement EvalPlanQual processing to improve its performance and eliminate
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
2009-10-26 03:26:45 +01:00
|
|
|
Relation relation; /* opened and suitably locked relation */
|
Allow foreign tables to participate in inheritance.
Foreign tables can now be inheritance children, or parents. Much of the
system was already ready for this, but we had to fix a few things of
course, mostly in the area of planner and executor handling of row locks.
As side effects of this, allow foreign tables to have NOT VALID CHECK
constraints (and hence to accept ALTER ... VALIDATE CONSTRAINT), and to
accept ALTER SET STORAGE and ALTER SET WITH/WITHOUT OIDS. Continuing to
disallow these things would've required bizarre and inconsistent special
cases in inheritance behavior. Since foreign tables don't enforce CHECK
constraints anyway, a NOT VALID one is a complete no-op, but that doesn't
mean we shouldn't allow it. And it's possible that some FDWs might have
use for SET STORAGE or SET WITH OIDS, though doubtless they will be no-ops
for most.
An additional change in support of this is that when a ModifyTable node
has multiple target tables, they will all now be explicitly identified
in EXPLAIN output, for example:
Update on pt1 (cost=0.00..321.05 rows=3541 width=46)
Update on pt1
Foreign Update on ft1
Foreign Update on ft2
Update on child3
-> Seq Scan on pt1 (cost=0.00..0.00 rows=1 width=46)
-> Foreign Scan on ft1 (cost=100.00..148.03 rows=1170 width=46)
-> Foreign Scan on ft2 (cost=100.00..148.03 rows=1170 width=46)
-> Seq Scan on child3 (cost=0.00..25.00 rows=1200 width=46)
This was done mainly to provide an unambiguous place to attach "Remote SQL"
fields, but it is useful for inherited updates even when no foreign tables
are involved.
Shigeru Hanada and Etsuro Fujita, reviewed by Ashutosh Bapat and Kyotaro
Horiguchi, some additional hacking by me
2015-03-22 18:53:11 +01:00
|
|
|
Oid relid; /* its OID (or InvalidOid, if subquery) */
|
2005-12-02 21:03:42 +01:00
|
|
|
Index rti; /* its range table index */
|
2008-11-15 20:43:47 +01:00
|
|
|
Index prti; /* parent range table index, if child */
|
2011-02-10 05:27:07 +01:00
|
|
|
Index rowmarkId; /* unique identifier for resjunk columns */
|
2010-02-26 03:01:40 +01:00
|
|
|
RowMarkType markType; /* see enum in nodes/plannodes.h */
|
Add support for doing late row locking in FDWs.
Previously, FDWs could only do "early row locking", that is lock a row as
soon as it's fetched, even though local restriction/join conditions might
discard the row later. This patch adds callbacks that allow FDWs to do
late locking in the same way that it's done for regular tables.
To make use of this feature, an FDW must support the "ctid" column as a
unique row identifier. Currently, since ctid has to be of type TID,
the feature is of limited use, though in principle it could be used by
postgres_fdw. We may eventually allow FDWs to specify another data type
for ctid, which would make it possible for more FDWs to use this feature.
This commit does not modify postgres_fdw to use late locking. We've
tested some prototype code for that, but it's not in committable shape,
and besides it's quite unclear whether it actually makes sense to do late
locking against a remote server. The extra round trips required are likely
to outweigh any benefit from improved concurrency.
Etsuro Fujita, reviewed by Ashutosh Bapat, and hacked up a lot by me
2015-05-12 20:10:10 +02:00
|
|
|
LockClauseStrength strength; /* LockingClause's strength, or LCS_NONE */
|
2014-10-07 22:23:34 +02:00
|
|
|
LockWaitPolicy waitPolicy; /* NOWAIT and SKIP LOCKED */
|
Add support for doing late row locking in FDWs.
Previously, FDWs could only do "early row locking", that is lock a row as
soon as it's fetched, even though local restriction/join conditions might
discard the row later. This patch adds callbacks that allow FDWs to do
late locking in the same way that it's done for regular tables.
To make use of this feature, an FDW must support the "ctid" column as a
unique row identifier. Currently, since ctid has to be of type TID,
the feature is of limited use, though in principle it could be used by
postgres_fdw. We may eventually allow FDWs to specify another data type
for ctid, which would make it possible for more FDWs to use this feature.
This commit does not modify postgres_fdw to use late locking. We've
tested some prototype code for that, but it's not in committable shape,
and besides it's quite unclear whether it actually makes sense to do late
locking against a remote server. The extra round trips required are likely
to outweigh any benefit from improved concurrency.
Etsuro Fujita, reviewed by Ashutosh Bapat, and hacked up a lot by me
2015-05-12 20:10:10 +02:00
|
|
|
bool ermActive; /* is this mark relevant for current tuple? */
|
2011-01-13 02:47:02 +01:00
|
|
|
ItemPointerData curCtid; /* ctid of currently locked tuple, if any */
|
Add support for doing late row locking in FDWs.
Previously, FDWs could only do "early row locking", that is lock a row as
soon as it's fetched, even though local restriction/join conditions might
discard the row later. This patch adds callbacks that allow FDWs to do
late locking in the same way that it's done for regular tables.
To make use of this feature, an FDW must support the "ctid" column as a
unique row identifier. Currently, since ctid has to be of type TID,
the feature is of limited use, though in principle it could be used by
postgres_fdw. We may eventually allow FDWs to specify another data type
for ctid, which would make it possible for more FDWs to use this feature.
This commit does not modify postgres_fdw to use late locking. We've
tested some prototype code for that, but it's not in committable shape,
and besides it's quite unclear whether it actually makes sense to do late
locking against a remote server. The extra round trips required are likely
to outweigh any benefit from improved concurrency.
Etsuro Fujita, reviewed by Ashutosh Bapat, and hacked up a lot by me
2015-05-12 20:10:10 +02:00
|
|
|
void *ermExtra; /* available for use by relation source node */
|
2011-01-13 02:47:02 +01:00
|
|
|
} ExecRowMark;
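The comment above describes es_rowmarks as an array indexed by range-table index, with NULL entries for RT indexes that carry no mark. A minimal standalone sketch of that lookup convention (simplified, hypothetical stand-in types — `ExecRowMarkSketch`, `EStateSketch`, `find_rowmark` are not the real executor structs; RT indexes are 1-based, so slot `rti - 1` holds the mark):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real executor types (hypothetical names). */
typedef unsigned int Index;

typedef struct ExecRowMarkSketch
{
	Index		rti;			/* range table index, 1-based */
	int			markType;		/* stand-in for RowMarkType */
} ExecRowMarkSketch;

typedef struct EStateSketch
{
	ExecRowMarkSketch **es_rowmarks;	/* NULL if there are no rowmarks */
	Index		es_range_table_size;
} EStateSketch;

/* Lookup in the spirit of ExecFindRowMark(): NULL-tolerant, 1-based rti. */
static ExecRowMarkSketch *
find_rowmark(EStateSketch *estate, Index rti)
{
	if (estate->es_rowmarks == NULL || rti == 0 ||
		rti > estate->es_range_table_size)
		return NULL;
	return estate->es_rowmarks[rti - 1];
}
```

The NULL checks mirror the documented invariants: a wholly absent array, or a NULL slot, both mean "no rowmark for this RT index".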

/*
 * ExecAuxRowMark -
 *	  additional runtime representation of FOR [KEY] UPDATE/SHARE clauses
 *
 * Each LockRows and ModifyTable node keeps a list of the rowmarks it needs to
 * deal with.  In addition to a pointer to the related entry in es_rowmarks,
 * this struct carries the column number(s) of the resjunk columns associated
 * with the rowmark (see comments for PlanRowMark for more detail).  In the
 * case of ModifyTable, there has to be a separate ExecAuxRowMark list for
 * each child plan, because the resjunk columns could be at different physical
 * column positions in different subplans.
 */
typedef struct ExecAuxRowMark
{
	ExecRowMark *rowmark;		/* related entry in es_rowmarks */
	AttrNumber	ctidAttNo;		/* resno of ctid junk attribute, if any */
	AttrNumber	toidAttNo;		/* resno of tableoid junk attribute, if any */
	AttrNumber	wholeAttNo;		/* resno of whole-row junk attribute, if any */
} ExecAuxRowMark;
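The resnos stored here are 1-based attribute positions in one subplan's output row, which is exactly why a separate ExecAuxRowMark list is needed per child plan: the same logical junk column can sit at a different position in each subplan. A standalone sketch of that per-subplan lookup (hypothetical `MiniRow`/`get_junk_attr`; the real code fetches from a TupleTableSlot via ExecGetJunkAttribute, with null handling omitted here):

```c
#include <assert.h>

typedef short AttrNumber;		/* 1-based attribute number, as in PostgreSQL */

/* Hypothetical simplified "row": one subplan's output columns. */
typedef struct MiniRow
{
	const int  *values;
	int			natts;
} MiniRow;

/* Fetch a column by its 1-based resno; attno comes from the per-subplan
 * ExecAuxRowMark, so each subplan resolves its own column position. */
static int
get_junk_attr(const MiniRow *row, AttrNumber attno)
{
	assert(attno >= 1 && attno <= row->natts);
	return row->values[attno - 1];
}
```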


/* ----------------------------------------------------------------
 *				 Tuple Hash Tables
 *
 * All-in-memory tuple hash tables are used for a number of purposes.
 *
 * Note: tab_hash_funcs are for the key datatype(s) stored in the table,
 * and tab_eq_funcs are non-cross-type equality operators for those types.
 * Normally these are the only functions used, but FindTupleHashEntry()
 * supports searching a hashtable using cross-data-type hashing.  For that,
 * the caller must supply hash functions for the LHS datatype as well as
 * the cross-type equality operators to use.  in_hash_funcs and cur_eq_func
 * are set to point to the caller's function arrays while doing such a search.
 * During LookupTupleHashEntry(), they point to tab_hash_funcs and
 * tab_eq_func respectively.
 * ----------------------------------------------------------------
 */
typedef struct TupleHashEntryData *TupleHashEntry;
typedef struct TupleHashTableData *TupleHashTable;

typedef struct TupleHashEntryData
{
	MinimalTuple firstTuple;	/* copy of first tuple in this group */
	void	   *additional;		/* user data */
	uint32		status;			/* hash status */
	uint32		hash;			/* hash value (cached) */
} TupleHashEntryData;

/* define parameters necessary to generate the tuple hash table interface */
#define SH_PREFIX tuplehash
#define SH_ELEMENT_TYPE TupleHashEntryData
#define SH_KEY_TYPE MinimalTuple
#define SH_SCOPE extern
#define SH_DECLARE
#include "lib/simplehash.h"
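lib/simplehash.h is a macro-templated hash table: the includer sets the SH_* parameters and then includes the header, which stamps out an implementation specialized to those names and types (here, the `tuplehash_*` family). A toy standalone illustration of that generate-by-include technique (my own miniature "template" — `SH_DECLARE_COUNTER` is invented for this sketch, nothing like the real simplehash.h internals):

```c
#include <assert.h>

/* Two-level paste so that macro parameters expand before ## is applied. */
#define SH_MAKE_NAME_(prefix, name) prefix##_##name
#define SH_MAKE_NAME(prefix, name) SH_MAKE_NAME_(prefix, name)

/* Toy "template": generates a fixed-capacity counter keyed by ints.
 * The real lib/simplehash.h works the same way at much larger scale. */
#define SH_DECLARE_COUNTER(prefix, capacity) \
	static int SH_MAKE_NAME(prefix, slots)[capacity]; \
	static int SH_MAKE_NAME(prefix, insert)(int key) \
	{ \
		return ++SH_MAKE_NAME(prefix, slots)[key % (capacity)]; \
	} \
	static int SH_MAKE_NAME(prefix, lookup)(int key) \
	{ \
		return SH_MAKE_NAME(prefix, slots)[key % (capacity)]; \
	}

/* "Stamp out" a specialization named tuplecount_*, the way execnodes.h
 * stamps out tuplehash_* via SH_PREFIX above. */
SH_DECLARE_COUNTER(tuplecount, 16)
```

Each expansion produces an independent set of functions and storage, which is why two different includers can each generate their own table type without collisions.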

typedef struct TupleHashTableData
{
	tuplehash_hash *hashtab;	/* underlying hash table */
	int			numCols;		/* number of columns in lookup key */
	AttrNumber *keyColIdx;		/* attr numbers of key columns */
	FmgrInfo   *tab_hash_funcs; /* hash functions for table datatype(s) */
	ExprState  *tab_eq_func;	/* comparator for table datatype(s) */
	Oid		   *tab_collations; /* collations for hash and comparison */
	MemoryContext tablecxt;		/* memory context containing table */
	MemoryContext tempcxt;		/* context for function evaluations */
	Size		entrysize;		/* actual size to make each hash entry */
	TupleTableSlot *tableslot;	/* slot for referencing table entries */
	/* The following fields are set transiently for each table search: */
	TupleTableSlot *inputslot;	/* current input tuple's slot */
	FmgrInfo   *in_hash_funcs;	/* hash functions for input datatype(s) */
	ExprState  *cur_eq_func;	/* comparator for input vs. table */
	uint32		hash_iv;		/* hash-function IV */
	ExprContext *exprcontext;	/* expression context */
} TupleHashTableData;

typedef tuplehash_iterator TupleHashIterator;

/*
 * Use InitTupleHashIterator/TermTupleHashIterator for a read/write scan.
 * Use ResetTupleHashIterator if the table can be frozen (in this case no
 * explicit scan termination is needed).
 */
#define InitTupleHashIterator(htable, iter) \
	tuplehash_start_iterate(htable->hashtab, iter)
#define TermTupleHashIterator(iter) \
	((void) 0)
#define ResetTupleHashIterator(htable, iter) \
	InitTupleHashIterator(htable, iter)
#define ScanTupleHashTable(htable, iter) \
	tuplehash_iterate(htable->hashtab, iter)
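Callers drive a scan with these macros in a fixed pattern: init, loop until the iterate call returns NULL, then terminate. A standalone sketch of that calling pattern (the `Mini*` stand-ins are invented so the loop can run here; the real `tuplehash_*` functions are generated by lib/simplehash.h):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins so the macro calling pattern can run standalone. */
typedef struct MiniTable
{
	int			vals[4];
	int			nvals;
} MiniTable;

typedef struct MiniIterator
{
	int			pos;
} MiniIterator;

static void
mini_start_iterate(MiniTable *tb, MiniIterator *iter)
{
	(void) tb;
	iter->pos = 0;
}

static int *
mini_iterate(MiniTable *tb, MiniIterator *iter)
{
	if (iter->pos >= tb->nvals)
		return NULL;
	return &tb->vals[iter->pos++];
}

/* Same shape as the InitTupleHashIterator/ScanTupleHashTable macros. */
#define InitMiniIterator(htable, iter) mini_start_iterate(htable, iter)
#define TermMiniIterator(iter) ((void) 0)
#define ScanMiniTable(htable, iter) mini_iterate(htable, iter)

/* The canonical caller pattern: init, loop until NULL, then terminate. */
static int
sum_entries(MiniTable *tb)
{
	MiniIterator iter;
	int		   *entry;
	int			sum = 0;

	InitMiniIterator(tb, &iter);
	while ((entry = ScanMiniTable(tb, &iter)) != NULL)
		sum += *entry;
	TermMiniIterator(&iter);
	return sum;
}
```

Note that TermTupleHashIterator expands to a no-op, so termination costs nothing; it exists so callers stay correct if the implementation ever needs real cleanup.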


/* ----------------------------------------------------------------
 *				 Expression State Nodes
 *
 * Formerly, there was a separate executor expression state node corresponding
 * to each node in a planned expression tree.  That's no longer the case; for
 * common expression node types, all the execution info is embedded into
 * step(s) in a single ExprState node.  But we still have a few executor state
 * node types for selected expression node types, mostly those in which info
 * has to be shared with other parts of the execution state tree.
 * ----------------------------------------------------------------
 */

/* ----------------
 *		AggrefExprState node
 * ----------------
 */
typedef struct AggrefExprState
{
	NodeTag		type;
	Aggref	   *aggref;			/* expression plan node */
	int			aggno;			/* ID number for agg within its plan node */
} AggrefExprState;

/* ----------------
 *		WindowFuncExprState node
 * ----------------
 */
typedef struct WindowFuncExprState
{
	NodeTag		type;
	WindowFunc *wfunc;			/* expression plan node */
	List	   *args;			/* ExprStates for argument expressions */
	ExprState  *aggfilter;		/* FILTER expression */
	int			wfuncno;		/* ID number for wfunc within its plan node */
} WindowFuncExprState;

/* ----------------
 *		SetExprState node
 *
 * State for evaluating a potentially set-returning expression (like FuncExpr
 * or OpExpr).  In some cases, like some of the expressions in ROWS FROM(...),
 * the expression might not be a SRF, but nonetheless it uses the same
 * machinery as SRFs; it will be treated as a SRF returning a single row.
 * ----------------
 */
typedef struct SetExprState
{
	NodeTag		type;
	Expr	   *expr;			/* expression plan node */
	List	   *args;			/* ExprStates for argument expressions */

	/*
	 * In ROWS FROM, functions can be inlined, removing the FuncExpr normally
	 * inside.  In such a case this is the compiled expression (which cannot
	 * return a set), which'll be evaluated using regular ExecEvalExpr().
	 */
	ExprState  *elidedFuncState;

	/*
	 * Function manager's lookup info for the target function.  If func.fn_oid
	 * is InvalidOid, we haven't initialized it yet (nor any of the following
	 * fields, except funcReturnsSet).
	 */
	FmgrInfo	func;

	/*
	 * For a set-returning function (SRF) that returns a tuplestore, we keep
	 * the tuplestore here and dole out the result rows one at a time.  The
	 * slot holds the row currently being returned.
	 */
	Tuplestorestate *funcResultStore;
	TupleTableSlot *funcResultSlot;

	/*
	 * In some cases we need to compute a tuple descriptor for the function's
	 * output.  If so, it's stored here.
	 */
	TupleDesc	funcResultDesc;
	bool		funcReturnsTuple;	/* valid when funcResultDesc isn't NULL */

	/*
	 * Remember whether the function is declared to return a set.  This is
	 * set by ExecInitExpr, and is valid even before the FmgrInfo is set up.
	 */
	bool		funcReturnsSet;

	/*
	 * setArgsValid is true when we are evaluating a set-returning function
	 * that uses value-per-call mode and we are in the middle of a call
	 * series; we want to pass the same argument values to the function again
	 * (and again, until it returns ExprEndResult).  This indicates that
	 * fcinfo_data already contains valid argument data.
	 */
	bool		setArgsValid;
|
|
|
|
|
2003-12-18 23:23:42 +01:00
|
|
|
/*
|
|
|
|
* Flag to remember whether we have registered a shutdown callback for
|
Faster expression evaluation and targetlist projection.
This replaces the old, recursive tree-walk based evaluation, with
non-recursive, opcode dispatch based, expression evaluation.
Projection is now implemented as part of expression evaluation.
	 * this SetExprState.  We do so only if funcResultStore or setArgsValid
	 * has been set at least once (since all the callback is for is to release
	 * the tuplestore or clear setArgsValid).
	 */
	bool		shutdown_reg;	/* a shutdown callback is registered */

	/*
	 * Call parameter structure for the function.  This has been initialized
	 * (by InitFunctionCallInfoData) if func.fn_oid is valid.  It also saves
	 * argument values between calls, when setArgsValid is true.
	 */
	FunctionCallInfo fcinfo;
} SetExprState;

/* ----------------
 *		SubPlanState node
 * ----------------
 */
typedef struct SubPlanState
{
	NodeTag		type;
	SubPlan    *subplan;		/* expression plan node */
	struct PlanState *planstate;	/* subselect plan's state tree */
	struct PlanState *parent;	/* parent plan node's state tree */
	ExprState  *testexpr;		/* state of combining expression */
	List	   *args;			/* states of argument expression(s) */
	HeapTuple	curTuple;		/* copy of most recent tuple from subplan */
	Datum		curArray;		/* most recent array from ARRAY() subplan */
	/* these are used when hashing the subselect's output: */
	TupleDesc	descRight;		/* subselect desc after projection */
	ProjectionInfo *projLeft;	/* for projecting lefthand exprs */
	ProjectionInfo *projRight;	/* for projecting subselect output */
	TupleHashTable hashtable;	/* hash table for no-nulls subselect rows */
	TupleHashTable hashnulls;	/* hash table for rows with null(s) */
	bool		havehashrows;	/* true if hashtable is not empty */
	bool		havenullrows;	/* true if hashnulls is not empty */
	MemoryContext hashtablecxt; /* memory context containing hash tables */
	MemoryContext hashtempcxt;	/* temp memory context for hash tables */
	ExprContext *innerecontext; /* econtext for computing inner tuples */
	AttrNumber *keyColIdx;		/* control data for hash tables */
	Oid		   *tab_eq_funcoids;	/* equality func oids for table
									 * datatype(s) */
	Oid		   *tab_collations; /* collations for hash and comparison */
	FmgrInfo   *tab_hash_funcs; /* hash functions for table datatype(s) */
	FmgrInfo   *tab_eq_funcs;	/* equality functions for table datatype(s) */
	FmgrInfo   *lhs_hash_funcs; /* hash functions for lefthand datatype(s) */
	FmgrInfo   *cur_eq_funcs;	/* equality functions for LHS vs. table */
	ExprState  *cur_eq_comp;	/* equality comparator for LHS vs. table */
} SubPlanState;

/* ----------------
 *		AlternativeSubPlanState node
 * ----------------
 */
typedef struct AlternativeSubPlanState
{
	NodeTag		type;
	AlternativeSubPlan *subplan;	/* expression plan node */
	List	   *subplans;		/* SubPlanStates of alternative subplans */
	int			active;			/* list index of the one we're using */
} AlternativeSubPlanState;


/*
 * DomainConstraintState - one item to check during CoerceToDomain
 *
 * Note: we consider this to be part of an ExprState tree, so we give it
 * a name following the xxxState convention.  But there's no directly
 * associated plan-tree node.
 */
typedef enum DomainConstraintType
{
	DOM_CONSTRAINT_NOTNULL,
	DOM_CONSTRAINT_CHECK
} DomainConstraintType;

typedef struct DomainConstraintState
{
	NodeTag		type;
	DomainConstraintType constrainttype;	/* constraint type */
	char	   *name;			/* name of constraint (for error msgs) */
	Expr	   *check_expr;		/* for CHECK, a boolean expression */
	ExprState  *check_exprstate;	/* check_expr's eval state, or NULL */
} DomainConstraintState;


/* ----------------------------------------------------------------
 *				 Executor State Trees
 *
 * An executing query has a PlanState tree paralleling the Plan tree
 * that describes the plan.
 * ----------------------------------------------------------------
 */

/* ----------------
 *		ExecProcNodeMtd
 *
 * This is the method called by ExecProcNode to return the next tuple
 * from an executor node.  It returns NULL, or an empty TupleTableSlot,
 * if no more tuples are available.
 * ----------------
 */
typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);

/* ----------------
 *		PlanState node
 *
 * We never actually instantiate any PlanState nodes; this is just the common
 * abstract superclass for all PlanState-type nodes.
 * ----------------
 */
typedef struct PlanState
{
	NodeTag		type;

	Plan	   *plan;			/* associated Plan node */

	EState	   *state;			/* at execution time, states of individual
								 * nodes point to one EState for the whole
								 * top-level plan */

	ExecProcNodeMtd ExecProcNode;	/* function to return next tuple */
	ExecProcNodeMtd ExecProcNodeReal;	/* actual function, if above is a
										 * wrapper */

	Instrumentation *instrument;	/* Optional runtime stats for this node */
	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */

	/* Per-worker JIT instrumentation */
	struct SharedJitInstrumentation *worker_jit_instrument;

	/*
	 * Common structural data for all Plan types.  These links to subsidiary
	 * state trees parallel links in the associated plan tree (except for the
	 * subPlan list, which does not exist in the plan tree).
	 */
	ExprState  *qual;			/* boolean qual condition */
	struct PlanState *lefttree; /* input plan tree(s) */
	struct PlanState *righttree;

	List	   *initPlan;		/* Init SubPlanState nodes (un-correlated expr
								 * subselects) */
	List	   *subPlan;		/* SubPlanState nodes in my expressions */

	/*
	 * State for management of parameter-change-driven rescanning
	 */
	Bitmapset  *chgParam;		/* set of IDs of changed Params */

	/*
	 * Other run-time state needed by most if not all node types.
	 */
	TupleDesc	ps_ResultTupleDesc; /* node's return type */
	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */

	/*
	 * Scanslot's descriptor if known.  This is a bit of a hack, but otherwise
	 * it's hard for expression compilation to optimize based on the
	 * descriptor, without encoding knowledge about all executor nodes.
	 */
	TupleDesc	scandesc;
|
Introduce notion of different types of slots (without implementing them).
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.
Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots. Thus different types of
slots are needed, which requires adapting creators of slots.
The slot that most efficiently can represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track which type of slot is returned by
nodes, so parent nodes can create their slots based on that.
Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.
But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.
Therefore add a function that allows querying the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().
The scan, result, inner, outer type of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.
This commit does not introduce the actually abstracted implementation
of different kind of TupleTableSlots, that will be left for a followup
commit. The different types of slots introduced will, for now, still
use the same backing implementation.
While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.
Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-16 07:00:30 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Define the slot types for inner, outer and scanslots for expression
|
|
|
|
* contexts with this state as a parent. If *opsset is set, then
|
|
|
|
* *opsfixed indicates whether *ops is guaranteed to be the type of slot
|
|
|
|
* used. That means that every slot in the corresponding
|
|
|
|
* ExprContext.ecxt_*tuple will point to a slot of that type, while
|
|
|
|
* evaluating the expression. If *opsfixed is false, but *ops is set,
|
|
|
|
* that indicates the most likely type of slot.
|
|
|
|
*
|
|
|
|
* The scan* fields are set by ExecInitScanTupleSlot(). If that's not
|
|
|
|
* called, nodes can initialize the fields themselves.
|
|
|
|
*
|
|
|
|
* If outer/inneropsset is false, the information is inferred on-demand
|
|
|
|
* using ExecGetResultSlotOps() on ->righttree/lefttree, using the
|
|
|
|
* corresponding node's resultops* fields.
|
|
|
|
*
|
|
|
|
* The result* fields are automatically set when ExecInitResultSlot is
|
|
|
|
* used (be it directly or when the slot is created by
|
|
|
|
* ExecAssignScanProjectionInfo() /
|
|
|
|
* ExecConditionalAssignProjectionInfo()). If no projection is necessary
|
|
|
|
* ExecConditionalAssignProjectionInfo() defaults those fields to the scan
|
|
|
|
* operations.
|
|
|
|
*/
|
|
|
|
const TupleTableSlotOps *scanops;
|
|
|
|
const TupleTableSlotOps *outerops;
|
|
|
|
const TupleTableSlotOps *innerops;
|
|
|
|
const TupleTableSlotOps *resultops;
|
2019-05-22 18:55:34 +02:00
|
|
|
bool scanopsfixed;
|
|
|
|
bool outeropsfixed;
|
|
|
|
bool inneropsfixed;
|
|
|
|
bool resultopsfixed;
|
|
|
|
bool scanopsset;
|
|
|
|
bool outeropsset;
|
|
|
|
bool inneropsset;
|
|
|
|
bool resultopsset;
|
2003-08-08 23:42:59 +02:00
|
|
|
} PlanState;
|
2002-12-05 16:50:39 +01:00
|
|
|
|
|
|
|
/* ----------------
|
2010-10-26 10:15:17 +02:00
|
|
|
* these are defined to avoid confusion problems with "left"
|
2002-12-05 16:50:39 +01:00
|
|
|
* and "right" and "inner" and "outer". The convention is that
|
|
|
|
* the "left" plan is the "outer" plan and the "right" plan is
|
|
|
|
* the inner plan, but these make the code more readable.
|
|
|
|
* ----------------
|
1996-08-28 03:59:28 +02:00
|
|
|
*/
|
2002-12-05 16:50:39 +01:00
|
|
|
#define innerPlanState(node) (((PlanState *)(node))->righttree)
|
|
|
|
#define outerPlanState(node) (((PlanState *)(node))->lefttree)
|
|
|
|
|
2011-09-22 17:29:18 +02:00
|
|
|
/* Macros for inline access to certain instrumentation counters */
|
2018-04-10 20:56:15 +02:00
|
|
|
#define InstrCountTuples2(node, delta) \
|
|
|
|
do { \
|
|
|
|
if (((PlanState *)(node))->instrument) \
|
|
|
|
((PlanState *)(node))->instrument->ntuples2 += (delta); \
|
|
|
|
} while (0)
|
2011-09-22 17:29:18 +02:00
|
|
|
#define InstrCountFiltered1(node, delta) \
|
|
|
|
do { \
|
|
|
|
if (((PlanState *)(node))->instrument) \
|
|
|
|
((PlanState *)(node))->instrument->nfiltered1 += (delta); \
|
|
|
|
} while(0)
|
|
|
|
#define InstrCountFiltered2(node, delta) \
|
|
|
|
do { \
|
|
|
|
if (((PlanState *)(node))->instrument) \
|
|
|
|
((PlanState *)(node))->instrument->nfiltered2 += (delta); \
|
|
|
|
} while(0)
|
|
|
|
|
Re-implement EvalPlanQual processing to improve its performance and eliminate
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
2009-10-26 03:26:45 +01:00
|
|
|
/*
|
|
|
|
* EPQState is state for executing an EvalPlanQual recheck on a candidate
|
|
|
|
* tuple in ModifyTable or LockRows. The estate and planstate fields are
|
|
|
|
* NULL if inactive.
|
|
|
|
*/
|
|
|
|
typedef struct EPQState
|
|
|
|
{
|
|
|
|
EState *estate; /* subsidiary EState */
|
|
|
|
PlanState *planstate; /* plan state tree ready to be executed */
|
|
|
|
TupleTableSlot *origslot; /* original output tuple to be rechecked */
|
|
|
|
Plan *plan; /* plan tree to be executed */
|
2011-01-13 02:47:02 +01:00
|
|
|
List *arowMarks; /* ExecAuxRowMarks (non-locking only) */
|
2009-10-26 03:26:45 +01:00
|
|
|
int epqParam; /* ID of Param to force scan node re-eval */
|
|
|
|
} EPQState;
|
|
|
|
|
1996-08-28 03:59:28 +02:00
|
|
|
|
|
|
|
/* ----------------
|
1997-09-07 07:04:48 +02:00
|
|
|
* ResultState information
|
1996-08-28 03:59:28 +02:00
|
|
|
* ----------------
|
|
|
|
*/
|
1997-09-07 07:04:48 +02:00
|
|
|
typedef struct ResultState
|
|
|
|
{
|
2002-12-05 16:50:39 +01:00
|
|
|
PlanState ps; /* its first field is NodeTag */
|
2002-12-13 20:46:01 +01:00
|
|
|
ExprState *resconstantqual;
|
2002-12-05 16:50:39 +01:00
|
|
|
bool rs_done; /* are we done? */
|
|
|
|
bool rs_checkqual; /* do we need to check the qual? */
|
1997-09-08 23:56:23 +02:00
|
|
|
} ResultState;
|
1996-08-28 03:59:28 +02:00
|
|
|
|
Move targetlist SRF handling from expression evaluation to new executor node.
Evaluation of set returning functions (SRFs) in the targetlist (like SELECT
generate_series(1,5)) so far was done in the expression evaluation (i.e.
ExecEvalExpr()) and projection (i.e. ExecProject/ExecTargetList) code.
This meant that most executor nodes performing projection, and most
expression evaluation functions, had to deal with the possibility that an
evaluated expression could return a set of return values.
That's bad because it leads to repeated code in a lot of places. It also,
and that's my (Andres's) motivation, made it a lot harder to implement a
more efficient way of doing expression evaluation.
To fix this, introduce a new executor node (ProjectSet) that can evaluate
targetlists containing one or more SRFs. To avoid the complexity of the old
way of handling nested expressions returning sets (e.g. having to pass up
ExprDoneCond, and dealing with arguments to functions returning sets etc.),
those SRFs can only be at the top level of the node's targetlist. The
planner makes sure (via split_pathtarget_at_srfs()) that SRF evaluation is
only necessary in ProjectSet nodes and that SRFs are only present at the
top level of the node's targetlist. If there are nested SRFs the planner
creates multiple stacked ProjectSet nodes. The ProjectSet nodes always get
input from an underlying node.
We also discussed and prototyped evaluating targetlist SRFs using ROWS
FROM(), but that turned out to be more complicated than we'd hoped.
While moving SRF evaluation to ProjectSet would allow to retain the old
"least common multiple" behavior when multiple SRFs are present in one
targetlist (i.e. continue returning rows until all SRFs are at the end of
their input at the same time), we decided to instead only return rows till
all SRFs are exhausted, returning NULL for already exhausted ones. We
deemed the previous behavior to be too confusing, unexpected and actually
not particularly useful.
As a side effect, the previously prohibited case of multiple set returning
arguments to a function, is now allowed. Not because it's particularly
desirable, but because it ends up working and there seems to be no argument
for adding code to prohibit it.
Currently the behavior for COALESCE and CASE containing SRFs has changed,
returning multiple rows from the expression, even when the SRF containing
"arm" of the expression is not evaluated. That's because the SRFs are
evaluated in a separate ProjectSet node. As that's quite confusing, we're
likely to instead prohibit SRFs in those places. But that's still being
discussed, and the code would reside in places not touched here, so that's
a task for later.
There's a lot of, now superfluous, code dealing with set return expressions
around. But as the changes to get rid of those are verbose and largely boring,
it seems better for readability to keep the cleanup as a separate commit.
Author: Tom Lane and Andres Freund
Discussion: https://postgr.es/m/20160822214023.aaxz5l4igypowyri@alap3.anarazel.de
2017-01-18 21:46:50 +01:00
|
|
|
/* ----------------
|
|
|
|
* ProjectSetState information
|
Faster expression evaluation and targetlist projection.
This replaces the old, recursive tree-walk based evaluation, with
non-recursive, opcode dispatch based, expression evaluation.
Projection is now implemented as part of expression evaluation.
This both leads to significant performance improvements, and makes
future just-in-time compilation of expressions easier.
The speed gains primarily come from:
- non-recursive implementation reduces stack usage / overhead
- simple sub-expressions are implemented with a single jump, without
function calls
- sharing some state between different sub-expressions
- reduced amount of indirect/hard to predict memory accesses by laying
out operation metadata sequentially; including the avoidance of
nearly all of the previously used linked lists
- more code has been moved to expression initialization, avoiding
constant re-checks at evaluation time
Future just-in-time compilation (JIT) has become easier, as
demonstrated by released patches intended to be merged in a later
release, for primarily two reasons: Firstly, due to a stricter split
between expression initialization and evaluation, less code has to be
handled by the JIT. Secondly, due to the non-recursive nature of the
generated "instructions", less performance-critical code-paths can
easily be shared between interpreted and compiled evaluation.
The new framework allows for significant future optimizations. E.g.:
- basic infrastructure to later reduce the per executor-startup
overhead of expression evaluation, by caching state in prepared
statements. That'd be helpful in OLTPish scenarios where
initialization overhead is measurable.
- optimizing the generated "code". A number of proposals for potential
work has already been made.
- optimizing the interpreter. Similarly a number of proposals have
been made here too.
The move of logic into the expression initialization step leads to some
backward-incompatible changes:
- Function permission checks are now done during expression
initialization, whereas previously they were done during
execution. In edge cases this can lead to errors being raised that
previously wouldn't have been, e.g. a NULL array being coerced to a
different array type previously didn't perform checks.
- The set of domain constraints to be checked, is now evaluated once
during expression initialization, previously it was re-built
every time a domain check was evaluated. For normal queries this
doesn't change much, but e.g. for plpgsql functions, which cache
ExprStates, the old set could stick around longer. This behavior
might still change.
Author: Andres Freund, with significant changes by Tom Lane,
changes by Heikki Linnakangas
Reviewed-By: Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/20161206034955.bh33paeralxbtluv@alap3.anarazel.de
2017-03-14 23:45:36 +01:00
|
|
|
*
|
|
|
|
* Note: at least one of the "elems" will be a SetExprState; the rest are
|
|
|
|
* regular ExprStates.
|
2017-01-18 21:46:50 +01:00
|
|
|
* ----------------
|
|
|
|
*/
|
|
|
|
typedef struct ProjectSetState
|
|
|
|
{
|
|
|
|
PlanState ps; /* its first field is NodeTag */
|
2017-03-14 23:45:36 +01:00
|
|
|
Node **elems; /* array of expression states */
|
2017-01-18 21:46:50 +01:00
|
|
|
ExprDoneCond *elemdone; /* array of per-SRF is-done states */
|
|
|
|
int nelems; /* length of elemdone[] array */
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
bool pending_srf_tuples; /* still evaluating srfs in tlist? */
|
2017-10-09 00:08:25 +02:00
|
|
|
MemoryContext argcontext; /* context for SRF arguments */
|
2017-01-18 21:46:50 +01:00
|
|
|
} ProjectSetState;
|
|
|
|
|
2009-10-10 03:43:50 +02:00
|
|
|
/* ----------------
|
|
|
|
* ModifyTableState information
|
|
|
|
* ----------------
|
|
|
|
*/
|
|
|
|
typedef struct ModifyTableState
|
|
|
|
{
|
2010-02-26 03:01:40 +01:00
|
|
|
PlanState ps; /* its first field is NodeTag */
|
2018-04-12 12:22:56 +02:00
|
|
|
CmdType operation; /* INSERT, UPDATE, or DELETE */
|
2011-02-26 00:56:23 +01:00
|
|
|
bool canSetTag; /* do we set the command tag/es_processed? */
|
|
|
|
bool mt_done; /* are we done? */
|
2010-02-26 03:01:40 +01:00
|
|
|
PlanState **mt_plans; /* subplans (one per target rel) */
|
|
|
|
int mt_nplans; /* number of plans in the array */
|
|
|
|
int mt_whichplan; /* which one is being executed (0..n-1) */
|
2019-05-22 18:55:34 +02:00
|
|
|
TupleTableSlot **mt_scans; /* input tuple corresponding to underlying
|
|
|
|
* plans */
|
2017-06-21 21:18:54 +02:00
|
|
|
ResultRelInfo *resultRelInfo; /* per-subplan target relations */
|
2017-05-01 14:23:01 +02:00
|
|
|
ResultRelInfo *rootResultRelInfo; /* root target relation (partitioned
|
|
|
|
* table root) */
|
2011-01-13 02:47:02 +01:00
|
|
|
List **mt_arowmarks; /* per-subplan ExecAuxRowMark lists */
|
2010-02-26 03:01:40 +01:00
|
|
|
EPQState mt_epqstate; /* for evaluating EvalPlanQual rechecks */
|
|
|
|
bool fireBSTriggers; /* do we need to fire stmt triggers? */
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
|
2018-03-19 21:43:57 +01:00
|
|
|
|
2018-11-16 18:54:15 +01:00
|
|
|
/*
|
|
|
|
* Slot for storing tuples in the root partitioned table's rowtype during
|
|
|
|
* an UPDATE of a partitioned table.
|
|
|
|
*/
|
|
|
|
TupleTableSlot *mt_root_tuple_slot;
|
|
|
|
|
2017-05-17 22:31:56 +02:00
|
|
|
/* Tuple-routing support info */
|
2018-03-19 21:43:57 +01:00
|
|
|
struct PartitionTupleRouting *mt_partition_tuple_routing;
|
|
|
|
|
Fix SQL-spec incompatibilities in new transition table feature.
The standard says that all changes of the same kind (insert, update, or
delete) caused in one table by a single SQL statement should be reported
in a single transition table; and by that, they mean to include foreign key
enforcement actions cascading from the statement's direct effects. It's
also reasonable to conclude that if the standard had wCTEs, they would say
that effects of wCTEs applying to the same table as each other or the outer
statement should be merged into one transition table. We weren't doing it
like that.
Hence, arrange to merge tuples from multiple update actions into a single
transition table as much as we can. There is a problem, which is that if
the firing of FK enforcement triggers and after-row triggers with
transition tables is interspersed, we might need to report more tuples
after some triggers have already seen the transition table. It seems like
a bad idea for the transition table to be mutable between trigger calls.
There's no good way around this without a major redesign of the FK logic,
so for now, resolve it by opening a new transition table each time this
happens.
Also, ensure that AFTER STATEMENT triggers fire just once per statement,
or once per transition table when we're forced to make more than one.
Previous versions of Postgres have allowed each FK enforcement query
to cause an additional firing of the AFTER STATEMENT triggers for the
referencing table, but that's certainly not per spec. (We're still
doing multiple firings of BEFORE STATEMENT triggers, though; is that
something worth changing?)
Also, forbid using transition tables with column-specific UPDATE triggers.
The spec requires such transition tables to show only the tuples for which
the UPDATE trigger would have fired, which means maintaining multiple
transition tables or else somehow filtering the contents at readout.
Maybe someday we'll bother to support that option, but it looks like a
lot of trouble for a marginal feature.
The transition tables are now managed by the AfterTriggers data structures,
rather than being directly the responsibility of ModifyTable nodes. This
removes a subtransaction-lifespan memory leak introduced by my previous
band-aid patch 3c4359521.
In passing, refactor the AfterTriggers data structures to reduce the
management overhead for them, by using arrays of structs rather than
several parallel arrays for per-query-level and per-subtransaction state.
I failed to resist the temptation to do some copy-editing on the SGML
docs about triggers, above and beyond merely documenting the effects
of this patch.
Back-patch to v10, because we don't want the semantics of transition
tables to change post-release.
Patch by me, with help and review from Thomas Munro.
Discussion: https://postgr.es/m/20170909064853.25630.12825@wrigleys.postgresql.org
2017-09-16 19:20:32 +02:00
|
|
|
/* controls transition table population for specified operation */
|
2018-03-19 21:43:57 +01:00
|
|
|
struct TransitionCaptureState *mt_transition_capture;
|
|
|
|
|
Fix SQL-spec incompatibilities in new transition table feature.
The standard says that all changes of the same kind (insert, update, or
delete) caused in one table by a single SQL statement should be reported
in a single transition table; and by that, they mean to include foreign key
enforcement actions cascading from the statement's direct effects. It's
also reasonable to conclude that if the standard had wCTEs, they would say
that effects of wCTEs applying to the same table as each other or the outer
statement should be merged into one transition table. We weren't doing it
like that.
Hence, arrange to merge tuples from multiple update actions into a single
transition table as much as we can. There is a problem, which is that if
the firing of FK enforcement triggers and after-row triggers with
transition tables is interspersed, we might need to report more tuples
after some triggers have already seen the transition table. It seems like
a bad idea for the transition table to be mutable between trigger calls.
There's no good way around this without a major redesign of the FK logic,
so for now, resolve it by opening a new transition table each time this
happens.
Also, ensure that AFTER STATEMENT triggers fire just once per statement,
or once per transition table when we're forced to make more than one.
Previous versions of Postgres have allowed each FK enforcement query
to cause an additional firing of the AFTER STATEMENT triggers for the
referencing table, but that's certainly not per spec. (We're still
doing multiple firings of BEFORE STATEMENT triggers, though; is that
something worth changing?)
Also, forbid using transition tables with column-specific UPDATE triggers.
The spec requires such transition tables to show only the tuples for which
the UPDATE trigger would have fired, which means maintaining multiple
transition tables or else somehow filtering the contents at readout.
Maybe someday we'll bother to support that option, but it looks like a
lot of trouble for a marginal feature.
The transition tables are now managed by the AfterTriggers data structures,
rather than being directly the responsibility of ModifyTable nodes. This
removes a subtransaction-lifespan memory leak introduced by my previous
band-aid patch 3c4359521.
In passing, refactor the AfterTriggers data structures to reduce the
management overhead for them, by using arrays of structs rather than
several parallel arrays for per-query-level and per-subtransaction state.
I failed to resist the temptation to do some copy-editing on the SGML
docs about triggers, above and beyond merely documenting the effects
of this patch.
Back-patch to v10, because we don't want the semantics of transition
tables to change post-release.
Patch by me, with help and review from Thomas Munro.
Discussion: https://postgr.es/m/20170909064853.25630.12825@wrigleys.postgresql.org
2017-09-16 19:20:32 +02:00
|
|
|
/* controls transition table population for INSERT...ON CONFLICT UPDATE */
|
2018-03-19 21:43:57 +01:00
|
|
|
struct TransitionCaptureState *mt_oc_transition_capture;
|
|
|
|
|
Allow UPDATE to move rows between partitions.
When an UPDATE causes a row to no longer match the partition
constraint, try to move it to a different partition where it does
match the partition constraint. In essence, the UPDATE is split into
a DELETE from the old partition and an INSERT into the new one. This
can lead to surprising behavior in concurrency scenarios because
EvalPlanQual rechecks won't work as they normally did; the known
problems are documented. (There is a pending patch to improve the
situation further, but it needs more review.)
Amit Khandekar, reviewed and tested by Amit Langote, David Rowley,
Rajkumar Raghuwanshi, Dilip Kumar, Amul Sul, Thomas Munro, Álvaro
Herrera, Amit Kapila, and me. A few final revisions by me.
Discussion: http://postgr.es/m/CAJ3gD9do9o2ccQ7j7+tSgiE1REY65XRiMb=yJO3u3QhyP8EEPQ@mail.gmail.com
2018-01-19 21:33:06 +01:00
|
|
|
/* Per plan map for tuple conversion from child to root */
|
2018-03-19 21:43:57 +01:00
|
|
|
TupleConversionMap **mt_per_subplan_tupconv_maps;
|
2009-10-10 03:43:50 +02:00
|
|
|
} ModifyTableState;
|
|
|
|
|
/* ----------------
 *	 AppendState information
 *
 *		nplans			how many plans are in the array
 *		whichplan		which plan is being executed (0 .. n-1), or a
 *						special negative value. See nodeAppend.c.
 *		prune_state		details required to allow partitions to be
 *						eliminated from the scan, or NULL if not possible.
 *		valid_subplans	for runtime pruning, valid appendplans indexes to
 *						scan.
 * ----------------
 */

struct AppendState;
typedef struct AppendState AppendState;
struct ParallelAppendState;
typedef struct ParallelAppendState ParallelAppendState;
struct PartitionPruneState;

struct AppendState
{
	PlanState	ps;				/* its first field is NodeTag */
	PlanState **appendplans;	/* array of PlanStates for my inputs */
	int			as_nplans;
	int			as_whichplan;
	int			as_first_partial_plan;	/* Index of 'appendplans' containing
										 * the first partial plan */
	ParallelAppendState *as_pstate; /* parallel coordination info */
	Size		pstate_len;		/* size of parallel coordination info */
	struct PartitionPruneState *as_prune_state;
	Bitmapset  *as_valid_subplans;
	bool		(*choose_next_subplan) (AppendState *);
};

/* ----------------
 *	 MergeAppendState information
 *
 *		nplans			how many plans are in the array
 *		nkeys			number of sort key columns
 *		sortkeys		sort keys in SortSupport representation
 *		slots			current output tuple of each subplan
 *		heap			heap of active tuples
 *		initialized		true if we have fetched first tuple from each subplan
 *		noopscan		true if partition pruning proved that none of the
 *						mergeplans can contain a record to satisfy this query.
 *		prune_state		details required to allow partitions to be
 *						eliminated from the scan, or NULL if not possible.
 *		valid_subplans	for runtime pruning, valid mergeplans indexes to
 *						scan.
 * ----------------
 */
typedef struct MergeAppendState
{
	PlanState	ps;				/* its first field is NodeTag */
	PlanState **mergeplans;		/* array of PlanStates for my inputs */
	int			ms_nplans;
	int			ms_nkeys;
	SortSupport ms_sortkeys;	/* array of length ms_nkeys */
	TupleTableSlot **ms_slots;	/* array of length ms_nplans */
	struct binaryheap *ms_heap; /* binary heap of slot indices */
	bool		ms_initialized; /* are subplans started? */
	bool		ms_noopscan;
	struct PartitionPruneState *ms_prune_state;
	Bitmapset  *ms_valid_subplans;
} MergeAppendState;

/* ----------------
 *	 RecursiveUnionState information
 *
 *		RecursiveUnionState is used for performing a recursive union.
 *
 *		recursing			T when we're done scanning the non-recursive term
 *		intermediate_empty	T if intermediate_table is currently empty
 *		working_table		working table (to be scanned by recursive term)
 *		intermediate_table	current recursive output (next generation of WT)
 * ----------------
 */
typedef struct RecursiveUnionState
{
	PlanState	ps;				/* its first field is NodeTag */
	bool		recursing;
	bool		intermediate_empty;
	Tuplestorestate *working_table;
	Tuplestorestate *intermediate_table;
	/* Remaining fields are unused in UNION ALL case */
	Oid		   *eqfuncoids;		/* per-grouping-field equality fns */
	FmgrInfo   *hashfunctions;	/* per-grouping-field hash fns */
	MemoryContext tempContext;	/* short-term context for comparisons */
	TupleHashTable hashtable;	/* hash table for tuples already seen */
	MemoryContext tableContext; /* memory context containing hash table */
} RecursiveUnionState;

/* ----------------
 *	 BitmapAndState information
 * ----------------
 */
typedef struct BitmapAndState
{
	PlanState	ps;				/* its first field is NodeTag */
	PlanState **bitmapplans;	/* array of PlanStates for my inputs */
	int			nplans;			/* number of input plans */
} BitmapAndState;

/* ----------------
 *	 BitmapOrState information
 * ----------------
 */
typedef struct BitmapOrState
{
	PlanState	ps;				/* its first field is NodeTag */
	PlanState **bitmapplans;	/* array of PlanStates for my inputs */
	int			nplans;			/* number of input plans */
} BitmapOrState;

/* ----------------------------------------------------------------
 *				 Scan State Information
 * ----------------------------------------------------------------
 */

/* ----------------
 *	 ScanState information
 *
 *		ScanState extends PlanState for node types that represent
 *		scans of an underlying relation.  It can also be used for nodes
 *		that scan the output of an underlying plan node --- in that case,
 *		only ScanTupleSlot is actually useful, and it refers to the tuple
 *		retrieved from the subplan.
 *
 *		currentRelation    relation being scanned (NULL if none)
 *		currentScanDesc    current scan descriptor for scan (NULL if none)
 *		ScanTupleSlot	   pointer to slot in tuple table holding scan tuple
 * ----------------
 */
typedef struct ScanState
{
	PlanState	ps;				/* its first field is NodeTag */
	Relation	ss_currentRelation;
tableam: Add and use scan APIs.
Too allow table accesses to be not directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for a other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends need
to be able to do so without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
intiialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensible adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AMs
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
also would have been needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
|
|
|
struct TableScanDescData *ss_currentScanDesc;
|
2002-12-05 16:50:39 +01:00
|
|
|
TupleTableSlot *ss_ScanTupleSlot;
|
2003-08-08 23:42:59 +02:00
|
|
|
} ScanState;

/* ----------------
 *	 SeqScanState information
 * ----------------
 */
typedef struct SeqScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	Size		pscan_len;		/* size of parallel heap scan descriptor */
} SeqScanState;

/* ----------------
 *	 SampleScanState information
 * ----------------
 */
typedef struct SampleScanState
{
	ScanState	ss;
	List	   *args;			/* expr states for TABLESAMPLE params */
	ExprState  *repeatable;		/* expr state for REPEATABLE expr */
	/* use struct pointer to avoid including tsmapi.h here */
	struct TsmRoutine *tsmroutine;	/* descriptor for tablesample method */
	void	   *tsm_state;		/* tablesample method can keep state here */
	bool		use_bulkread;	/* use bulkread buffer access strategy? */
	bool		use_pagemode;	/* use page-at-a-time visibility checking? */
	bool		begun;			/* false means need to call BeginSampleScan */
	uint32		seed;			/* random seed */
	int64		donetuples;		/* number of tuples already returned */
	bool		haveblock;		/* has a block for sampling been determined */
	bool		done;			/* exhausted all tuples? */
} SampleScanState;

/*
 * These structs store information about index quals that don't have simple
 * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
 * for discussion.
 */
typedef struct
{
	struct ScanKeyData *scan_key;	/* scankey to put value into */
	ExprState  *key_expr;		/* expr to evaluate to get value */
	bool		key_toastable;	/* is expr's result a toastable datatype? */
} IndexRuntimeKeyInfo;

typedef struct
{
	struct ScanKeyData *scan_key;	/* scankey to put value into */
	ExprState  *array_expr;		/* expr to evaluate to get array value */
	int			next_elem;		/* next array element to use */
	int			num_elems;		/* number of elems in current array value */
	Datum	   *elem_values;	/* array of num_elems Datums */
	bool	   *elem_nulls;		/* array of num_elems is-null flags */
} IndexArrayKeyInfo;

/* ----------------
 *	 IndexScanState information
 *
 *		indexqualorig	   execution state for indexqualorig expressions
 *		indexorderbyorig   execution state for indexorderbyorig expressions
 *		ScanKeys		   Skey structures for index quals
 *		NumScanKeys		   number of ScanKeys
 *		OrderByKeys		   Skey structures for index ordering operators
 *		NumOrderByKeys	   number of OrderByKeys
 *		RuntimeKeys		   info about Skeys that must be evaluated at runtime
 *		NumRuntimeKeys	   number of RuntimeKeys
 *		RuntimeKeysReady   true if runtime Skeys have been computed
 *		RuntimeContext	   expr context for evaling runtime Skeys
 *		RelationDesc	   index relation descriptor
 *		ScanDesc		   index scan descriptor
 *
 *		ReorderQueue	   tuples that need reordering due to re-check
 *		ReachedEnd		   have we fetched all tuples from index already?
 *		OrderByValues	   values of ORDER BY exprs of last fetched tuple
 *		OrderByNulls	   null flags for OrderByValues
 *		SortSupport		   for reordering ORDER BY exprs
 *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
 *		OrderByTypLens	   typlens of the datatypes of order by expressions
 *		PscanLen		   size of parallel index scan descriptor
 * ----------------
 */
typedef struct IndexScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	ExprState  *indexqualorig;
	List	   *indexorderbyorig;
	struct ScanKeyData *iss_ScanKeys;
	int			iss_NumScanKeys;
	struct ScanKeyData *iss_OrderByKeys;
	int			iss_NumOrderByKeys;
	IndexRuntimeKeyInfo *iss_RuntimeKeys;
	int			iss_NumRuntimeKeys;
	bool		iss_RuntimeKeysReady;
	ExprContext *iss_RuntimeContext;
	Relation	iss_RelationDesc;
	struct IndexScanDescData *iss_ScanDesc;

	/* These are needed for re-checking ORDER BY expr ordering */
	pairingheap *iss_ReorderQueue;
	bool		iss_ReachedEnd;
	Datum	   *iss_OrderByValues;
	bool	   *iss_OrderByNulls;
	SortSupport iss_SortSupport;
	bool	   *iss_OrderByTypByVals;
	int16	   *iss_OrderByTypLens;
	Size		iss_PscanLen;
} IndexScanState;

/* ----------------
 *	 IndexOnlyScanState information
 *
 *		indexqual		   execution state for indexqual expressions
 *		ScanKeys		   Skey structures for index quals
 *		NumScanKeys		   number of ScanKeys
 *		OrderByKeys		   Skey structures for index ordering operators
 *		NumOrderByKeys	   number of OrderByKeys
 *		RuntimeKeys		   info about Skeys that must be evaluated at runtime
 *		NumRuntimeKeys	   number of RuntimeKeys
 *		RuntimeKeysReady   true if runtime Skeys have been computed
 *		RuntimeContext	   expr context for evaling runtime Skeys
 *		RelationDesc	   index relation descriptor
 *		ScanDesc		   index scan descriptor
 *		TableSlot		   slot for holding tuples fetched from the table
 *		VMBuffer		   buffer in use for visibility map testing, if any
 *		PscanLen		   size of parallel index-only scan descriptor
 * ----------------
 */
typedef struct IndexOnlyScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	ExprState  *indexqual;
	struct ScanKeyData *ioss_ScanKeys;
	int			ioss_NumScanKeys;
	struct ScanKeyData *ioss_OrderByKeys;
	int			ioss_NumOrderByKeys;
	IndexRuntimeKeyInfo *ioss_RuntimeKeys;
	int			ioss_NumRuntimeKeys;
	bool		ioss_RuntimeKeysReady;
	ExprContext *ioss_RuntimeContext;
	Relation	ioss_RelationDesc;
	struct IndexScanDescData *ioss_ScanDesc;
	TupleTableSlot *ioss_TableSlot;
	Buffer		ioss_VMBuffer;
	Size		ioss_PscanLen;
} IndexOnlyScanState;

/* ----------------
 *	 BitmapIndexScanState information
 *
 *		result			   bitmap to return output into, or NULL
 *		ScanKeys		   Skey structures for index quals
 *		NumScanKeys		   number of ScanKeys
 *		RuntimeKeys		   info about Skeys that must be evaluated at runtime
 *		NumRuntimeKeys	   number of RuntimeKeys
 *		ArrayKeys		   info about Skeys that come from ScalarArrayOpExprs
 *		NumArrayKeys	   number of ArrayKeys
 *		RuntimeKeysReady   true if runtime Skeys have been computed
 *		RuntimeContext	   expr context for evaling runtime Skeys
 *		RelationDesc	   index relation descriptor
 *		ScanDesc		   index scan descriptor
 * ----------------
 */
typedef struct BitmapIndexScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	TIDBitmap  *biss_result;
	struct ScanKeyData *biss_ScanKeys;
	int			biss_NumScanKeys;
	IndexRuntimeKeyInfo *biss_RuntimeKeys;
	int			biss_NumRuntimeKeys;
	IndexArrayKeyInfo *biss_ArrayKeys;
	int			biss_NumArrayKeys;
	bool		biss_RuntimeKeysReady;
	ExprContext *biss_RuntimeContext;
	Relation	biss_RelationDesc;
	struct IndexScanDescData *biss_ScanDesc;
} BitmapIndexScanState;

/* ----------------
 *	 SharedBitmapState information
 *
 *		BM_INITIAL		TIDBitmap creation is not yet started, so first
 *						worker to see this state will set the state to
 *						BM_INPROGRESS and that process will be responsible
 *						for creating TIDBitmap.
 *		BM_INPROGRESS	TIDBitmap creation is in progress; workers need to
 *						sleep until it's finished.
 *		BM_FINISHED		TIDBitmap creation is done, so now all workers can
 *						proceed to iterate over TIDBitmap.
 * ----------------
 */
typedef enum
{
	BM_INITIAL,
	BM_INPROGRESS,
	BM_FINISHED
} SharedBitmapState;

/* ----------------
 *	 ParallelBitmapHeapState information
 *		tbmiterator				iterator for scanning current pages
 *		prefetch_iterator		iterator for prefetching ahead of current page
 *		mutex					mutual exclusion for the prefetching variable
 *								and state
 *		prefetch_pages			# pages prefetch iterator is ahead of current
 *		prefetch_target			current target prefetch distance
 *		state					current state of the TIDBitmap
 *		cv						conditional wait variable
 *		phs_snapshot_data		snapshot data shared to workers
 * ----------------
 */
typedef struct ParallelBitmapHeapState
{
	dsa_pointer tbmiterator;
	dsa_pointer prefetch_iterator;
	slock_t		mutex;
	int			prefetch_pages;
	int			prefetch_target;
	SharedBitmapState state;
	ConditionVariable cv;
	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
} ParallelBitmapHeapState;
|
|
|
|
|
2005-04-20 00:35:18 +02:00
|
|
|
/* ----------------
|
|
|
|
* BitmapHeapScanState information
|
|
|
|
*
|
|
|
|
* bitmapqualorig execution state for bitmapqualorig expressions
|
|
|
|
* tbm bitmap obtained from child index scan(s)
|
2009-01-10 22:08:36 +01:00
|
|
|
* tbmiterator iterator for scanning current pages
|
2005-04-20 00:35:18 +02:00
|
|
|
* tbmres current-page data
|
Allow bitmap scans to operate as index-only scans when possible.
If we don't have to return any columns from heap tuples, and there's
no need to recheck qual conditions, and the heap page is all-visible,
then we can skip fetching the heap page altogether.
Skip prefetching pages too, when possible, on the assumption that the
recheck flag will remain the same from one page to the next. While that
assumption is hardly bulletproof, it seems like a good bet most of the
time, and better than prefetching pages we don't need.
This commit installs the executor infrastructure, but doesn't change
any planner cost estimates, thus possibly causing bitmap scans to
not be chosen in cases where this change renders them the best choice.
I (tgl) am not entirely convinced that we need to account for this
behavior in the planner, because I think typically the bitmap scan would
 *		can_skip_fetch	   can we potentially skip tuple fetches in this scan?
 *		return_empty_tuples number of empty tuples to return
 *		vmbuffer		   buffer for visibility-map lookups
 *		pvmbuffer		   ditto, for prefetched pages
 *		exact_pages		   total number of exact pages retrieved
 *		lossy_pages		   total number of lossy pages retrieved
 *		prefetch_iterator  iterator for prefetching ahead of current page
 *		prefetch_pages	   # pages prefetch iterator is ahead of current
 *		prefetch_target    current target prefetch distance
 *		prefetch_maximum   maximum value for prefetch_target
 *		pscan_len		   size of the shared memory for parallel bitmap
 *		initialized		   is node ready to iterate
 *		shared_tbmiterator		   shared iterator
 *		shared_prefetch_iterator   shared iterator for prefetching
 *		pstate			   shared state for parallel bitmap scan
 * ----------------
 */
typedef struct BitmapHeapScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	ExprState  *bitmapqualorig;
	TIDBitmap  *tbm;
	TBMIterator *tbmiterator;
	TBMIterateResult *tbmres;
	bool		can_skip_fetch;
	int			return_empty_tuples;
	Buffer		vmbuffer;
	Buffer		pvmbuffer;
	long		exact_pages;
	long		lossy_pages;
	TBMIterator *prefetch_iterator;
	int			prefetch_pages;
	int			prefetch_target;
	int			prefetch_maximum;
	Size		pscan_len;
	bool		initialized;
	TBMSharedIterator *shared_tbmiterator;
	TBMSharedIterator *shared_prefetch_iterator;
	ParallelBitmapHeapState *pstate;
} BitmapHeapScanState;

/* ----------------
 *	 TidScanState information
 *
 *		tidexprs	   list of TidExpr structs (see nodeTidscan.c)
 *		isCurrentOf    scan has a CurrentOfExpr qual
 *		NumTids		   number of tids in this scan
 *		TidPtr		   index of currently fetched tid
 *		TidList		   evaluated item pointers (array of size NumTids)
 *		htup		   currently-fetched tuple, if any
 * ----------------
 */
typedef struct TidScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	List	   *tss_tidexprs;
	bool		tss_isCurrentOf;
	int			tss_NumTids;
	int			tss_TidPtr;
	ItemPointerData *tss_TidList;
	HeapTupleData tss_htup;
} TidScanState;

/* ----------------
 *	 SubqueryScanState information
 *
 *		SubqueryScanState is used for scanning a sub-query in the range table.
 *		ScanTupleSlot references the current output tuple of the sub-query.
 * ----------------
 */
typedef struct SubqueryScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	PlanState  *subplan;
} SubqueryScanState;

/* ----------------
 *	 FunctionScanState information
 *
 *		Function nodes are used to scan the results of a
 *		function appearing in FROM (typically a function returning set).
 *
 *		eflags		   node's capability flags
 *		ordinality	   is this scan WITH ORDINALITY?
 *		simple		   true if we have 1 function and no ordinality
 *		ordinal		   current ordinal column value
 *		nfuncs		   number of functions being executed
 *		funcstates	   per-function execution states (private in
 *					   nodeFunctionscan.c)
 *		argcontext	   memory context to evaluate function arguments in
 * ----------------
 */
struct FunctionScanPerFuncState;

typedef struct FunctionScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	int			eflags;
	bool		ordinality;
	bool		simple;
	int64		ordinal;
	int			nfuncs;
	struct FunctionScanPerFuncState *funcstates;	/* array of length nfuncs */
	MemoryContext argcontext;
} FunctionScanState;

/* ----------------
 *	 ValuesScanState information
 *
 *		ValuesScan nodes are used to scan the results of a VALUES list
 *
 *		rowcontext			per-expression-list context
 *		exprlists			array of expression lists being evaluated
 *		array_len			size of array
 *		curr_idx			current array index (0-based)
 *
 * Note: ss.ps.ps_ExprContext is used to evaluate any qual or projection
 * expressions attached to the node.  We create a second ExprContext,
 * rowcontext, in which to build the executor expression state for each
 * Values sublist.  Resetting this context lets us get rid of expression
 * state for each row, avoiding major memory leakage over a long values list.
 * ----------------
 */
typedef struct ValuesScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	ExprContext *rowcontext;
	List	  **exprlists;
	int			array_len;
	int			curr_idx;
} ValuesScanState;

/* ----------------
 *	 TableFuncScanState node
 *
 * Used in table-expression functions like XMLTABLE.
 * ----------------
 */
typedef struct TableFuncScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	ExprState  *docexpr;		/* state for document expression */
	ExprState  *rowexpr;		/* state for row-generating expression */
	List	   *colexprs;		/* state for column-generating expressions */
	List	   *coldefexprs;	/* state for column default expressions */
	List	   *ns_names;		/* same as TableFunc.ns_names */
	List	   *ns_uris;		/* list of states of namespace URI exprs */
	Bitmapset  *notnulls;		/* nullability flag for each output column */
	void	   *opaque;			/* table builder private space */
	const struct TableFuncRoutine *routine; /* table builder methods */
	FmgrInfo   *in_functions;	/* input function for each column */
	Oid		   *typioparams;	/* typioparam for each column */
	int64		ordinal;		/* row number to be output next */
	MemoryContext perTableCxt;	/* per-table context */
	Tuplestorestate *tupstore;	/* output tuple store */
} TableFuncScanState;

/* ----------------
 *	 CteScanState information
 *
 *		CteScan nodes are used to scan a CommonTableExpr query.
 *
 * Multiple CteScan nodes can read out from the same CTE query.  We use
 * a tuplestore to hold rows that have been read from the CTE query but
 * not yet consumed by all readers.
 * ----------------
 */
typedef struct CteScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	int			eflags;			/* capability flags to pass to tuplestore */
	int			readptr;		/* index of my tuplestore read pointer */
	PlanState  *cteplanstate;	/* PlanState for the CTE query itself */
	/* Link to the "leader" CteScanState (possibly this same node) */
	struct CteScanState *leader;
	/* The remaining fields are only valid in the "leader" CteScanState */
	Tuplestorestate *cte_table; /* rows already read from the CTE query */
	bool		eof_cte;		/* reached end of CTE query? */
} CteScanState;

/* ----------------
 *	 NamedTuplestoreScanState information
 *
 *		NamedTuplestoreScan nodes are used to scan a Tuplestore created and
 *		named prior to execution of the query.  An example is a transition
 *		table for an AFTER trigger.
 *
 *		Multiple NamedTuplestoreScan nodes can read out from the same
 *		Tuplestore.
 * ----------------
 */
typedef struct NamedTuplestoreScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	int			readptr;		/* index of my tuplestore read pointer */
	TupleDesc	tupdesc;		/* format of the tuples in the tuplestore */
	Tuplestorestate *relation;	/* the rows */
} NamedTuplestoreScanState;

/* ----------------
 *	 WorkTableScanState information
 *
 *	WorkTableScan nodes are used to scan the work table created by
 *	a RecursiveUnion node.  We locate the RecursiveUnion node
 *	during executor startup.
 * ----------------
 */
typedef struct WorkTableScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	RecursiveUnionState *rustate;
} WorkTableScanState;

/* ----------------
 *	 ForeignScanState information
 *
 *	ForeignScan nodes are used to scan foreign-data tables.
 * ----------------
 */
typedef struct ForeignScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	ExprState  *fdw_recheck_quals;	/* original quals not in ss.ps.qual */
	Size		pscan_len;		/* size of parallel coordination information */
	/* use struct pointer to avoid including fdwapi.h here */
	struct FdwRoutine *fdwroutine;
	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
} ForeignScanState;

/* ----------------
 *	 CustomScanState information
 *
 *	CustomScan nodes are used to execute custom code within executor.
 *
 *	Core code must avoid assuming that the CustomScanState is only as large as
 *	the structure declared here; providers are allowed to make it the first
 *	element in a larger structure, and typically would need to do so.  The
 *	struct is actually allocated by the CreateCustomScanState method associated
 *	with the plan node.  Any additional fields can be initialized there, or in
 *	the BeginCustomScan method.
 * ----------------
 */
struct CustomExecMethods;

typedef struct CustomScanState
{
	ScanState	ss;
	uint32		flags;			/* mask of CUSTOMPATH_* flags, see
								 * nodes/extensible.h */
	List	   *custom_ps;		/* list of child PlanState nodes, if any */
	Size		pscan_len;		/* size of parallel coordination information */
	const struct CustomExecMethods *methods;
} CustomScanState;

/* ----------------------------------------------------------------
 *				 Join State Information
 * ----------------------------------------------------------------
 */

/* ----------------
 *	 JoinState information
 *
 *	Superclass for state nodes of join plans.
 * ----------------
 */
typedef struct JoinState
{
	PlanState	ps;
	JoinType	jointype;
	bool		single_match;	/* True if we should skip to next outer tuple
								 * after finding one inner match */
	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
} JoinState;

/* ----------------
 *	 NestLoopState information
 *
 *		NeedNewOuter	   true if need new outer tuple on next call
 *		MatchedOuter	   true if found a join match for current outer tuple
 *		NullInnerTupleSlot prepared null tuple for left outer joins
 * ----------------
 */
typedef struct NestLoopState
{
	JoinState	js;				/* its first field is NodeTag */
	bool		nl_NeedNewOuter;
	bool		nl_MatchedOuter;
	TupleTableSlot *nl_NullInnerTupleSlot;
} NestLoopState;
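/*
 * The NeedNewOuter/MatchedOuter flags above drive a small state machine.
 * This is a hypothetical toy model, not PostgreSQL code: NestLoop and
 * nestloop_next are invented names, integer arrays stand in for relations,
 * and -1 plays the role of the prepared null inner tuple that lets a
 * LEFT JOIN emit unmatched outer rows.
 */

```c
#include <assert.h>
#include <stddef.h>

typedef struct NestLoop
{
	const int  *outer;
	int			nouter;
	const int  *inner;
	int			ninner;
	int			outerpos;
	int			innerpos;
	int			need_new_outer; /* cf. nl_NeedNewOuter */
	int			matched_outer;	/* cf. nl_MatchedOuter */
} NestLoop;

/* Return 1 and set (*o, *i) for the next joined row; 0 at end of join. */
static int
nestloop_next(NestLoop *nl, int *o, int *i)
{
	for (;;)
	{
		if (nl->need_new_outer)
		{
			if (nl->outerpos >= nl->nouter)
				return 0;		/* join complete */
			nl->need_new_outer = 0;
			nl->matched_outer = 0;
			nl->innerpos = 0;	/* rescan inner for this outer tuple */
		}
		if (nl->innerpos >= nl->ninner)
		{
			int			cur = nl->outer[nl->outerpos];

			nl->outerpos++;
			nl->need_new_outer = 1;
			if (!nl->matched_outer)
			{
				*o = cur;
				*i = -1;		/* null-extended inner side (left join) */
				return 1;
			}
			continue;
		}
		if (nl->outer[nl->outerpos] == nl->inner[nl->innerpos])
		{
			nl->matched_outer = 1;
			*o = nl->outer[nl->outerpos];
			*i = nl->inner[nl->innerpos++];
			return 1;
		}
		nl->innerpos++;
	}
}
```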

/* ----------------
 *	 MergeJoinState information
 *
 *		NumClauses		   number of mergejoinable join clauses
 *		Clauses			   info for each mergejoinable clause
 *		JoinState		   current state of ExecMergeJoin state machine
 *		SkipMarkRestore    true if we may skip Mark and Restore operations
 *		ExtraMarks		   true to issue extra Mark operations on inner scan
 *		ConstFalseJoin	   true if we have a constant-false joinqual
 *		FillOuter		   true if should emit unjoined outer tuples anyway
 *		FillInner		   true if should emit unjoined inner tuples anyway
 *		MatchedOuter	   true if found a join match for current outer tuple
 *		MatchedInner	   true if found a join match for current inner tuple
 *		OuterTupleSlot	   slot in tuple table for cur outer tuple
 *		InnerTupleSlot	   slot in tuple table for cur inner tuple
 *		MarkedTupleSlot    slot in tuple table for marked tuple
 *		NullOuterTupleSlot prepared null tuple for right outer joins
 *		NullInnerTupleSlot prepared null tuple for left outer joins
 *		OuterEContext	   workspace for computing outer tuple's join values
 *		InnerEContext	   workspace for computing inner tuple's join values
 * ----------------
 */
/* private in nodeMergejoin.c: */
typedef struct MergeJoinClauseData *MergeJoinClause;

typedef struct MergeJoinState
{
	JoinState	js;				/* its first field is NodeTag */
	int			mj_NumClauses;
	MergeJoinClause mj_Clauses; /* array of length mj_NumClauses */
	int			mj_JoinState;
	bool		mj_SkipMarkRestore;
	bool		mj_ExtraMarks;
	bool		mj_ConstFalseJoin;
	bool		mj_FillOuter;
	bool		mj_FillInner;
	bool		mj_MatchedOuter;
	bool		mj_MatchedInner;
	TupleTableSlot *mj_OuterTupleSlot;
	TupleTableSlot *mj_InnerTupleSlot;
	TupleTableSlot *mj_MarkedTupleSlot;
	TupleTableSlot *mj_NullOuterTupleSlot;
	TupleTableSlot *mj_NullInnerTupleSlot;
	ExprContext *mj_OuterEContext;
	ExprContext *mj_InnerEContext;
} MergeJoinState;

/* ----------------
 *	 HashJoinState information
 *
 *		hashclauses				original form of the hashjoin condition
 *		hj_OuterHashKeys		the outer hash keys in the hashjoin condition
 *		hj_HashOperators		the join operators in the hashjoin condition
 *		hj_HashTable			hash table for the hashjoin
 *								(NULL if table not built yet)
 *		hj_CurHashValue			hash value for current outer tuple
 *		hj_CurBucketNo			regular bucket# for current outer tuple
 *		hj_CurSkewBucketNo		skew bucket# for current outer tuple
 *		hj_CurTuple				last inner tuple matched to current outer
 *								tuple, or NULL if starting search
 *								(hj_CurXXX variables are undefined if
 *								OuterTupleSlot is empty!)
 *		hj_OuterTupleSlot		tuple slot for outer tuples
 *		hj_HashTupleSlot		tuple slot for inner (hashed) tuples
 *		hj_NullOuterTupleSlot	prepared null tuple for right/full outer joins
 *		hj_NullInnerTupleSlot	prepared null tuple for left/full outer joins
 *		hj_FirstOuterTupleSlot	first tuple retrieved from outer plan
 *		hj_JoinState			current state of ExecHashJoin state machine
 *		hj_MatchedOuter			true if found a join match for current outer
 *		hj_OuterNotEmpty		true if outer relation known not empty
 * ----------------
 */

/* these structs are defined in executor/hashjoin.h: */
typedef struct HashJoinTupleData *HashJoinTuple;
typedef struct HashJoinTableData *HashJoinTable;

typedef struct HashJoinState
{
	JoinState	js;				/* its first field is NodeTag */
	ExprState  *hashclauses;
	List	   *hj_OuterHashKeys;	/* list of ExprState nodes */
	List	   *hj_HashOperators;	/* list of operator OIDs */
	List	   *hj_Collations;
	HashJoinTable hj_HashTable;
	uint32		hj_CurHashValue;
	int			hj_CurBucketNo;
	int			hj_CurSkewBucketNo;
	HashJoinTuple hj_CurTuple;
	TupleTableSlot *hj_OuterTupleSlot;
	TupleTableSlot *hj_HashTupleSlot;
	TupleTableSlot *hj_NullOuterTupleSlot;
	TupleTableSlot *hj_NullInnerTupleSlot;
	TupleTableSlot *hj_FirstOuterTupleSlot;
	int			hj_JoinState;
	bool		hj_MatchedOuter;
	bool		hj_OuterNotEmpty;
} HashJoinState;
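/*
 * The build/probe structure behind the fields above can be sketched in
 * miniature.  This is a hypothetical toy model, not PostgreSQL code: the
 * names HashTable, hash_build, and hash_probe_count are invented.  The
 * bucket-chain walk in hash_probe_count plays the role that hj_CurBucketNo
 * and hj_CurTuple play in ExecHashJoin.
 */

```c
#include <assert.h>
#include <string.h>

#define NBUCKETS 8
#define MAX_TUPLES 32

typedef struct HashTable
{
	int			keys[MAX_TUPLES];
	int			next[MAX_TUPLES];	/* bucket chain, cf. hj_CurTuple */
	int			buckets[NBUCKETS];	/* chain head per bucket, -1 = empty */
	int			ntuples;
} HashTable;

static int
hash_key(int key)
{
	return (int) (((unsigned int) key * 2654435761u) % NBUCKETS);
}

/* Build phase: insert every inner key at the head of its bucket chain. */
static void
hash_build(HashTable *ht, const int *inner, int n)
{
	memset(ht->buckets, -1, sizeof(ht->buckets));
	ht->ntuples = 0;
	for (int i = 0; i < n; i++)
	{
		int			b = hash_key(inner[i]);

		ht->keys[ht->ntuples] = inner[i];
		ht->next[ht->ntuples] = ht->buckets[b];
		ht->buckets[b] = ht->ntuples++;
	}
}

/* Probe phase: count inner tuples matching one outer key. */
static int
hash_probe_count(const HashTable *ht, int outer_key)
{
	int			nmatch = 0;

	for (int t = ht->buckets[hash_key(outer_key)]; t >= 0; t = ht->next[t])
	{
		if (ht->keys[t] == outer_key)
			nmatch++;
	}
	return nmatch;
}
```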

/* ----------------------------------------------------------------
 *				 Materialization State Information
 * ----------------------------------------------------------------
 */

/* ----------------
 *	 MaterialState information
 *
 *	materialize nodes are used to materialize the results
 *	of a subplan into a temporary file.
 *
 *	ss.ss_ScanTupleSlot refers to output of underlying plan.
 * ----------------
 */
typedef struct MaterialState
{
	ScanState	ss;				/* its first field is NodeTag */
	int			eflags;			/* capability flags to pass to tuplestore */
	bool		eof_underlying; /* reached end of underlying plan? */
	Tuplestorestate *tuplestorestate;
} MaterialState;

/* ----------------
 *	 Shared memory container for per-worker sort information
 * ----------------
 */
typedef struct SharedSortInfo
{
	int			num_workers;
	TuplesortInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedSortInfo;

/* ----------------
 *	 SortState information
 * ----------------
 */
typedef struct SortState
{
	ScanState	ss;				/* its first field is NodeTag */
	bool		randomAccess;	/* need random access to sort output? */
	bool		bounded;		/* is the result set bounded? */
	int64		bound;			/* if bounded, how many tuples are needed */
	bool		sort_Done;		/* sort completed yet? */
	bool		bounded_Done;	/* value of bounded we did the sort with */
	int64		bound_Done;		/* value of bound we did the sort with */
	void	   *tuplesortstate; /* private state of tuplesort.c */
	bool		am_worker;		/* are we a worker? */
	SharedSortInfo *shared_info;	/* one entry per worker */
} SortState;
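/*
 * The bounded/bound fields above enable a "top-N" sort: only the bound
 * smallest tuples are retained, so memory stays O(bound) regardless of
 * input size.  This is a hypothetical toy model, not PostgreSQL code:
 * BoundedSort and bounded_sort_put are invented names, and a simple
 * insertion sort stands in for tuplesort.c's bounded heap.
 */

```c
#include <assert.h>

#define MAX_BOUND 16

typedef struct BoundedSort
{
	int			bound;			/* how many tuples are needed */
	int			nkept;
	int			kept[MAX_BOUND];	/* retained values, ascending order */
} BoundedSort;

static void
bounded_sort_put(BoundedSort *bs, int value)
{
	int			pos;

	/* already full and value can't be in the top N: discard immediately */
	if (bs->nkept == bs->bound && value >= bs->kept[bs->nkept - 1])
		return;
	if (bs->nkept < bs->bound)
		bs->nkept++;
	/* insertion step: shift larger kept values right, dropping the largest */
	for (pos = bs->nkept - 1; pos > 0 && bs->kept[pos - 1] > value; pos--)
		bs->kept[pos] = bs->kept[pos - 1];
	bs->kept[pos] = value;
}
```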

/* ---------------------
 *	 GroupState information
 * ---------------------
 */
typedef struct GroupState
{
	ScanState	ss;				/* its first field is NodeTag */
	ExprState  *eqfunction;		/* equality function */
	bool		grp_done;		/* indicates completion of Group scan */
} GroupState;
|
|
|
|
|
|
|
|
/* ---------------------
|
|
|
|
* AggState information
|
1996-08-28 03:59:28 +02:00
|
|
|
*
|
2002-12-05 16:50:39 +01:00
|
|
|
* ss.ss_ScanTupleSlot refers to output of underlying plan.
|
2000-07-12 04:37:39 +02:00
|
|
|
*
|
2002-12-05 16:50:39 +01:00
|
|
|
* Note: ss.ps.ps_ExprContext contains ecxt_aggvalues and
|
2002-11-06 23:31:24 +01:00
|
|
|
* ecxt_aggnulls arrays, which hold the computed agg values for the current
|
|
|
|
* input group during evaluation of an Agg node's output tuple(s). We
|
|
|
|
* create a second ExprContext, tmpcontext, in which to evaluate input
|
|
|
|
* expressions and run the aggregate transition functions.
|
2016-09-05 19:09:54 +02:00
|
|
|
* ---------------------
|
1996-08-28 03:59:28 +02:00
|
|
|
*/
|
2002-11-06 23:31:24 +01:00
|
|
|
/* these structs are private in nodeAgg.c: */
|
|
|
|
typedef struct AggStatePerAggData *AggStatePerAgg;
|
2015-08-04 16:53:10 +02:00
|
|
|
typedef struct AggStatePerTransData *AggStatePerTrans;
|
2002-11-06 23:31:24 +01:00
|
|
|
typedef struct AggStatePerGroupData *AggStatePerGroup;
|
Support GROUPING SETS, CUBE and ROLLUP.
This SQL standard functionality allows to aggregate data by different
GROUP BY clauses at once. Each grouping set returns rows with columns
grouped by in other sets set to NULL.
This could previously be achieved by doing each grouping as a separate
query, conjoined by UNION ALLs. Besides being considerably more concise,
grouping sets will in many cases be faster, requiring only one scan over
the underlying data.
The current implementation of grouping sets only supports using sorting
for input. Individual sets that share a sort order are computed in one
pass. If there are sets that don't share a sort order, additional sort &
aggregation steps are performed. These additional passes are sourced by
the previous sort step; thus avoiding repeated scans of the source data.
The code is structured in a way that adding support for purely using
hash aggregation or a mix of hashing and sorting is possible. Sorting
was chosen to be supported first, as it is the most generic method of
implementation.
Instead of, as in an earlier versions of the patch, representing the
chain of sort and aggregation steps as full blown planner and executor
nodes, all but the first sort are performed inside the aggregation node
itself. This avoids the need to do some unusual gymnastics to handle
having to return aggregated and non-aggregated tuples from underlying
nodes, as well as having to shut down underlying nodes early to limit
memory usage. The optimizer still builds Sort/Agg node to describe each
phase, but they're not part of the plan tree, but instead additional
data for the aggregation node. They're a convenient and preexisting way
to describe aggregation and sorting. The first (and possibly only) sort
step is still performed as a separate execution step. That retains
similarity with existing group by plans, makes rescans fairly simple,
avoids very deep plans (leading to slow explains) and easily allows to
avoid the sorting step if the underlying data is sorted by other means.
A somewhat ugly side of this patch is having to deal with a grammar
ambiguity between the new CUBE keyword and the cube extension/functions
named cube (and rollup). To avoid breaking existing deployments of the
cube extension it has not been renamed, neither has cube been made a
reserved keyword. Instead precedence hacking is used to make GROUP BY
cube(..) refer to the CUBE grouping sets feature, and not the function
cube(). To actually group by a function cube(), unlikely as that might
be, the function name has to be quoted.
Needs a catversion bump because stored rules may change.
Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund
Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas
Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule
Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com
2015-05-16 03:40:59 +02:00
|
|
|
typedef struct AggStatePerPhaseData *AggStatePerPhase;
|
2017-03-27 05:20:54 +02:00
|
|
|
typedef struct AggStatePerHashData *AggStatePerHash;
|
1999-09-26 23:21:15 +02:00
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
typedef struct AggState
|
|
|
|
{
|
2002-12-05 16:50:39 +01:00
|
|
|
ScanState ss; /* its first field is NodeTag */
|
1999-08-21 05:49:17 +02:00
|
|
|
List *aggs; /* all Aggref nodes in targetlist & quals */
|
1999-09-26 23:21:15 +02:00
|
|
|
int numaggs; /* length of list (could be zero!) */
|
2015-08-04 16:53:10 +02:00
|
|
|
int numtrans; /* number of pertrans items */
|
2017-03-27 05:20:54 +02:00
|
|
|
AggStrategy aggstrategy; /* strategy mode */
|
2016-06-26 20:33:38 +02:00
|
|
|
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
|
Support GROUPING SETS, CUBE and ROLLUP.
This SQL standard functionality allows to aggregate data by different
GROUP BY clauses at once. Each grouping set returns rows with columns
grouped by in other sets set to NULL.
This could previously be achieved by doing each grouping as a separate
query, conjoined by UNION ALLs. Besides being considerably more concise,
grouping sets will in many cases be faster, requiring only one scan over
the underlying data.
The current implementation of grouping sets only supports using sorting
for input. Individual sets that share a sort order are computed in one
pass. If there are sets that don't share a sort order, additional sort &
aggregation steps are performed. These additional passes are sourced by
the previous sort step; thus avoiding repeated scans of the source data.
The code is structured in a way that adding support for purely using
hash aggregation or a mix of hashing and sorting is possible. Sorting
was chosen to be supported first, as it is the most generic method of
implementation.
Instead of, as in an earlier versions of the patch, representing the
chain of sort and aggregation steps as full blown planner and executor
nodes, all but the first sort are performed inside the aggregation node
itself. This avoids the need to do some unusual gymnastics to handle
having to return aggregated and non-aggregated tuples from underlying
nodes, as well as having to shut down underlying nodes early to limit
memory usage. The optimizer still builds Sort/Agg node to describe each
phase, but they're not part of the plan tree, but instead additional
data for the aggregation node. They're a convenient and preexisting way
to describe aggregation and sorting. The first (and possibly only) sort
step is still performed as a separate execution step. That retains
similarity with existing group by plans, makes rescans fairly simple,
avoids very deep plans (leading to slow explains) and easily allows to
avoid the sorting step if the underlying data is sorted by other means.
A somewhat ugly side of this patch is having to deal with a grammar
ambiguity between the new CUBE keyword and the cube extension/functions
named cube (and rollup). To avoid breaking existing deployments of the
cube extension it has not been renamed, neither has cube been made a
reserved keyword. Instead precedence hacking is used to make GROUP BY
cube(..) refer to the CUBE grouping sets feature, and not the function
cube(). To actually group by a function cube(), unlikely as that might
be, the function name has to be quoted.
Needs a catversion bump because stored rules may change.
Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund
Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas
Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule
Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com
2015-05-16 03:40:59 +02:00
|
|
|
AggStatePerPhase phase; /* pointer to current phase data */
|
2017-03-27 05:20:54 +02:00
|
|
|
int numphases; /* number of phases (including phase 0) */
|
Support GROUPING SETS, CUBE and ROLLUP.
This SQL standard functionality allows to aggregate data by different
GROUP BY clauses at once. Each grouping set returns rows with columns
grouped by in other sets set to NULL.
This could previously be achieved by doing each grouping as a separate
query, conjoined by UNION ALLs. Besides being considerably more concise,
grouping sets will in many cases be faster, requiring only one scan over
the underlying data.
The current implementation of grouping sets only supports using sorting
for input. Individual sets that share a sort order are computed in one
pass. If there are sets that don't share a sort order, additional sort &
aggregation steps are performed. These additional passes are sourced by
the previous sort step; thus avoiding repeated scans of the source data.
The code is structured in a way that adding support for purely using
hash aggregation or a mix of hashing and sorting is possible. Sorting
was chosen to be supported first, as it is the most generic method of
implementation.
Instead of, as in an earlier versions of the patch, representing the
chain of sort and aggregation steps as full blown planner and executor
nodes, all but the first sort are performed inside the aggregation node
itself. This avoids the need to do some unusual gymnastics to handle
having to return aggregated and non-aggregated tuples from underlying
nodes, as well as having to shut down underlying nodes early to limit
memory usage. The optimizer still builds Sort/Agg nodes to describe each
phase, but they're not part of the plan tree; instead they are additional
data for the aggregation node. They're a convenient and preexisting way
to describe aggregation and sorting. The first (and possibly only) sort
step is still performed as a separate execution step. That retains
similarity with existing group by plans, makes rescans fairly simple,
avoids very deep plans (leading to slow explains), and makes it easy to
skip the sorting step if the underlying data is sorted by other means.
A somewhat ugly side of this patch is having to deal with a grammar
ambiguity between the new CUBE keyword and the cube extension/functions
named cube (and rollup). To avoid breaking existing deployments of the
cube extension it has not been renamed, nor has cube been made a
reserved keyword. Instead precedence hacking is used to make GROUP BY
cube(..) refer to the CUBE grouping sets feature, and not the function
cube(). To actually group by a function cube(), unlikely as that might
be, the function name has to be quoted.
Needs a catversion bump because stored rules may change.
Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund
Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas
Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule
Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com
2015-05-16 03:40:59 +02:00
|
|
|
int current_phase; /* current phase number */
|
2002-11-06 23:31:24 +01:00
|
|
|
AggStatePerAgg peragg; /* per-Aggref information */
|
2015-08-04 16:53:10 +02:00
|
|
|
AggStatePerTrans pertrans; /* per-Trans state information */
|
2017-03-27 05:20:54 +02:00
|
|
|
ExprContext *hashcontext; /* econtexts for long-lived data (hashtable) */
|
2015-05-16 03:40:59 +02:00
|
|
|
ExprContext **aggcontexts; /* econtexts for long-lived data (per GS) */
|
2002-11-06 23:31:24 +01:00
|
|
|
ExprContext *tmpcontext; /* econtext for input expressions */
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_AGGSTATE_CURAGGCONTEXT 14
|
2017-03-27 05:20:54 +02:00
|
|
|
ExprContext *curaggcontext; /* currently active aggcontext */
|
2017-10-12 21:20:04 +02:00
|
|
|
AggStatePerAgg curperagg; /* currently active aggregate, if any */
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_AGGSTATE_CURPERTRANS 16
|
2017-10-12 21:20:04 +02:00
|
|
|
AggStatePerTrans curpertrans; /* currently active trans state, if any */
|
2015-05-24 03:35:49 +02:00
|
|
|
bool input_done; /* indicates end of input */
|
1999-09-26 23:21:15 +02:00
|
|
|
bool agg_done; /* indicates completion of Agg scan */
|
2015-05-16 03:40:59 +02:00
|
|
|
int projected_set; /* The last projected grouping set */
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_AGGSTATE_CURRENT_SET 20
|
2015-05-16 03:40:59 +02:00
|
|
|
int current_set; /* The current grouping set being evaluated */
|
|
|
|
Bitmapset *grouped_cols; /* grouped cols in current projection */
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop further left than before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
List *all_grouped_cols; /* list of all grouped cols in DESC order */
|
2015-05-16 03:40:59 +02:00
|
|
|
/* These fields are for grouping set phase data */
|
|
|
|
int maxsets; /* The max number of sets in any phase */
|
|
|
|
AggStatePerPhase phases; /* array of all phases */
|
2017-03-27 05:20:54 +02:00
|
|
|
Tuplesortstate *sort_in; /* sorted input to phases > 1 */
|
2015-05-16 03:40:59 +02:00
|
|
|
Tuplesortstate *sort_out; /* input is copied here for next phase */
|
|
|
|
TupleTableSlot *sort_slot; /* slot for sort results */
|
2002-11-06 23:31:24 +01:00
|
|
|
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
|
2018-01-03 03:02:37 +01:00
|
|
|
AggStatePerGroup *pergroups; /* grouping set indexed array of per-group
|
|
|
|
* pointers */
|
2003-08-04 02:43:34 +02:00
|
|
|
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
|
2017-03-27 05:20:54 +02:00
|
|
|
/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
|
2002-11-06 23:31:24 +01:00
|
|
|
bool table_filled; /* hash table filled yet? */
|
2017-03-27 05:20:54 +02:00
|
|
|
int num_hashes;
|
2018-01-09 22:25:38 +01:00
|
|
|
AggStatePerHash perhash; /* array of per-hashtable data */
|
2018-01-03 03:02:37 +01:00
|
|
|
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
|
|
|
|
* per-group pointers */
|
2018-01-09 22:25:38 +01:00
|
|
|
|
2017-10-16 21:24:36 +02:00
|
|
|
/* support for evaluation of agg input expressions: */
|
2018-01-24 08:20:02 +01:00
|
|
|
#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
|
2018-01-09 22:25:38 +01:00
|
|
|
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, then
|
|
|
|
* ->hash_pergroup */
|
2017-10-16 21:24:36 +02:00
|
|
|
ProjectionInfo *combinedproj; /* projection machinery */
|
1997-09-08 22:59:27 +02:00
|
|
|
} AggState;
|
1996-08-28 03:59:28 +02:00
|
|
|
|
2008-12-28 19:54:01 +01:00
|
|
|
/* ----------------
|
|
|
|
* WindowAggState information
|
|
|
|
* ----------------
|
|
|
|
*/
|
|
|
|
/* these structs are private in nodeWindowAgg.c: */
|
|
|
|
typedef struct WindowStatePerFuncData *WindowStatePerFunc;
|
|
|
|
typedef struct WindowStatePerAggData *WindowStatePerAgg;
|
|
|
|
|
|
|
|
typedef struct WindowAggState
|
|
|
|
{
|
2009-06-11 16:49:15 +02:00
|
|
|
ScanState ss; /* its first field is NodeTag */
|
2008-12-28 19:54:01 +01:00
|
|
|
|
|
|
|
/* these fields are filled in by ExecInitExpr: */
|
2009-06-11 16:49:15 +02:00
|
|
|
List *funcs; /* all WindowFunc nodes in targetlist */
|
|
|
|
int numfuncs; /* total number of window functions */
|
|
|
|
int numaggs; /* number that are plain aggregates */
|
2008-12-28 19:54:01 +01:00
|
|
|
|
2009-06-11 16:49:15 +02:00
|
|
|
WindowStatePerFunc perfunc; /* per-window-function information */
|
|
|
|
WindowStatePerAgg peragg; /* per-plain-aggregate information */
|
2018-04-26 20:47:16 +02:00
|
|
|
ExprState *partEqfunction; /* equality funcs for partition columns */
|
|
|
|
ExprState *ordEqfunction; /* equality funcs for ordering columns */
|
2009-06-11 16:49:15 +02:00
|
|
|
Tuplestorestate *buffer; /* stores rows of current partition */
|
Support all SQL:2011 options for window frame clauses.
This patch adds the ability to use "RANGE offset PRECEDING/FOLLOWING"
frame boundaries in window functions. We'd punted on that back in the
original patch to add window functions, because it was not clear how to
do it in a reasonably data-type-extensible fashion. That problem is
resolved here by adding the ability for btree operator classes to provide
an "in_range" support function that defines how to add or subtract the
RANGE offset value. Factoring it this way also allows the operator class
to avoid overflow problems near the ends of the datatype's range, if it
wishes to expend effort on that. (In the committed patch, the integer
opclasses handle that issue, but it did not seem worth the trouble to
avoid overflow failures for datetime types.)
The patch includes in_range support for the integer_ops opfamily
(int2/int4/int8) as well as the standard datetime types. Support for
other numeric types has been requested, but that seems like suitable
material for a follow-on patch.
In addition, the patch adds GROUPS mode which counts the offset in
ORDER-BY peer groups rather than rows, and it adds the frame_exclusion
options specified by SQL:2011. As far as I can see, we are now fully
up to spec on window framing options.
Existing behaviors remain unchanged, except that I changed the errcode
for a couple of existing error reports to meet the SQL spec's expectation
that negative "offset" values should be reported as SQLSTATE 22013.
Internally and in relevant parts of the documentation, we now consistently
use the terminology "offset PRECEDING/FOLLOWING" rather than "value
PRECEDING/FOLLOWING", since the term "value" is confusingly vague.
Oliver Ford, reviewed and whacked around some by me
Discussion: https://postgr.es/m/CAGMVOdu9sivPAxbNN0X+q19Sfv9edEPv=HibOJhB14TJv_RCQg@mail.gmail.com
2018-02-07 06:06:50 +01:00
|
|
|
int current_ptr; /* read pointer # for current row */
|
|
|
|
int framehead_ptr; /* read pointer # for frame head, if used */
|
|
|
|
int frametail_ptr; /* read pointer # for frame tail, if used */
|
|
|
|
int grouptail_ptr; /* read pointer # for group tail, if used */
|
2009-06-11 16:49:15 +02:00
|
|
|
int64 spooled_rows; /* total # of rows in buffer */
|
|
|
|
int64 currentpos; /* position of current row in partition */
|
2010-02-12 18:33:21 +01:00
|
|
|
int64 frameheadpos; /* current frame head position */
|
2018-02-07 06:06:50 +01:00
|
|
|
int64 frametailpos; /* current frame tail position (frame end+1) */
|
2010-02-12 18:33:21 +01:00
|
|
|
/* use struct pointer to avoid including windowapi.h here */
|
2017-06-21 21:18:54 +02:00
|
|
|
struct WindowObjectData *agg_winobj; /* winobj for aggregate fetches */
|
2010-02-12 18:33:21 +01:00
|
|
|
int64 aggregatedbase; /* start row for current aggregates */
|
2009-06-11 16:49:15 +02:00
|
|
|
int64 aggregatedupto; /* rows before this one are aggregated */
|
|
|
|
|
2010-02-12 18:33:21 +01:00
|
|
|
int frameOptions; /* frame_clause options, see WindowDef */
|
|
|
|
ExprState *startOffset; /* expression for starting bound offset */
|
|
|
|
ExprState *endOffset; /* expression for ending bound offset */
|
2017-06-21 21:18:54 +02:00
|
|
|
Datum startOffsetValue; /* result of startOffset evaluation */
|
2010-02-26 03:01:40 +01:00
|
|
|
Datum endOffsetValue; /* result of endOffset evaluation */
|
2010-02-12 18:33:21 +01:00
|
|
|
|
2018-02-07 06:06:50 +01:00
|
|
|
/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
|
|
|
|
FmgrInfo startInRangeFunc; /* in_range function for startOffset */
|
|
|
|
FmgrInfo endInRangeFunc; /* in_range function for endOffset */
|
|
|
|
Oid inRangeColl; /* collation for in_range tests */
|
|
|
|
bool inRangeAsc; /* use ASC sort order for in_range tests? */
|
|
|
|
bool inRangeNullsFirst; /* nulls sort first for in_range tests? */
|
|
|
|
|
|
|
|
/* these fields are used in GROUPS mode: */
|
|
|
|
int64 currentgroup; /* peer group # of current row in partition */
|
|
|
|
int64 frameheadgroup; /* peer group # of frame head row */
|
|
|
|
int64 frametailgroup; /* peer group # of frame tail row */
|
|
|
|
int64 groupheadpos; /* current row's peer group head position */
|
|
|
|
int64 grouptailpos; /* " " " " tail position (group end+1) */
|
|
|
|
|
2010-02-12 18:33:21 +01:00
|
|
|
MemoryContext partcontext; /* context for partition-lifespan data */
	MemoryContext aggcontext;	/* shared context for aggregate working data */
	MemoryContext curaggcontext;	/* current aggregate's working data */
	ExprContext *tmpcontext;	/* short-term evaluation context */

	bool		all_first;		/* true if the scan is starting */
	bool		all_done;		/* true if the scan is finished */
	bool		partition_spooled;	/* true if all tuples in current partition
									 * have been spooled into tuplestore */
	bool		more_partitions;	/* true if there's more partitions after
									 * this one */
	bool		framehead_valid;	/* true if frameheadpos is known up to
									 * date for current row */
	bool		frametail_valid;	/* true if frametailpos is known up to
									 * date for current row */
	bool		grouptail_valid;	/* true if grouptailpos is known up to
									 * date for current row */
	TupleTableSlot *first_part_slot;	/* first tuple of current or next
										 * partition */
	TupleTableSlot *framehead_slot; /* first tuple of current frame */
	TupleTableSlot *frametail_slot; /* first tuple after current frame */

	/* temporary slots for tuples fetched back from tuplestore */
	TupleTableSlot *agg_row_slot;
	TupleTableSlot *temp_slot_1;
	TupleTableSlot *temp_slot_2;
} WindowAggState;

/* ----------------
 *	 UniqueState information
 *
 *		Unique nodes are used "on top of" sort nodes to discard
 *		duplicate tuples returned from the sort phase.  Basically
 *		all it does is compare the current tuple from the subplan
 *		with the previously fetched tuple (stored in its result slot).
 *		If the two are identical in all interesting fields, then
 *		we just fetch another tuple from the sort and try again.
 * ----------------
 */
typedef struct UniqueState
{
	PlanState	ps;				/* its first field is NodeTag */
	ExprState  *eqfunction;		/* tuple equality qual */
} UniqueState;

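As a toy illustration of the algorithm the comment above describes (this is not executor code; `unique_filter` and its plain-int "tuples" are invented for this sketch), a Unique-style pass over already-sorted input only has to compare each value against the previously emitted one:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy analogue of the Unique node: scan sorted input and emit a value only
 * when it differs from the previously emitted one, discarding adjacent
 * duplicates.  Returns the number of values written to out[].
 */
static size_t
unique_filter(const int *sorted, size_t n, int *out)
{
	size_t		nout = 0;

	for (size_t i = 0; i < n; i++)
	{
		/* compare current "tuple" with the previously fetched one */
		if (nout == 0 || out[nout - 1] != sorted[i])
			out[nout++] = sorted[i];
	}
	return nout;
}
```

Like the real node, this relies entirely on the input being sorted: duplicates are adjacent, so a single saved value suffices and no hash table is needed.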
/* ----------------
 *	 GatherState information
 *
 *		Gather nodes launch 1 or more parallel workers, run a subplan
 *		in those workers, and collect the results.
 * ----------------
 */
typedef struct GatherState
{
	PlanState	ps;				/* its first field is NodeTag */
	bool		initialized;	/* workers launched? */
	bool		need_to_scan_locally;	/* need to read from local plan? */
	int64		tuples_needed;	/* tuple bound, see ExecSetTupleBound */
	/* these fields are set up once: */
	TupleTableSlot *funnel_slot;
	struct ParallelExecutorInfo *pei;
	/* all remaining fields are reinitialized during a rescan: */
	int			nworkers_launched;	/* original number of workers */
	int			nreaders;		/* number of still-active workers */
	int			nextreader;		/* next one to try to read from */
	struct TupleQueueReader **reader;	/* array with nreaders active entries */
} GatherState;

/* ----------------
 *	 GatherMergeState information
 *
 *		Gather merge nodes launch 1 or more parallel workers, run a
 *		subplan which produces sorted output in each worker, and then
 *		merge the results into a single sorted stream.
 * ----------------
 */
struct GMReaderTupleBuffer;		/* private in nodeGatherMerge.c */

typedef struct GatherMergeState
{
	PlanState	ps;				/* its first field is NodeTag */
	bool		initialized;	/* workers launched? */
	bool		gm_initialized; /* gather_merge_init() done? */
	bool		need_to_scan_locally;	/* need to read from local plan? */
	int64		tuples_needed;	/* tuple bound, see ExecSetTupleBound */
	/* these fields are set up once: */
	TupleDesc	tupDesc;		/* descriptor for subplan result tuples */
	int			gm_nkeys;		/* number of sort columns */
	SortSupport gm_sortkeys;	/* array of length gm_nkeys */
	struct ParallelExecutorInfo *pei;
	/* all remaining fields are reinitialized during a rescan */
	/* (but the arrays are not reallocated, just cleared) */
	int			nworkers_launched;	/* original number of workers */
	int			nreaders;		/* number of active workers */
	TupleTableSlot **gm_slots;	/* array with nreaders+1 entries */
	struct TupleQueueReader **reader;	/* array with nreaders active entries */
	struct GMReaderTupleBuffer *gm_tuple_buffers;	/* nreaders tuple buffers */
	struct binaryheap *gm_heap; /* binary heap of slot indices */
} GatherMergeState;

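The merge step described above can be sketched in miniature (this is not executor code; `Stream` and `gather_merge` are invented for this illustration): each worker's output stream is already sorted, so the leader only has to repeatedly pick the smallest head value among the streams. The real node tracks the stream heads in a binary heap (`gm_heap`); a linear scan is used here for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* One already-sorted input stream (stand-in for a worker's tuple queue). */
typedef struct
{
	const int  *vals;			/* sorted values produced by this stream */
	size_t		len;			/* number of values */
	size_t		pos;			/* next unread position */
} Stream;

/*
 * Toy k-way merge: repeatedly emit the smallest head value among the
 * streams until all are exhausted.  Returns the number of values emitted.
 */
static size_t
gather_merge(Stream *streams, size_t nstreams, int *out)
{
	size_t		nout = 0;

	for (;;)
	{
		Stream	   *best = NULL;

		for (size_t i = 0; i < nstreams; i++)
		{
			Stream	   *s = &streams[i];

			if (s->pos < s->len &&
				(best == NULL || s->vals[s->pos] < best->vals[best->pos]))
				best = s;
		}
		if (best == NULL)
			break;				/* all streams exhausted */
		out[nout++] = best->vals[best->pos++];
	}
	return nout;
}
```

With a heap, selecting the next value costs O(log n) in the number of streams rather than O(n), which matters once many workers are launched.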
/* ----------------
 *	 Values displayed by EXPLAIN ANALYZE
 * ----------------
 */
typedef struct HashInstrumentation
{
	int			nbuckets;		/* number of buckets at end of execution */
	int			nbuckets_original;	/* planned number of buckets */
	int			nbatch;			/* number of batches at end of execution */
	int			nbatch_original;	/* planned number of batches */
	size_t		space_peak;		/* peak memory usage in bytes */
} HashInstrumentation;

/* ----------------
 *	 Shared memory container for per-worker hash information
 * ----------------
 */
typedef struct SharedHashInfo
{
	int			num_workers;
	HashInstrumentation hinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedHashInfo;

/* ----------------
 *	 HashState information
 * ----------------
 */
typedef struct HashState
{
	PlanState	ps;				/* its first field is NodeTag */
	HashJoinTable hashtable;	/* hash table for the hashjoin */
	List	   *hashkeys;		/* list of ExprState nodes */

	SharedHashInfo *shared_info;	/* one entry per worker */
	HashInstrumentation *hinstrument;	/* this worker's entry */

	/* Parallel hash state. */
	struct ParallelHashJoinState *parallel_state;
} HashState;

/* ----------------
 *	 SetOpState information
 *
 *		Even in "sorted" mode, SetOp nodes are more complex than a simple
 *		Unique, since we have to count how many duplicates to return.  But
 *		we also support hashing, so this is really more like a cut-down
 *		form of Agg.
 * ----------------
 */
/* this struct is private in nodeSetOp.c: */
typedef struct SetOpStatePerGroupData *SetOpStatePerGroup;

typedef struct SetOpState
{
	PlanState	ps;				/* its first field is NodeTag */
	ExprState  *eqfunction;		/* equality comparator */
	Oid		   *eqfuncoids;		/* per-grouping-field equality fns */
	FmgrInfo   *hashfunctions;	/* per-grouping-field hash fns */
	bool		setop_done;		/* indicates completion of output scan */
	long		numOutput;		/* number of dups left to output */
	/* these fields are used in SETOP_SORTED mode: */
	SetOpStatePerGroup pergroup;	/* per-group working state */
	HeapTuple	grp_firstTuple; /* copy of first tuple of current group */
	/* these fields are used in SETOP_HASHED mode: */
	TupleHashTable hashtable;	/* hash table with one entry per group */
	MemoryContext tableContext; /* memory context containing hash table */
	bool		table_filled;	/* hash table filled yet? */
	TupleHashIterator hashiter; /* for iterating through hash table */
} SetOpState;

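The "count how many duplicates to return" part mentioned above follows directly from the SQL semantics of the ALL set operations (the helper names below are invented for this sketch, not executor code): once SetOp knows how many copies of a group each input contributed, INTERSECT ALL emits the smaller of the two counts, and EXCEPT ALL emits the left count minus the right count, floored at zero.

```c
#include <assert.h>

/* Toy per-group output counts for the ALL set operations. */
static long
setop_intersect_all(long numLeft, long numRight)
{
	/* INTERSECT ALL: emit min(numLeft, numRight) copies */
	return numLeft < numRight ? numLeft : numRight;
}

static long
setop_except_all(long numLeft, long numRight)
{
	/* EXCEPT ALL: emit max(numLeft - numRight, 0) copies */
	long		n = numLeft - numRight;

	return n > 0 ? n : 0;
}
```

The non-ALL variants then just clamp the result to at most one copy per group.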
/* ----------------
|
|
|
|
* LockRowsState information
|
|
|
|
*
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
* LockRows nodes are used to enforce FOR [KEY] UPDATE/SHARE locking.
|
2009-10-12 20:10:51 +02:00
|
|
|
* ----------------
|
|
|
|
*/
|
|
|
|
typedef struct LockRowsState
|
|
|
|
{
|
|
|
|
PlanState ps; /* its first field is NodeTag */
|
2011-01-13 02:47:02 +01:00
|
|
|
List *lr_arowMarks; /* List of ExecAuxRowMarks */
|
Re-implement EvalPlanQual processing to improve its performance and eliminate
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and, much
worse, could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
2009-10-26 03:26:45 +01:00
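The runtime-Param trick described above can be sketched in miniature: each scan node remembers the last Param "generation" it saw, and when the LockRows machinery bumps that generation, the scan restarts and emits only the substituted test tuple instead of rescanning its whole relation. This is a hypothetical illustration, not the real executor API; all names here (`ScanState`, `EPQContext`, `scan_next`) are invented for the sketch.

```c
#include <stddef.h>

/* Illustrative stand-ins for executor state; not PostgreSQL's real types. */
typedef struct ScanSketch
{
	const int  *relation;		/* underlying rows */
	int			nrows;
	int			pos;			/* next position in a normal scan */
	int			seen_param;		/* Param generation last noticed */
	int			epq_done;		/* test tuple already emitted? */
} ScanSketch;

typedef struct EPQSketch
{
	int			param_generation;	/* bumped to signal a recheck */
	int			test_tuple;			/* the one row jammed in for the recheck */
} EPQSketch;

/* Return the next row, or -1 when the scan is exhausted. */
static int
scan_next(ScanSketch *s, const EPQSketch *epq)
{
	if (epq->param_generation != s->seen_param)
	{
		/* Param changed: restart the scan in EPQ mode. */
		s->seen_param = epq->param_generation;
		s->epq_done = 0;
	}
	if (s->seen_param > 0)
	{
		/* EPQ mode: emit the substituted tuple exactly once. */
		if (s->epq_done)
			return -1;
		s->epq_done = 1;
		return epq->test_tuple;
	}
	/* Normal mode: walk the relation. */
	return (s->pos < s->nrows) ? s->relation[s->pos++] : -1;
}
```

Bumping `param_generation` is the analogue of signaling the special runtime Param: the already-built plan is simply rescanned, with no executor init/shutdown per rechecked row.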
|
|
|
EPQState lr_epqstate; /* for evaluating EvalPlanQual rechecks */
|
2009-10-12 20:10:51 +02:00
|
|
|
} LockRowsState;
|
|
|
|
|
2000-10-26 23:38:24 +02:00
|
|
|
/* ----------------
|
|
|
|
* LimitState information
|
|
|
|
*
|
|
|
|
* Limit nodes are used to enforce LIMIT/OFFSET clauses.
|
|
|
|
* They just select the desired subrange of their subplan's output.
|
|
|
|
*
|
|
|
|
* offset is the number of initial tuples to skip (0 does nothing).
|
|
|
|
* count is the number of tuples to return after skipping the offset tuples.
|
|
|
|
* If no limit count was specified, count is undefined and noCount is true.
|
2002-11-22 23:10:01 +01:00
|
|
|
* When lstate == LIMIT_INITIAL, offset/count/noCount haven't been set yet.
|
2000-10-26 23:38:24 +02:00
|
|
|
* ----------------
|
|
|
|
*/
|
2002-11-22 23:10:01 +01:00
|
|
|
typedef enum
|
|
|
|
{
|
|
|
|
LIMIT_INITIAL, /* initial state for LIMIT node */
|
2007-05-17 21:35:08 +02:00
|
|
|
LIMIT_RESCAN, /* rescan after recomputing parameters */
|
2002-11-22 23:10:01 +01:00
|
|
|
LIMIT_EMPTY, /* there are no returnable rows */
|
|
|
|
LIMIT_INWINDOW, /* have returned a row in the window */
|
|
|
|
LIMIT_SUBPLANEOF, /* at EOF of subplan (within window) */
|
|
|
|
LIMIT_WINDOWEND, /* stepped off end of window */
|
|
|
|
LIMIT_WINDOWSTART /* stepped off beginning of window */
|
2003-08-08 23:42:59 +02:00
|
|
|
} LimitStateCond;
|
2002-11-22 23:10:01 +01:00
|
|
|
|
2000-10-26 23:38:24 +02:00
|
|
|
typedef struct LimitState
|
|
|
|
{
|
2002-12-05 16:50:39 +01:00
|
|
|
PlanState ps; /* its first field is NodeTag */
|
2002-12-13 20:46:01 +01:00
|
|
|
ExprState *limitOffset; /* OFFSET parameter, or NULL if none */
|
|
|
|
ExprState *limitCount; /* COUNT parameter, or NULL if none */
|
2006-07-26 02:34:48 +02:00
|
|
|
int64 offset; /* current OFFSET value */
|
|
|
|
int64 count; /* current COUNT, if any */
|
2000-10-26 23:38:24 +02:00
|
|
|
bool noCount; /* if true, ignore count */
|
2002-11-22 23:10:01 +01:00
|
|
|
LimitStateCond lstate; /* state machine status, as above */
|
2006-07-26 02:34:48 +02:00
|
|
|
int64 position; /* 1-based index of last tuple returned */
|
2002-11-22 23:10:01 +01:00
|
|
|
TupleTableSlot *subSlot; /* tuple last obtained from subplan */
|
2000-10-26 23:38:24 +02:00
|
|
|
} LimitState;
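The offset/count/noCount semantics documented above can be sketched as a small pull-style iterator: skip the first `offset` tuples from the subplan, then return at most `count` tuples, or all remaining ones when `noCount` is true. This is a simplified illustration only (the real node is a state machine over `LimitStateCond`, supports rescans, and can step backwards); the names `LimitSketch` and `limit_next` are invented here.

```c
#include <stddef.h>

/* Simplified stand-in for LimitState; not the real executor node. */
typedef struct LimitSketch
{
	long		offset;			/* tuples still to skip */
	long		count;			/* tuples still allowed to return */
	int			noCount;		/* if true, ignore count */
} LimitSketch;

/* Pull the next in-window value from a subplan modeled as an array of n
 * values; return -1 when the window or the subplan is exhausted. */
static int
limit_next(LimitSketch *ls, const int *subplan, int n, int *pos)
{
	while (*pos < n && ls->offset > 0)
	{
		(*pos)++;				/* discard tuples before the window */
		ls->offset--;
	}
	if (*pos >= n)
		return -1;				/* subplan exhausted */
	if (!ls->noCount)
	{
		if (ls->count <= 0)
			return -1;			/* stepped off the end of the window */
		ls->count--;
	}
	return subplan[(*pos)++];
}
```

For example, `offset = 1, count = 2` over the stream 1..5 yields 2 and 3, while `offset = 2` with `noCount` set yields everything from 3 onward.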
|
|
|
|
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
#endif /* EXECNODES_H */
|