/*-------------------------------------------------------------------------
 *
 * execProcnode.c
 *	 contains dispatch functions which call the appropriate "initialize",
 *	 "get a tuple", and "cleanup" routines for the given node type.
 *	 If the node has children, then it will presumably call ExecInitNode,
 *	 ExecProcNode, or ExecEndNode on its subnodes and do the appropriate
 *	 processing.
 *
 * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/backend/executor/execProcnode.c
 *
 *-------------------------------------------------------------------------
 */
/*
 *	 NOTES
 *		This used to be three files.  It is now all combined into
 *		one file so that it is easier to keep the dispatch routines
 *		in sync when new nodes are added.
 *
 *	 EXAMPLE
 *		Suppose we want the age of the manager of the shoe department and
 *		the number of employees in that department.  So we have the query:
 *
 *              select DEPT.no_emps, EMP.age
 *              from DEPT, EMP
 *              where EMP.name = DEPT.mgr and
 *                    DEPT.name = "shoe"
 *
 *		Suppose the planner gives us the following plan:
 *
 *                      Nest Loop (DEPT.mgr = EMP.name)
 *                      /       \
 *                     /         \
 *                 Seq Scan     Seq Scan
 *                  DEPT          EMP
 *              (name = "shoe")
 *
 *		ExecutorStart() is called first.
 *		It calls InitPlan() which calls ExecInitNode() on
 *		the root of the plan -- the nest loop node.
 *
 *	  * ExecInitNode() notices that it is looking at a nest loop and
 *		as the code below demonstrates, it calls ExecInitNestLoop().
 *		Eventually this calls ExecInitNode() on the right and left subplans
 *		and so forth until the entire plan is initialized.  The result
 *		of ExecInitNode() is a plan state tree built with the same structure
 *		as the underlying plan tree.
 *
 *	  * Then when ExecutorRun() is called, it calls ExecutePlan() which calls
 *		ExecProcNode() repeatedly on the top node of the plan state tree.
 *		Each time this happens, ExecProcNode() will end up calling
 *		ExecNestLoop(), which calls ExecProcNode() on its subplans.
 *		Each of these subplans is a sequential scan so ExecSeqScan() is
 *		called.  The slots returned by ExecSeqScan() may contain
 *		tuples which contain the attributes ExecNestLoop() uses to
 *		form the tuples it returns.
 *
 *	  * Eventually ExecSeqScan() stops returning tuples and the nest
 *		loop join ends.  Lastly, ExecutorEnd() calls ExecEndNode() which
 *		calls ExecEndNestLoop() which in turn calls ExecEndNode() on
 *		its subplans, which results in calls to ExecEndSeqScan().
 *
 *		This should show how the executor works by having
 *		ExecInitNode(), ExecProcNode() and ExecEndNode() dispatch
 *		their work to the appropriate node support routines which may
 *		in turn call these routines themselves on their subplans.
 */
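/*
 *	 SKETCH (illustrative only, not executor code)
 *		The walk-through above can be mimicked on a toy plan tree.  The
 *		block below is a minimal model under stated assumptions: every
 *		Toy and toy_ name is hypothetical, rows are plain ints, the join
 *		qual is simple equality, and the inner side is rewound by resetting
 *		a cursor.  It is not how ExecNestLoop() or ExecSeqScan() are
 *		implemented; it only shows "initialize", "get a tuple" and
 *		"cleanup" routines recursing into their child nodes.
 */

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node tags; the real executor switches on NodeTag values. */
typedef enum ToyTag { TOY_SEQSCAN, TOY_NESTLOOP } ToyTag;

/* Read-only plan node: a scan over an int array, or a join of two children. */
typedef struct ToyPlan
{
	ToyTag		tag;
	const struct ToyPlan *left;		/* outer child (joins only) */
	const struct ToyPlan *right;	/* inner child (joins only) */
	const int  *rows;				/* input rows (scans only) */
	int			nrows;
} ToyPlan;

/* Run-time state node; the state tree mirrors the plan tree. */
typedef struct ToyState
{
	const ToyPlan *plan;
	struct ToyState *left;
	struct ToyState *right;
	int			pos;			/* scan cursor */
	int			outer;			/* current outer row (joins only) */
	int			have_outer;		/* is 'outer' valid? */
} ToyState;

/* "initialize": recursively build one state node per plan node. */
ToyState *
toy_init_node(const ToyPlan *plan)
{
	ToyState   *state;

	if (plan == NULL)
		return NULL;			/* ran past a leaf */
	state = calloc(1, sizeof(ToyState));
	if (state == NULL)
		abort();
	state->plan = plan;
	state->left = toy_init_node(plan->left);
	state->right = toy_init_node(plan->right);
	return state;
}

/* "get a tuple": store the next row in *out and return 1, or return 0 at end. */
int
toy_proc_node(ToyState *state, int *out)
{
	int			inner;

	if (state->plan->tag == TOY_SEQSCAN)
	{
		if (state->pos >= state->plan->nrows)
			return 0;
		*out = state->plan->rows[state->pos++];
		return 1;
	}

	/* TOY_NESTLOOP.  Toy restriction: the inner child is a rewindable scan. */
	assert(state->right->plan->tag == TOY_SEQSCAN);
	for (;;)
	{
		if (!state->have_outer)
		{
			if (!toy_proc_node(state->left, &state->outer))
				return 0;		/* outer side exhausted: join is done */
			state->have_outer = 1;
			state->right->pos = 0;	/* rescan the inner side */
		}
		while (toy_proc_node(state->right, &inner))
		{
			if (inner == state->outer)	/* toy join qual */
			{
				*out = inner;
				return 1;
			}
		}
		state->have_outer = 0;	/* inner exhausted: advance outer */
	}
}

/* "cleanup": recursively release the state tree. */
void
toy_end_node(ToyState *state)
{
	if (state == NULL)
		return;
	toy_end_node(state->left);
	toy_end_node(state->right);
	free(state);
}
```

/*
 *		A driver would call toy_init_node() once on the root, call
 *		toy_proc_node() on the returned state until it returns 0, and then
 *		call toy_end_node(), loosely mirroring the ExecutorStart(),
 *		ExecutorRun() and ExecutorEnd() sequence described above.
 */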

#include "postgres.h"

#include "executor/executor.h"
#include "executor/nodeAgg.h"
#include "executor/nodeAppend.h"
#include "executor/nodeBitmapAnd.h"
#include "executor/nodeBitmapHeapscan.h"
#include "executor/nodeBitmapIndexscan.h"
#include "executor/nodeBitmapOr.h"
#include "executor/nodeCtescan.h"
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
#include "executor/nodeFunctionscan.h"
Move targetlist SRF handling from expression evaluation to new executor node.
Evaluation of set returning functions (SRFs) in the targetlist (like SELECT
generate_series(1,5)) so far was done in the expression evaluation (i.e.
ExecEvalExpr()) and projection (i.e. ExecProject/ExecTargetList) code.
This meant that most executor nodes performing projection, and most
expression evaluation functions, had to deal with the possibility that an
evaluated expression could return a set of return values.
That's bad because it leads to repeated code in a lot of places. It also,
and that's my (Andres's) motivation, made it a lot harder to implement a
more efficient way of doing expression evaluation.
To fix this, introduce a new executor node (ProjectSet) that can evaluate
targetlists containing one or more SRFs. To avoid the complexity of the old
way of handling nested expressions returning sets (e.g. having to pass up
ExprDoneCond, and dealing with arguments to functions returning sets etc.),
those SRFs can only be at the top level of the node's targetlist. The
planner makes sure (via split_pathtarget_at_srfs()) that SRF evaluation is
only necessary in ProjectSet nodes and that SRFs are only present at the
top level of the node's targetlist. If there are nested SRFs the planner
creates multiple stacked ProjectSet nodes. The ProjectSet nodes always get
input from an underlying node.
We also discussed and prototyped evaluating targetlist SRFs using ROWS
FROM(), but that turned out to be more complicated than we'd hoped.
While moving SRF evaluation to ProjectSet would allow to retain the old
"least common multiple" behavior when multiple SRFs are present in one
targetlist (i.e. continue returning rows until all SRFs are at the end of
their input at the same time), we decided to instead only return rows till
all SRFs are exhausted, returning NULL for already exhausted ones. We
deemed the previous behavior to be too confusing, unexpected and actually
not particularly useful.
As a side effect, the previously prohibited case of multiple set returning
arguments to a function, is now allowed. Not because it's particularly
desirable, but because it ends up working and there seems to be no argument
for adding code to prohibit it.
Currently the behavior for COALESCE and CASE containing SRFs has changed,
returning multiple rows from the expression, even when the SRF containing
"arm" of the expression is not evaluated. That's because the SRFs are
evaluated in a separate ProjectSet node. As that's quite confusing, we're
likely to instead prohibit SRFs in those places. But that's still being
discussed, and the code would reside in places not touched here, so that's
a task for later.
There's a lot of, now superfluous, code dealing with set return expressions
around. But as the changes to get rid of those are verbose and largely boring,
it seems better for readability to keep the cleanup as a separate commit.
Author: Tom Lane and Andres Freund
Discussion: https://postgr.es/m/20160822214023.aaxz5l4igypowyri@alap3.anarazel.de
2017-01-18 21:46:50 +01:00
#include "executor/nodeGather.h"
#include "executor/nodeGatherMerge.h"
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produces results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparisons while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
#include "executor/nodeLockRows.h"
#include "executor/nodeMaterial.h"
#include "executor/nodeMergeAppend.h"
#include "executor/nodeMergejoin.h"
#include "executor/nodeModifyTable.h"
#include "executor/nodeNamedtuplestorescan.h"
#include "executor/nodeNestloop.h"
#include "executor/nodeProjectSet.h"
#include "executor/nodeRecursiveunion.h"
#include "executor/nodeResult.h"
#include "executor/nodeSamplescan.h"
#include "executor/nodeSeqscan.h"
#include "executor/nodeSetOp.h"
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/nodeSubqueryscan.h"
#include "executor/nodeTableFuncscan.h"
#include "executor/nodeTidscan.h"
#include "executor/nodeUnique.h"
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"

static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);


/* ------------------------------------------------------------------------
 *		ExecInitNode
 *
 *		Recursively initializes all the nodes in the plan tree rooted
 *		at 'node'.
 *
 *		Inputs:
 *		  'node' is the current node of the plan produced by the query planner
 *		  'estate' is the shared execution state for the plan tree
 *		  'eflags' is a bitwise OR of flag bits described in executor.h
 *
 *		Returns a PlanState node corresponding to the given Plan node.
 * ------------------------------------------------------------------------
 */
PlanState *
ExecInitNode(Plan *node, EState *estate, int eflags)
{
	PlanState  *result;
	List	   *subps;
	ListCell   *l;

	/*
	 * do nothing when we get to the end of a leaf of the tree.
	 */
	if (node == NULL)
		return NULL;

	/*
	 * Make sure there's enough stack available. Need to check here, in
	 * addition to ExecProcNode() (via ExecProcNodeFirst()), to ensure the
	 * stack isn't overrun while initializing the node tree.
	 */
	check_stack_depth();

	switch (nodeTag(node))
	{
			/*
			 * control nodes
			 */
		case T_Result:
			result = (PlanState *) ExecInitResult((Result *) node,
												  estate, eflags);
			break;

		case T_ProjectSet:
			result = (PlanState *) ExecInitProjectSet((ProjectSet *) node,
													  estate, eflags);
			break;

		case T_ModifyTable:
			result = (PlanState *) ExecInitModifyTable((ModifyTable *) node,
													   estate, eflags);
			break;

		case T_Append:
			result = (PlanState *) ExecInitAppend((Append *) node,
												  estate, eflags);
			break;

		case T_MergeAppend:
			result = (PlanState *) ExecInitMergeAppend((MergeAppend *) node,
													   estate, eflags);
			break;

		case T_RecursiveUnion:
			result = (PlanState *) ExecInitRecursiveUnion((RecursiveUnion *) node,
														  estate, eflags);
			break;

		case T_BitmapAnd:
			result = (PlanState *) ExecInitBitmapAnd((BitmapAnd *) node,
													 estate, eflags);
			break;

		case T_BitmapOr:
			result = (PlanState *) ExecInitBitmapOr((BitmapOr *) node,
													estate, eflags);
			break;

			/*
			 * scan nodes
			 */
		case T_SeqScan:
			result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
												   estate, eflags);
			break;

		case T_SampleScan:
			result = (PlanState *) ExecInitSampleScan((SampleScan *) node,
													  estate, eflags);
			break;

		case T_IndexScan:
			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
													 estate, eflags);
			break;

		case T_IndexOnlyScan:
			result = (PlanState *) ExecInitIndexOnlyScan((IndexOnlyScan *) node,
														 estate, eflags);
			break;

		case T_BitmapIndexScan:
			result = (PlanState *) ExecInitBitmapIndexScan((BitmapIndexScan *) node,
														   estate, eflags);
			break;

		case T_BitmapHeapScan:
			result = (PlanState *) ExecInitBitmapHeapScan((BitmapHeapScan *) node,
														  estate, eflags);
			break;

		case T_TidScan:
			result = (PlanState *) ExecInitTidScan((TidScan *) node,
												   estate, eflags);
			break;

		case T_SubqueryScan:
			result = (PlanState *) ExecInitSubqueryScan((SubqueryScan *) node,
														estate, eflags);
			break;

		case T_FunctionScan:
			result = (PlanState *) ExecInitFunctionScan((FunctionScan *) node,
														estate, eflags);
			break;

		case T_TableFuncScan:
			result = (PlanState *) ExecInitTableFuncScan((TableFuncScan *) node,
														 estate, eflags);
			break;

		case T_ValuesScan:
			result = (PlanState *) ExecInitValuesScan((ValuesScan *) node,
													  estate, eflags);
			break;

		case T_CteScan:
			result = (PlanState *) ExecInitCteScan((CteScan *) node,
												   estate, eflags);
			break;

Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
		case T_NamedTuplestoreScan:
			result = (PlanState *) ExecInitNamedTuplestoreScan((NamedTuplestoreScan *) node,
															   estate, eflags);
			break;

		case T_WorkTableScan:
			result = (PlanState *) ExecInitWorkTableScan((WorkTableScan *) node,
														 estate, eflags);
			break;

		case T_ForeignScan:
			result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
													   estate, eflags);
			break;

		case T_CustomScan:
			result = (PlanState *) ExecInitCustomScan((CustomScan *) node,
													  estate, eflags);
			break;

			/*
			 * join nodes
			 */
		case T_NestLoop:
			result = (PlanState *) ExecInitNestLoop((NestLoop *) node,
													estate, eflags);
			break;

		case T_MergeJoin:
			result = (PlanState *) ExecInitMergeJoin((MergeJoin *) node,
													 estate, eflags);
			break;

		case T_HashJoin:
			result = (PlanState *) ExecInitHashJoin((HashJoin *) node,
													estate, eflags);
			break;

			/*
			 * materialization nodes
			 */
		case T_Material:
			result = (PlanState *) ExecInitMaterial((Material *) node,
													estate, eflags);
			break;

		case T_Sort:
			result = (PlanState *) ExecInitSort((Sort *) node,
												estate, eflags);
			break;

		case T_IncrementalSort:
			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
														   estate, eflags);
			break;

		case T_Group:
			result = (PlanState *) ExecInitGroup((Group *) node,
												 estate, eflags);
			break;

		case T_Agg:
			result = (PlanState *) ExecInitAgg((Agg *) node,
											   estate, eflags);
			break;

		case T_WindowAgg:
			result = (PlanState *) ExecInitWindowAgg((WindowAgg *) node,
													 estate, eflags);
			break;

		case T_Unique:
			result = (PlanState *) ExecInitUnique((Unique *) node,
												  estate, eflags);
			break;

Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some design suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
|
|
|
case T_Gather:
|
|
|
|
result = (PlanState *) ExecInitGather((Gather *) node,
|
|
|
|
estate, eflags);
|
|
|
|
break;
|
|
|
|
|
2017-03-09 13:40:36 +01:00
|
|
|
case T_GatherMerge:
|
|
|
|
result = (PlanState *) ExecInitGatherMerge((GatherMerge *) node,
|
|
|
|
estate, eflags);
|
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_Hash:
|
2006-02-28 05:10:28 +01:00
|
|
|
result = (PlanState *) ExecInitHash((Hash *) node,
|
|
|
|
estate, eflags);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_SetOp:
|
2006-02-28 05:10:28 +01:00
|
|
|
result = (PlanState *) ExecInitSetOp((SetOp *) node,
|
|
|
|
estate, eflags);
|
2002-12-05 16:50:39 +01:00
|
|
|
break;
|
|
|
|
|
2009-10-12 20:10:51 +02:00
|
|
|
case T_LockRows:
|
|
|
|
result = (PlanState *) ExecInitLockRows((LockRows *) node,
|
|
|
|
estate, eflags);
|
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_Limit:
|
2006-02-28 05:10:28 +01:00
|
|
|
result = (PlanState *) ExecInitLimit((Limit *) node,
|
|
|
|
estate, eflags);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
2003-07-21 19:05:12 +02:00
|
|
|
elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
|
2002-12-05 16:50:39 +01:00
|
|
|
result = NULL; /* keep compiler quiet */
|
2001-09-18 03:59:07 +02:00
|
|
|
break;
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
1998-02-26 05:46:47 +01:00
|
|
|
|
2017-12-14 00:47:01 +01:00
|
|
|
ExecSetExecProcNode(result, result->ExecProcNode);
|
2017-07-17 09:33:49 +02:00
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Initialize any initPlans present in this node. The planner put them in
|
|
|
|
* a separate list for us.
|
2002-12-05 16:50:39 +01:00
|
|
|
*/
|
|
|
|
subps = NIL;
|
2004-05-26 06:41:50 +02:00
|
|
|
foreach(l, node->initPlan)
|
1998-02-13 04:26:53 +01:00
|
|
|
{
|
2004-05-26 06:41:50 +02:00
|
|
|
SubPlan *subplan = (SubPlan *) lfirst(l);
|
2002-12-14 01:17:59 +01:00
|
|
|
SubPlanState *sstate;
|
2002-12-05 16:50:39 +01:00
|
|
|
|
2002-12-14 01:17:59 +01:00
|
|
|
Assert(IsA(subplan, SubPlan));
|
2007-02-27 02:11:26 +01:00
|
|
|
sstate = ExecInitSubPlan(subplan, result);
|
2002-12-13 20:46:01 +01:00
|
|
|
subps = lappend(subps, sstate);
|
1998-02-13 04:26:53 +01:00
|
|
|
}
|
2002-12-05 16:50:39 +01:00
|
|
|
result->initPlan = subps;
|
|
|
|
|
|
|
|
/* Set up instrumentation for this node if requested */
|
|
|
|
if (estate->es_instrument)
|
2009-12-15 05:57:48 +01:00
|
|
|
result->instrument = InstrAlloc(1, estate->es_instrument);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
return result;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2017-12-14 00:47:01 +01:00
|
|
|
/*
|
|
|
|
* If a node wants to change its ExecProcNode function after ExecInitNode()
|
|
|
|
* has finished, it should do so with this function. That way any wrapper
|
|
|
|
* functions can be reinstalled, without the node having to know how that
|
|
|
|
* works.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Add a wrapper around the ExecProcNode callback that checks stack depth
|
2018-04-26 20:47:16 +02:00
|
|
|
* during the first execution and maybe adds an instrumentation wrapper.
|
|
|
|
* When the callback is changed after execution has already begun that
|
|
|
|
* means we'll superfluously execute ExecProcNodeFirst, but that seems ok.
|
2017-12-14 00:47:01 +01:00
|
|
|
*/
|
|
|
|
node->ExecProcNodeReal = function;
|
|
|
|
node->ExecProcNode = ExecProcNodeFirst;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
/*
|
|
|
|
* ExecProcNode wrapper that performs some one-time checks, before calling
|
|
|
|
* the relevant node method (possibly via an instrumentation wrapper).
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2017-07-17 09:33:49 +02:00
|
|
|
static TupleTableSlot *
|
|
|
|
ExecProcNodeFirst(PlanState *node)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2017-07-17 09:33:49 +02:00
|
|
|
/*
|
|
|
|
* Perform stack depth check during the first execution of the node. We
|
|
|
|
* only do so the first time round because it turns out to not be cheap on
|
2017-08-14 23:29:33 +02:00
|
|
|
* some common architectures (eg. x86). This relies on the assumption
|
|
|
|
* that ExecProcNode calls for a given plan node will always be made at
|
|
|
|
* roughly the same stack depth.
|
2017-07-17 09:33:49 +02:00
|
|
|
*/
|
|
|
|
check_stack_depth();
|
1998-02-26 05:46:47 +01:00
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
/*
|
|
|
|
* If instrumentation is required, change the wrapper to one that just
|
|
|
|
* does instrumentation. Otherwise we can dispense with all wrappers and
|
|
|
|
* have ExecProcNode() directly call the relevant function from now on.
|
|
|
|
*/
|
2001-09-18 03:59:07 +02:00
|
|
|
if (node->instrument)
|
2017-07-17 09:33:49 +02:00
|
|
|
node->ExecProcNode = ExecProcNodeInstr;
|
|
|
|
else
|
|
|
|
node->ExecProcNode = node->ExecProcNodeReal;
|
1997-09-08 04:41:22 +02:00
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
return node->ExecProcNode(node);
|
|
|
|
}
|
1997-09-08 04:41:22 +02:00
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
/*
|
|
|
|
* ExecProcNode wrapper that performs instrumentation calls. By keeping
|
|
|
|
* this a separate function, we avoid overhead in the normal case where
|
|
|
|
* no instrumentation is wanted.
|
|
|
|
*/
|
|
|
|
static TupleTableSlot *
|
|
|
|
ExecProcNodeInstr(PlanState *node)
|
|
|
|
{
|
|
|
|
TupleTableSlot *result;
|
2009-10-12 20:10:51 +02:00
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
InstrStartNode(node->instrument);
|
1997-09-08 04:41:22 +02:00
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
result = node->ExecProcNodeReal(node);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
|
2001-09-18 03:59:07 +02:00
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
return result;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
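The wrapper arrangement above (ExecSetExecProcNode, ExecProcNodeFirst, ExecProcNodeInstr) can be illustrated with a minimal standalone sketch. It is not PostgreSQL code: `DemoNode`, `demo_set_proc`, `demo_first`, `demo_instr`, `demo_counter` and `demo_run` are hypothetical names, and simple counters stand in for `check_stack_depth()` and the `InstrStartNode`/`InstrStopNode` pair. It only shows how a function pointer can replace itself after its first call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PlanState; only the fields this sketch needs. */
typedef struct DemoNode DemoNode;
typedef int (*DemoProc) (DemoNode *node);

struct DemoNode
{
	DemoProc	proc;			/* what callers invoke (may be a wrapper) */
	DemoProc	proc_real;		/* the node's actual per-call routine */
	int			instrumented;	/* nonzero if instrumentation is wanted */
	int			first_calls;	/* times the one-time wrapper ran */
	int			instr_calls;	/* times the instrumentation wrapper ran */
	int			next_value;		/* state for the demo routine */
};

static int	demo_first(DemoNode *node);
static int	demo_instr(DemoNode *node);

/* Install a routine; the one-time wrapper is always (re)armed. */
static void
demo_set_proc(DemoNode *node, DemoProc function)
{
	node->proc_real = function;
	node->proc = demo_first;
}

/* One-time wrapper: do the expensive check once, then step out of the way. */
static int
demo_first(DemoNode *node)
{
	node->first_calls++;		/* stands in for check_stack_depth() */
	node->proc = node->instrumented ? demo_instr : node->proc_real;
	return node->proc(node);
}

/* Instrumentation wrapper: only in the call path when requested. */
static int
demo_instr(DemoNode *node)
{
	node->instr_calls++;		/* stands in for InstrStartNode/InstrStopNode */
	return node->proc_real(node);
}

/* Trivial "real" routine: returns 0, 1, 2, ... */
static int
demo_counter(DemoNode *node)
{
	return node->next_value++;
}

/* Call the node three times; report how often each wrapper ran. */
static int
demo_run(int instrumented, int *instr_calls)
{
	DemoNode	node = {0};
	int			i;

	node.instrumented = instrumented;
	demo_set_proc(&node, demo_counter);
	for (i = 0; i < 3; i++)
		(void) node.proc(&node);
	*instr_calls = node.instr_calls;
	return node.first_calls;
}
```

In this sketch the uninstrumented path pays for the wrapper only on the first call, after which `proc` points straight at the real routine. That is the trade-off the comments above describe.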
|
|
|
|
|
2005-04-16 22:07:35 +02:00
|
|
|
|
|
|
|
/* ----------------------------------------------------------------
|
|
|
|
* MultiExecProcNode
|
|
|
|
*
|
|
|
|
* Execute a node that doesn't return individual tuples
|
|
|
|
* (it might return a hashtable, bitmap, etc). Caller should
|
|
|
|
* check it got back the expected kind of Node.
|
|
|
|
*
|
|
|
|
* This has essentially the same responsibilities as ExecProcNode,
|
|
|
|
* but it does not do InstrStartNode/InstrStopNode (mainly because
|
|
|
|
* it can't tell how many returned tuples to count). Each per-node
|
|
|
|
* function must provide its own instrumentation support.
|
|
|
|
* ----------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
Node *
|
|
|
|
MultiExecProcNode(PlanState *node)
|
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
Node *result;
|
2005-04-16 22:07:35 +02:00
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
check_stack_depth();
|
|
|
|
|
2005-04-16 22:07:35 +02:00
|
|
|
CHECK_FOR_INTERRUPTS();
|
|
|
|
|
|
|
|
if (node->chgParam != NULL) /* something changed */
|
2010-07-12 19:01:06 +02:00
|
|
|
ExecReScan(node); /* let ReScan handle this */
|
2005-04-16 22:07:35 +02:00
|
|
|
|
|
|
|
switch (nodeTag(node))
|
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
/*
|
|
|
|
* Only node types that actually support multiexec will be listed
|
|
|
|
*/
|
2005-04-16 22:07:35 +02:00
|
|
|
|
|
|
|
case T_HashState:
|
|
|
|
result = MultiExecHash((HashState *) node);
|
|
|
|
break;
|
|
|
|
|
2005-04-20 00:35:18 +02:00
|
|
|
case T_BitmapIndexScanState:
|
|
|
|
result = MultiExecBitmapIndexScan((BitmapIndexScanState *) node);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case T_BitmapAndState:
|
|
|
|
result = MultiExecBitmapAnd((BitmapAndState *) node);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case T_BitmapOrState:
|
|
|
|
result = MultiExecBitmapOr((BitmapOrState *) node);
|
|
|
|
break;
|
|
|
|
|
2005-04-16 22:07:35 +02:00
|
|
|
default:
|
|
|
|
elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
|
|
|
|
result = NULL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return result;
|
|
|
|
}
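The header comment for MultiExecProcNode says the caller should check that it got back the expected kind of Node. A minimal standalone sketch of that contract follows. All names here (`DemoResult`, `DemoHashTable`, `DemoBitmap`, `demo_multi_exec`, `demo_bitmap_bits`) are hypothetical and are not the PostgreSQL types. The only idea borrowed is a tag stored in the first member of every result struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result tags, playing the role of NodeTag values. */
typedef enum DemoTag
{
	DEMO_T_HASHTABLE,
	DEMO_T_BITMAP
} DemoTag;

/* Common header: every result struct starts with its tag. */
typedef struct DemoResult
{
	DemoTag		tag;
} DemoResult;

typedef struct DemoHashTable
{
	DemoResult	base;
	int			nbuckets;
} DemoHashTable;

typedef struct DemoBitmap
{
	DemoResult	base;
	unsigned long bits;
} DemoBitmap;

typedef enum DemoProducer
{
	DEMO_PRODUCE_HASH,
	DEMO_PRODUCE_BITMAP
} DemoProducer;

static DemoHashTable demo_table = {{DEMO_T_HASHTABLE}, 1024};
static DemoBitmap demo_bitmap = {{DEMO_T_BITMAP}, 0x5UL};

/* Dispatcher: returns a generic pointer whose concrete type varies. */
static DemoResult *
demo_multi_exec(DemoProducer kind)
{
	switch (kind)
	{
		case DEMO_PRODUCE_HASH:
			return &demo_table.base;
		case DEMO_PRODUCE_BITMAP:
			return &demo_bitmap.base;
	}
	return NULL;
}

/* Caller side: verify the tag before downcasting; -1 means "wrong kind". */
static long
demo_bitmap_bits(DemoResult *result)
{
	if (result == NULL || result->tag != DEMO_T_BITMAP)
		return -1;
	return (long) ((DemoBitmap *) result)->bits;
}
```

The downcast is valid C because `base` is the first member of each struct. The tag check is what keeps a caller that expects a bitmap from misreading a hash table.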
|
|
|
|
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/* ----------------------------------------------------------------
|
|
|
|
* ExecEndNode
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
1997-09-07 07:04:48 +02:00
|
|
|
* Recursively cleans up all the nodes in the plan rooted
|
|
|
|
* at 'node'.
|
|
|
|
*
|
2009-10-10 03:43:50 +02:00
|
|
|
* After this operation, the query plan will not be able to be
|
2014-05-06 18:12:18 +02:00
|
|
|
* processed any further. This should be called only after
|
1997-09-07 07:04:48 +02:00
|
|
|
* the query plan has been fully executed.
|
|
|
|
* ----------------------------------------------------------------
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
|
|
|
void
|
2003-08-08 23:42:59 +02:00
|
|
|
ExecEndNode(PlanState *node)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2001-03-22 07:16:21 +01:00
|
|
|
/*
|
|
|
|
* Do nothing when we get to the end of a leaf of the tree.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
1997-09-07 07:04:48 +02:00
|
|
|
if (node == NULL)
|
|
|
|
return;
|
1998-02-26 05:46:47 +01:00
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
/*
|
|
|
|
* Make sure there's enough stack available. Need to check here, in
|
|
|
|
* addition to ExecProcNode() (via ExecProcNodeFirst()), because it's not
|
|
|
|
* guaranteed that ExecProcNode() is reached for all nodes.
|
|
|
|
*/
|
|
|
|
check_stack_depth();
|
|
|
|
|
2003-02-09 01:30:41 +01:00
|
|
|
if (node->chgParam != NULL)
|
1998-02-13 04:26:53 +01:00
|
|
|
{
|
2003-02-09 01:30:41 +01:00
|
|
|
bms_free(node->chgParam);
|
|
|
|
node->chgParam = NULL;
|
1998-02-13 04:26:53 +01:00
|
|
|
}
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
switch (nodeTag(node))
|
|
|
|
{
|
2001-10-25 07:50:21 +02:00
|
|
|
/*
|
|
|
|
* control nodes
|
|
|
|
*/
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_ResultState:
|
|
|
|
ExecEndResult((ResultState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
Move targetlist SRF handling from expression evaluation to new executor node.
Evaluation of set returning functions (SRFs) in the targetlist (like SELECT
generate_series(1,5)) so far was done in the expression evaluation (i.e.
ExecEvalExpr()) and projection (i.e. ExecProject/ExecTargetList) code.
This meant that most executor nodes performing projection, and most
expression evaluation functions, had to deal with the possibility that an
evaluated expression could return a set of return values.
That's bad because it leads to repeated code in a lot of places. It also,
and that's my (Andres's) motivation, made it a lot harder to implement a
more efficient way of doing expression evaluation.
To fix this, introduce a new executor node (ProjectSet) that can evaluate
targetlists containing one or more SRFs. To avoid the complexity of the old
way of handling nested expressions returning sets (e.g. having to pass up
ExprDoneCond, and dealing with arguments to functions returning sets etc.),
those SRFs can only be at the top level of the node's targetlist. The
planner makes sure (via split_pathtarget_at_srfs()) that SRF evaluation is
only necessary in ProjectSet nodes and that SRFs are only present at the
top level of the node's targetlist. If there are nested SRFs the planner
creates multiple stacked ProjectSet nodes. The ProjectSet nodes always get
input from an underlying node.
We also discussed and prototyped evaluating targetlist SRFs using ROWS
FROM(), but that turned out to be more complicated than we'd hoped.
While moving SRF evaluation to ProjectSet would allow us to retain the old
"least common multiple" behavior when multiple SRFs are present in one
targetlist (i.e. continue returning rows until all SRFs are at the end of
their input at the same time), we decided to instead only return rows till
all SRFs are exhausted, returning NULL for already exhausted ones. We
deemed the previous behavior to be too confusing, unexpected and actually
not particularly useful.
As a side effect, the previously prohibited case of multiple set returning
arguments to a function, is now allowed. Not because it's particularly
desirable, but because it ends up working and there seems to be no argument
for adding code to prohibit it.
Currently the behavior for COALESCE and CASE containing SRFs has changed,
returning multiple rows from the expression, even when the SRF containing
"arm" of the expression is not evaluated. That's because the SRFs are
evaluated in a separate ProjectSet node. As that's quite confusing, we're
likely to instead prohibit SRFs in those places. But that's still being
discussed, and the code would reside in places not touched here, so that's
a task for later.
There's a lot of, now superfluous, code dealing with set return expressions
around. But as the changes to get rid of those are verbose and largely boring,
it seems better for readability to keep the cleanup as a separate commit.
Author: Tom Lane and Andres Freund
Discussion: https://postgr.es/m/20160822214023.aaxz5l4igypowyri@alap3.anarazel.de
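The row-production rule described in the commit message above (keep returning rows until all SRFs are exhausted, with NULL for those already exhausted) can be sketched in standalone C. This is not the ProjectSet implementation. `DemoSeries`, `DemoCell`, `demo_series_next`, `demo_project_row` and `demo_count_rows` are hypothetical names, and each series simply counts from 1 up to a stop value, loosely modelled on generate_series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical set-returning source: yields next, next+1, ..., stop. */
typedef struct DemoSeries
{
	int			next;
	int			stop;
} DemoSeries;

/* One output column value, with an explicit NULL flag. */
typedef struct DemoCell
{
	bool		isnull;
	int			value;
} DemoCell;

/* Fetch the next value; an exhausted series yields NULL and returns false. */
static bool
demo_series_next(DemoSeries *s, DemoCell *out)
{
	if (s->next > s->stop)
	{
		out->isnull = true;
		out->value = 0;
		return false;
	}
	out->isnull = false;
	out->value = s->next++;
	return true;
}

/* Build one row; returns false once every series is exhausted. */
static bool
demo_project_row(DemoSeries *series, size_t nseries, DemoCell *row)
{
	bool		any = false;
	size_t		i;

	for (i = 0; i < nseries; i++)
	{
		if (demo_series_next(&series[i], &row[i]))
			any = true;
	}
	return any;
}

/* Count rows and NULL cells for two series running 1..stop_a and 1..stop_b. */
static int
demo_count_rows(int stop_a, int stop_b, int *null_cells)
{
	DemoSeries	series[2] = {{1, stop_a}, {1, stop_b}};
	DemoCell	row[2];
	int			rows = 0;
	size_t		i;

	*null_cells = 0;
	while (demo_project_row(series, 2, row))
	{
		rows++;
		for (i = 0; i < 2; i++)
		{
			if (row[i].isnull)
				(*null_cells)++;
		}
	}
	return rows;
}
```

With series of length 2 and 3, this sketch produces 3 rows with one NULL cell. The older "least common multiple" behaviour mentioned above would have produced 6 rows.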
2017-01-18 21:46:50 +01:00
|
|
|
case T_ProjectSetState:
|
|
|
|
ExecEndProjectSet((ProjectSetState *) node);
|
|
|
|
break;
|
|
|
|
|
2009-10-10 03:43:50 +02:00
|
|
|
case T_ModifyTableState:
|
|
|
|
ExecEndModifyTable((ModifyTableState *) node);
|
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_AppendState:
|
|
|
|
ExecEndAppend((AppendState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
2010-10-14 22:56:39 +02:00
|
|
|
case T_MergeAppendState:
|
|
|
|
ExecEndMergeAppend((MergeAppendState *) node);
|
|
|
|
break;
|
|
|
|
|
2008-10-04 23:56:55 +02:00
|
|
|
case T_RecursiveUnionState:
|
|
|
|
ExecEndRecursiveUnion((RecursiveUnionState *) node);
|
|
|
|
break;
|
|
|
|
|
2005-04-20 00:35:18 +02:00
|
|
|
case T_BitmapAndState:
|
|
|
|
ExecEndBitmapAnd((BitmapAndState *) node);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case T_BitmapOrState:
|
|
|
|
ExecEndBitmapOr((BitmapOrState *) node);
|
|
|
|
break;
|
|
|
|
|
2001-03-22 07:16:21 +01:00
|
|
|
/*
|
|
|
|
* scan nodes
|
1997-09-08 04:41:22 +02:00
|
|
|
*/
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_SeqScanState:
|
|
|
|
ExecEndSeqScan((SeqScanState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
2015-05-15 20:37:10 +02:00
|
|
|
case T_SampleScanState:
|
|
|
|
ExecEndSampleScan((SampleScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2015-10-01 01:23:36 +02:00
|
|
|
case T_GatherState:
|
|
|
|
ExecEndGather((GatherState *) node);
|
|
|
|
break;
|
|
|
|
|
2017-03-09 13:40:36 +01:00
|
|
|
case T_GatherMergeState:
|
|
|
|
ExecEndGatherMerge((GatherMergeState *) node);
|
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_IndexScanState:
|
|
|
|
ExecEndIndexScan((IndexScanState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
2011-10-11 20:20:06 +02:00
|
|
|
case T_IndexOnlyScanState:
|
|
|
|
ExecEndIndexOnlyScan((IndexOnlyScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2005-04-20 00:35:18 +02:00
|
|
|
case T_BitmapIndexScanState:
|
|
|
|
ExecEndBitmapIndexScan((BitmapIndexScanState *) node);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case T_BitmapHeapScanState:
|
|
|
|
ExecEndBitmapHeapScan((BitmapHeapScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_TidScanState:
|
|
|
|
ExecEndTidScan((TidScanState *) node);
|
2000-09-29 20:21:41 +02:00
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_SubqueryScanState:
|
|
|
|
ExecEndSubqueryScan((SubqueryScanState *) node);
|
2000-09-29 20:21:41 +02:00
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_FunctionScanState:
|
|
|
|
ExecEndFunctionScan((FunctionScanState *) node);
|
2002-05-12 22:10:05 +02:00
|
|
|
break;
|
|
|
|
|
2017-03-08 16:39:37 +01:00
|
|
|
case T_TableFuncScanState:
|
|
|
|
ExecEndTableFuncScan((TableFuncScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2006-08-02 03:59:48 +02:00
|
|
|
case T_ValuesScanState:
|
|
|
|
ExecEndValuesScan((ValuesScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2008-10-04 23:56:55 +02:00
|
|
|
case T_CteScanState:
|
|
|
|
ExecEndCteScan((CteScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2017-04-01 06:17:18 +02:00
|
|
|
case T_NamedTuplestoreScanState:
|
|
|
|
ExecEndNamedTuplestoreScan((NamedTuplestoreScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2008-10-04 23:56:55 +02:00
|
|
|
case T_WorkTableScanState:
|
|
|
|
ExecEndWorkTableScan((WorkTableScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2011-02-20 06:17:18 +01:00
|
|
|
case T_ForeignScanState:
|
|
|
|
ExecEndForeignScan((ForeignScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2014-11-07 23:26:02 +01:00
|
|
|
case T_CustomScanState:
|
|
|
|
ExecEndCustomScan((CustomScanState *) node);
|
|
|
|
break;
|
|
|
|
|
2001-03-22 07:16:21 +01:00
|
|
|
/*
|
|
|
|
* join nodes
|
1997-09-08 04:41:22 +02:00
|
|
|
*/
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_NestLoopState:
|
|
|
|
ExecEndNestLoop((NestLoopState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_MergeJoinState:
|
|
|
|
ExecEndMergeJoin((MergeJoinState *) node);
|
2000-09-29 20:21:41 +02:00
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_HashJoinState:
|
|
|
|
ExecEndHashJoin((HashJoinState *) node);
|
2000-09-29 20:21:41 +02:00
|
|
|
break;
|
|
|
|
|
2001-03-22 07:16:21 +01:00
|
|
|
/*
|
|
|
|
* materialization nodes
|
1997-09-08 04:41:22 +02:00
|
|
|
*/
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_MaterialState:
|
|
|
|
ExecEndMaterial((MaterialState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_SortState:
|
|
|
|
ExecEndSort((SortState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produces results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without checking equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparisons while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
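The prefix-group idea in the commit message above can be sketched in standalone C. This covers only the second mode it describes (collect one prefix group, then sort by the remaining key). It is not the nodeIncrementalSort.c implementation: `DemoRow`, `demo_cmp_key2`, `demo_incremental_sort` and `demo_is_sorted` are hypothetical names, and the input is assumed to arrive already ordered by `key1`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical two-column row; key1 is the presorted prefix. */
typedef struct DemoRow
{
	int			key1;
	int			key2;
} DemoRow;

/* Compare on the non-prefix key only. */
static int
demo_cmp_key2(const void *a, const void *b)
{
	const DemoRow *ra = (const DemoRow *) a;
	const DemoRow *rb = (const DemoRow *) b;

	return (ra->key2 > rb->key2) - (ra->key2 < rb->key2);
}

/*
 * Rows must already be ordered by key1.  Each run of equal key1 values is
 * sorted on key2 alone, so only one group is ever being sorted at a time.
 */
static void
demo_incremental_sort(DemoRow *rows, size_t nrows)
{
	size_t		start = 0;

	while (start < nrows)
	{
		size_t		end = start + 1;

		while (end < nrows && rows[end].key1 == rows[start].key1)
			end++;
		qsort(&rows[start], end - start, sizeof(DemoRow), demo_cmp_key2);
		start = end;
	}
}

/* Check that the array is ordered by (key1, key2). */
static int
demo_is_sorted(const DemoRow *rows, size_t nrows)
{
	size_t		i;

	for (i = 1; i < nrows; i++)
	{
		if (rows[i - 1].key1 > rows[i].key1)
			return 0;
		if (rows[i - 1].key1 == rows[i].key1 &&
			rows[i - 1].key2 > rows[i].key2)
			return 0;
	}
	return 1;
}
```

Because each group is finished before the next is read, a consumer could receive the first group's rows without waiting for the whole input. That is the startup-cost benefit the commit message mentions.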
2020-04-06 21:33:28 +02:00
|
|
|
case T_IncrementalSortState:
|
|
|
|
ExecEndIncrementalSort((IncrementalSortState *) node);
|
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_GroupState:
|
|
|
|
ExecEndGroup((GroupState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_AggState:
|
|
|
|
ExecEndAgg((AggState *) node);
|
2000-10-05 21:11:39 +02:00
|
|
|
break;
|
|
|
|
|
2008-12-28 19:54:01 +01:00
|
|
|
case T_WindowAggState:
|
|
|
|
ExecEndWindowAgg((WindowAggState *) node);
|
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_UniqueState:
|
|
|
|
ExecEndUnique((UniqueState *) node);
|
2000-10-26 23:38:24 +02:00
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_HashState:
|
|
|
|
ExecEndHash((HashState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_SetOpState:
|
|
|
|
ExecEndSetOp((SetOpState *) node);
|
|
|
|
break;
|
|
|
|
|
2009-10-12 20:10:51 +02:00
|
|
|
case T_LockRowsState:
|
|
|
|
ExecEndLockRows((LockRowsState *) node);
|
|
|
|
break;
|
|
|
|
|
2002-12-05 16:50:39 +01:00
|
|
|
case T_LimitState:
|
|
|
|
ExecEndLimit((LimitState *) node);
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
2003-07-21 19:05:12 +02:00
|
|
|
elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
|
1997-09-08 04:41:22 +02:00
|
|
|
break;
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
2015-10-01 01:23:36 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* ExecShutdownNode
|
|
|
|
*
|
|
|
|
* Give execution nodes a chance to stop asynchronous resource consumption
|
2018-08-13 06:34:39 +02:00
|
|
|
* and release any resources still held.
|
2015-10-01 01:23:36 +02:00
|
|
|
*/
|
|
|
|
bool
|
|
|
|
ExecShutdownNode(PlanState *node)
|
|
|
|
{
|
|
|
|
if (node == NULL)
|
|
|
|
return false;
|
|
|
|
|
2017-07-17 09:33:49 +02:00
|
|
|
check_stack_depth();
|
|
|
|
|
2017-02-22 03:29:27 +01:00
|
|
|
planstate_tree_walker(node, ExecShutdownNode, NULL);
|
|
|
|
|
2018-08-03 07:32:02 +02:00
|
|
|
/*
|
|
|
|
* Treat the node as running while we shut it down, but only if it's run
|
|
|
|
* at least once already. We don't expect much CPU consumption during
|
|
|
|
* node shutdown, but in the case of Gather or Gather Merge, we may shut
|
|
|
|
* down workers at this stage. If so, their buffer usage will get
|
|
|
|
* propagated into pgBufferUsage at this point, and we want to make sure
|
|
|
|
* that it gets associated with the Gather node. We skip this if the node
|
|
|
|
* has never been executed, so as to avoid incorrectly making it appear
|
|
|
|
* that it has.
|
|
|
|
*/
|
|
|
|
if (node->instrument && node->instrument->running)
|
|
|
|
InstrStartNode(node->instrument);
|
|
|
|
|
2015-10-01 01:23:36 +02:00
|
|
|
switch (nodeTag(node))
|
|
|
|
{
|
|
|
|
case T_GatherState:
|
2015-10-22 16:49:20 +02:00
|
|
|
ExecShutdownGather((GatherState *) node);
|
2015-10-01 01:23:36 +02:00
|
|
|
break;
|
2017-02-26 09:06:49 +01:00
|
|
|
case T_ForeignScanState:
|
|
|
|
ExecShutdownForeignScan((ForeignScanState *) node);
|
|
|
|
break;
|
|
|
|
case T_CustomScanState:
|
|
|
|
ExecShutdownCustomScan((CustomScanState *) node);
|
|
|
|
break;
|
2017-03-09 13:40:36 +01:00
|
|
|
case T_GatherMergeState:
|
|
|
|
ExecShutdownGatherMerge((GatherMergeState *) node);
|
|
|
|
break;
|
2017-12-05 19:55:56 +01:00
|
|
|
case T_HashState:
|
|
|
|
ExecShutdownHash((HashState *) node);
|
|
|
|
break;
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
case T_HashJoinState:
|
|
|
|
ExecShutdownHashJoin((HashJoinState *) node);
|
|
|
|
break;
|
Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some design suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
		default:
			break;
	}

2018-08-03 07:32:02 +02:00
	/* Stop the node if we started it above, reporting 0 tuples. */
	if (node->instrument && node->instrument->running)
		InstrStopNode(node->instrument, 0);

2017-02-22 03:29:27 +01:00
	return false;
}
2017-08-29 19:12:23 +02:00

/*
 * ExecSetTupleBound
 *
 * Set a tuple bound for a planstate node. This lets child plan nodes
 * optimize based on the knowledge that the maximum number of tuples that
 * their parent will demand is limited. The tuple bound for a node may
 * only be changed between scans (i.e., after node initialization or just
 * before an ExecReScan call).
 *
 * Any negative tuples_needed value means "no limit", which should be the
 * default assumption when this is not called at all for a particular node.
 *
 * Note: if this is called repeatedly on a plan tree, the exact same set
 * of nodes must be updated with the new limit each time; be careful that
 * only unchanging conditions are tested here.
 */
void
ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
{
	/*
	 * Since this function recurses, in principle we should check stack depth
	 * here. In practice, it's probably pointless since the earlier node
	 * initialization tree traversal would surely have consumed more stack.
	 */

	if (IsA(child_node, SortState))
	{
		/*
		 * If it is a Sort node, notify it that it can use bounded sort.
		 *
		 * Note: it is the responsibility of nodeSort.c to react properly to
		 * changes of these parameters. If we ever redesign this, it'd be a
		 * good idea to integrate this signaling with the parameter-change
		 * mechanism.
		 */
		SortState  *sortState = (SortState *) child_node;
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produces results after each
prefix group, which is beneficial for plans where startup cost matters
(for example, queries with a LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without checking equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparisons while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
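The second mode described in the commit message above (collect one prefix group, then sort it on the remaining keys only) can be sketched in isolation. This is a minimal illustration, not the code in nodeIncrementalSort.c: the Row type and the same_prefix, cmp_suffix, and incremental_sort names are hypothetical, and the input is assumed to arrive already sorted by (key1, key2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type; input is assumed presorted by (key1, key2). */
typedef struct Row
{
	int			key1;
	int			key2;
	int			key3;
} Row;

/* Two rows belong to the same prefix group if the presorted keys match. */
static int
same_prefix(const Row *a, const Row *b)
{
	return a->key1 == b->key1 && a->key2 == b->key2;
}

/* Within a group only the remaining (non-prefix) key needs comparing. */
static int
cmp_suffix(const void *pa, const void *pb)
{
	const Row  *a = pa;
	const Row  *b = pb;

	return (a->key3 > b->key3) - (a->key3 < b->key3);
}

/*
 * Sort rows in place by (key1, key2, key3), one prefix group at a time.
 * Returns the number of prefix groups that were sorted.
 */
static size_t
incremental_sort(Row *rows, size_t nrows)
{
	size_t		start = 0;
	size_t		ngroups = 0;

	while (start < nrows)
	{
		size_t		end = start + 1;

		/* Extend the group while the prefix keys stay equal. */
		while (end < nrows && same_prefix(&rows[start], &rows[end]))
			end++;

		/* Only this group is sorted, and only on key3. */
		qsort(&rows[start], end - start, sizeof(Row), cmp_suffix);
		ngroups++;
		start = end;
	}
	return ngroups;
}
```

Because each group is complete as soon as the prefix changes, a caller could emit the group's rows before reading any further input, which is where the lower startup cost comes from.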

		if (tuples_needed < 0)
		{
			/* make sure flag gets reset if needed upon rescan */
			sortState->bounded = false;
		}
		else
		{
			sortState->bounded = true;
			sortState->bound = tuples_needed;
		}
	}
	else if (IsA(child_node, IncrementalSortState))
	{
		/*
		 * If it is an IncrementalSort node, notify it that it can use bounded
		 * sort.
		 *
		 * Note: it is the responsibility of nodeIncrementalSort.c to react
		 * properly to changes of these parameters. If we ever redesign this,
		 * it'd be a good idea to integrate this signaling with the
		 * parameter-change mechanism.
		 */
		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
2017-08-29 19:12:23 +02:00

		if (tuples_needed < 0)
		{
			/* make sure flag gets reset if needed upon rescan */
			sortState->bounded = false;
		}
		else
		{
			sortState->bounded = true;
			sortState->bound = tuples_needed;
		}
	}
Use Append rather than MergeAppend for scanning ordered partitions.
If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results. This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after starting up just
the first child node, not all of them.
However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child. Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would). This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.
David Rowley, reviewed by Julien Rouhaud and Antonin Houska
Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
2019-04-06 01:20:30 +02:00
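The two claims in the commit message above can be illustrated with a small sketch: when every value in one child sorts before every value in the next, plain concatenation gives the same output as a merge, and the workaround startup cost is just the sum over the children. This is not planner or executor code; merge_two, append_two, and ordered_append_startup_cost are hypothetical names, and plain int arrays stand in for per-partition scans.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* MergeAppend-style: compare the heads of both inputs on every step. */
static size_t
merge_two(const int *a, size_t na, const int *b, size_t nb, int *out)
{
	size_t		i = 0;
	size_t		j = 0;
	size_t		k = 0;

	while (i < na && j < nb)
		out[k++] = (b[j] < a[i]) ? b[j++] : a[i++];
	while (i < na)
		out[k++] = a[i++];
	while (j < nb)
		out[k++] = b[j++];
	return k;
}

/* Append-style: emit the first input completely, then the second. */
static size_t
append_two(const int *a, size_t na, const int *b, size_t nb, int *out)
{
	memcpy(out, a, na * sizeof(int));
	memcpy(out + na, b, nb * sizeof(int));
	return na + nb;
}

/*
 * The conservative estimate described above: charge the ordered Append the
 * sum of its children's startup costs, as MergeAppend would.
 */
static double
ordered_append_startup_cost(const double *child_startup, size_t nchildren)
{
	double		total = 0.0;
	size_t		i;

	for (i = 0; i < nchildren; i++)
		total += child_startup[i];
	return total;
}
```

With children whose startup cost is zero, as the commit message expects for IndexScans, the summed estimate is also zero, so the conservative model costs nothing in the case that matters most.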
	else if (IsA(child_node, AppendState))
	{
		/*
		 * If it is an Append, we can apply the bound to any nodes that are
		 * children of the Append, since the Append surely need read no more
		 * than that many tuples from any one input.
		 */
		AppendState *aState = (AppendState *) child_node;
		int			i;

		for (i = 0; i < aState->as_nplans; i++)
			ExecSetTupleBound(tuples_needed, aState->appendplans[i]);
	}
2017-08-29 19:12:23 +02:00
	else if (IsA(child_node, MergeAppendState))
	{
		/*
		 * If it is a MergeAppend, we can apply the bound to any nodes that
		 * are children of the MergeAppend, since the MergeAppend surely need
		 * read no more than that many tuples from any one input.
		 */
		MergeAppendState *maState = (MergeAppendState *) child_node;
		int			i;

		for (i = 0; i < maState->ms_nplans; i++)
			ExecSetTupleBound(tuples_needed, maState->mergeplans[i]);
	}
	else if (IsA(child_node, ResultState))
	{
		/*
		 * Similarly, for a projecting Result, we can apply the bound to its
		 * child node.
		 *
		 * If Result supported qual checking, we'd have to punt on seeing a
		 * qual. Note that having a resconstantqual is not a showstopper: if
		 * that condition succeeds it affects nothing, while if it fails, no
		 * rows will be demanded from the Result child anyway.
		 */
		if (outerPlanState(child_node))
			ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
	}
	else if (IsA(child_node, SubqueryScanState))
	{
		/*
		 * We can also descend through SubqueryScan, but only if it has no
		 * qual (otherwise it might discard rows).
		 */
		SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;

		if (subqueryState->ss.ps.qual == NULL)
			ExecSetTupleBound(tuples_needed, subqueryState->subplan);
	}
	else if (IsA(child_node, GatherState))
	{
		/*
		 * A Gather node can propagate the bound to its workers. As with
		 * MergeAppend, no one worker could possibly need to return more
		 * tuples than the Gather itself needs to.
		 *
		 * Note: As with Sort, the Gather node is responsible for reacting
		 * properly to changes to this parameter.
		 */
		GatherState *gstate = (GatherState *) child_node;

		gstate->tuples_needed = tuples_needed;

		/* Also pass down the bound to our own copy of the child plan */
		ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
	}
	else if (IsA(child_node, GatherMergeState))
	{
		/* Same comments as for Gather */
		GatherMergeState *gstate = (GatherMergeState *) child_node;

		gstate->tuples_needed = tuples_needed;

		ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
	}

	/*
	 * In principle we could descend through any plan node type that is
	 * certain not to discard or combine input rows; but on seeing a node that
	 * can do that, we can't propagate the bound any further. For the moment
	 * it's unclear that any other cases are worth checking here.
	 */
}
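The propagation rule that ExecSetTupleBound follows can be tried outside the executor with a toy plan tree. The sketch below is only an illustration under stated assumptions: ToyNode, ToyKind, and toy_set_tuple_bound are hypothetical stand-ins for PlanState and the real node types, and TOY_FILTER represents any node that may discard or combine rows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node kinds standing in for the executor's node tags. */
typedef enum ToyKind
{
	TOY_SORT,					/* consumes the bound, like Sort */
	TOY_APPEND,					/* passes the bound to every child */
	TOY_RESULT,					/* passes the bound to its child */
	TOY_FILTER					/* may discard rows, so propagation stops */
} ToyKind;

typedef struct ToyNode
{
	ToyKind		kind;
	struct ToyNode *children[4];
	size_t		nchildren;
	bool		bounded;		/* only meaningful for TOY_SORT */
	int64_t		bound;			/* only meaningful for TOY_SORT */
} ToyNode;

/* A negative tuples_needed means "no limit", as in the real function. */
static void
toy_set_tuple_bound(int64_t tuples_needed, ToyNode *node)
{
	size_t		i;

	switch (node->kind)
	{
		case TOY_SORT:
			if (tuples_needed < 0)
				node->bounded = false;	/* reset the flag on rescan */
			else
			{
				node->bounded = true;
				node->bound = tuples_needed;
			}
			break;
		case TOY_APPEND:
		case TOY_RESULT:
			/* No input needs to supply more rows than the parent wants. */
			for (i = 0; i < node->nchildren; i++)
				toy_set_tuple_bound(tuples_needed, node->children[i]);
			break;
		case TOY_FILTER:
			/* Rows may be dropped here, so the bound cannot go further. */
			break;
	}
}
```

Calling toy_set_tuple_bound(10, root) on a Result over an Append marks every Sort directly under the Append as bounded at 10, while a Sort hidden beneath a filter node is left untouched; calling it again with -1 clears the flags on the same set of nodes.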