postgresql/src/backend/executor
Robert Haas 61c2e1a95f Improve access to parallel query from procedural languages.
In SQL, the ability to use parallel query was previously contingent on
fcache->readonly_func, which is only set for non-volatile functions;
but the volatility of a function has no bearing on whether queries
inside it can use parallelism.  Remove that condition.

SPI_execute and SPI_execute_with_args always run the plan just once,
though not necessarily to completion.  Given the changes in commit
691b8d5928, it's sensible to pass
CURSOR_OPT_PARALLEL_OK here, so do that.  This improves access to
parallelism for any caller that uses these functions to execute
queries.  Such callers include plperl, plpython, pltcl, and plpgsql,
though not all of them use these functions exclusively.
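
A minimal, hedged sketch of what this enables (the helper and table names
below are hypothetical, not from the patch): a PL handler that runs a
one-shot query through SPI_execute may now get a parallel plan for it,
because the plan is created with CURSOR_OPT_PARALLEL_OK:

	#include "postgres.h"
	#include "executor/spi.h"

	static void
	run_query_from_pl(void)
	{
		if (SPI_connect() != SPI_OK_CONNECT)
			elog(ERROR, "SPI_connect failed");

		/* the plan is run exactly once, so a parallel plan is acceptable */
		if (SPI_execute("SELECT count(*) FROM big_table", false, 0) != SPI_OK_SELECT)
			elog(ERROR, "SPI_execute failed");

		elog(NOTICE, "query returned " UINT64_FORMAT " rows", SPI_processed);

		SPI_finish();
	}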

In plpgsql, allow parallel query for plain SELECT queries (as
opposed to PERFORM, which already worked) and for plain expressions
(which probably won't go through the executor at all, because they
will likely be simple expressions, but if they do then this helps).

Rafia Sabih and Robert Haas, reviewed by Dilip Kumar and Amit Kapila

Discussion: http://postgr.es/m/CAOGQiiMfJ+4SQwgG=6CVHWoisiU0+7jtXSuiyXBM3y=A=eJzmg@mail.gmail.com
2017-03-24 14:46:33 -04:00
Makefile Add a Gather Merge executor node. 2017-03-09 07:49:29 -05:00
README Fix executor/README to reflect disallowing SRFs in UPDATE. 2016-09-13 14:25:35 -04:00
execAmi.c Support XMLTABLE query expression 2017-03-08 12:40:26 -03:00
execCurrent.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
execGrouping.c simplehash: Additional tweaks to make specifying an allocator work. 2017-02-09 14:59:57 -05:00
execIndexing.c Allow index AMs to cache data across aminsert calls within a SQL command. 2017-02-09 11:52:12 -05:00
execJunk.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
execMain.c Allow for parallel execution whenever ExecutorRun() is done only once. 2017-03-23 13:14:36 -04:00
execParallel.c Allow for parallel execution whenever ExecutorRun() is done only once. 2017-03-23 13:14:36 -04:00
execProcnode.c Add a Gather Merge executor node. 2017-03-09 07:49:29 -05:00
execQual.c Use wrappers of PG_DETOAST_DATUM_PACKED() more. 2017-03-12 19:35:34 -04:00
execReplication.c Use proper enum constants for LockWaitPolicy 2017-02-28 13:28:17 -05:00
execScan.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
execTuples.c Use the new castNode() macro in a number of places. 2017-01-26 16:47:03 -08:00
execUtils.c Don't scan partitioned tables. 2017-03-21 09:48:04 -04:00
functions.c Improve access to parallel query from procedural languages. 2017-03-24 14:46:33 -04:00
instrument.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
nodeAgg.c Spelling fixes 2017-03-14 12:58:39 -04:00
nodeAppend.c Don't scan partitioned tables. 2017-03-21 09:48:04 -04:00
nodeBitmapAnd.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
nodeBitmapHeapscan.c Silence compiler warnings in BitmapHeapNext(). 2017-03-08 12:43:39 -05:00
nodeBitmapIndexscan.c Support parallel bitmap heap scans. 2017-03-08 12:05:43 -05:00
nodeBitmapOr.c Support parallel bitmap heap scans. 2017-03-08 12:05:43 -05:00
nodeCtescan.c Use the new castNode() macro in a number of places. 2017-01-26 16:47:03 -08:00
nodeCustom.c Allow custom and foreign scans to have shutdown callbacks. 2017-02-26 13:41:12 +05:30
nodeForeignscan.c Allow custom and foreign scans to have shutdown callbacks. 2017-02-26 13:41:12 +05:30
nodeFunctionscan.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeGather.c Shut down Gather's children before shutting down Gather itself. 2017-02-22 08:08:07 +05:30
nodeGatherMerge.c Remove dead code in nodeGatherMerge.c. 2017-03-12 15:52:50 -04:00
nodeGroup.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeHash.c Make sure that hash join's bulk-tuple-transfer loops are interruptible. 2017-02-15 16:40:05 -05:00
nodeHashjoin.c Make sure that hash join's bulk-tuple-transfer loops are interruptible. 2017-02-15 16:40:05 -05:00
nodeIndexonlyscan.c Fix parallel index and index-only scans to fall back to serial. 2017-03-08 08:15:24 -05:00
nodeIndexscan.c Fix parallel index and index-only scans to fall back to serial. 2017-03-08 08:15:24 -05:00
nodeLimit.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeLockRows.c Use the new castNode() macro in a number of places. 2017-01-26 16:47:03 -08:00
nodeMaterial.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
nodeMergeAppend.c Don't scan partitioned tables. 2017-03-21 09:48:04 -04:00
nodeMergejoin.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeModifyTable.c Don't scan partitioned tables. 2017-03-21 09:48:04 -04:00
nodeNestloop.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeProjectSet.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeRecursiveunion.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
nodeResult.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeSamplescan.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeSeqscan.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeSetOp.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
nodeSort.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
nodeSubplan.c Use the new castNode() macro in a number of places. 2017-01-26 16:47:03 -08:00
nodeSubqueryscan.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeTableFuncscan.c Support XMLTABLE query expression 2017-03-08 12:40:26 -03:00
nodeTidscan.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeUnique.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
nodeValuesscan.c Remove obsoleted code relating to targetlist SRF evaluation. 2017-01-19 14:40:41 -08:00
nodeWindowAgg.c Fix typos in comments. 2017-02-06 11:33:58 +02:00
nodeWorktablescan.c Use the new castNode() macro in a number of places. 2017-01-26 16:47:03 -08:00
spi.c Improve access to parallel query from procedural languages. 2017-03-24 14:46:33 -04:00
tqueue.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
tstoreReceiver.c Update copyright via script for 2017 2017-01-03 13:48:53 -05:00

src/backend/executor/README

The Postgres Executor
=====================

The executor processes a tree of "plan nodes".  The plan tree is essentially
a demand-pull pipeline of tuple processing operations.  Each node, when
called, will produce the next tuple in its output sequence, or NULL if no
more tuples are available.  If the node is not a primitive relation-scanning
node, it will have child node(s) that it calls in turn to obtain input
tuples.
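
As a concrete illustration of the protocol (a minimal sketch, not actual
backend code; the helper name is hypothetical), a caller drives the top
node like this:

	#include "postgres.h"
	#include "executor/executor.h"
	#include "executor/tuptable.h"

	static void
	consume_all_tuples(PlanState *topnode)
	{
		for (;;)
		{
			TupleTableSlot *slot = ExecProcNode(topnode);

			if (TupIsNull(slot))	/* no more tuples available */
				break;

			/* ... process the tuple held in "slot" ... */
		}
	}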

Refinements on this basic model include:

* Choice of scan direction (forwards or backwards).  Caution: this is not
currently well-supported.  It works for primitive scan nodes, but not very
well for joins, aggregates, etc.

* Rescan command to reset a node and make it generate its output sequence
over again.

* Parameters that can alter a node's results.  After adjusting a parameter,
the rescan command must be applied to that node and all nodes above it.
There is a moderately intelligent scheme to avoid rescanning nodes
unnecessarily (for example, Sort does not rescan its input if no parameters
of the input have changed, since it can just reread its stored sorted data).
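
For example (a fragmentary sketch; variable names are hypothetical), a
caller that has just changed a parameter rescans the node depending on it
before pulling tuples again:

	/* ... the new parameter value has been installed ... */
	ExecReScan(depnode);			/* reset depnode so it starts over */

	slot = ExecProcNode(depnode);	/* first tuple of the regenerated output */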

For a SELECT, it is only necessary to deliver the top-level result tuples
to the client.  For INSERT/UPDATE/DELETE, the actual table modification
operations happen in a top-level ModifyTable plan node.  If the query
includes a RETURNING clause, the ModifyTable node delivers the computed
RETURNING rows as output, otherwise it returns nothing.  Handling INSERT
is pretty straightforward: the tuples returned from the plan tree below
ModifyTable are inserted into the correct result relation.  For UPDATE,
the plan tree returns the computed tuples to be updated, plus a "junk"
(hidden) CTID column identifying which table row is to be replaced by each
one.  For DELETE, the plan tree need only deliver a CTID column, and the
ModifyTable node visits each of those rows and marks the row deleted.
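
As an illustration of the junk-CTID handling, the row identity is pulled
out of the plan's output roughly like this (a fragmentary sketch modeled
loosely on nodeModifyTable.c, not verbatim):

	Datum		datum;
	bool		isnull;
	ItemPointer tupleid;

	/* fetch the hidden CTID column that the plan tree appended */
	datum = ExecGetJunkAttribute(slot, junkfilter->jf_junkAttNo, &isnull);
	if (isnull)
		elog(ERROR, "ctid is NULL");

	/* this identifies the table row to be replaced or deleted */
	tupleid = (ItemPointer) DatumGetPointer(datum);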

XXX a great deal more documentation needs to be written here...


Plan Trees and State Trees
--------------------------

The plan tree delivered by the planner contains a tree of Plan nodes (struct
types derived from struct Plan).  Each Plan node may have expression trees
associated with it, to represent its target list, qualification conditions,
etc.  During executor startup we build a parallel tree of identical structure
containing executor state nodes --- every plan and expression node type has
a corresponding executor state node type.  Each node in the state tree has a
pointer to its corresponding node in the plan tree, plus executor state data
as needed to implement that node type.  This arrangement allows the plan
tree to be completely read-only as far as the executor is concerned: all data
that is modified during execution is in the state tree.  Read-only plan trees
make life much simpler for plan caching and reuse.
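
The pairing is visible directly in the node layout; abridged from
src/include/nodes/execnodes.h, every PlanState carries, right at the top,
links back to its read-only Plan node and to the per-query EState:

	typedef struct PlanState
	{
		NodeTag		type;
		Plan	   *plan;		/* associated Plan node; never modified here */
		EState	   *state;		/* per-query executor state */
		/* ... many more fields holding per-node execution state ... */
	} PlanState;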

Altogether there are four classes of nodes used in these trees: Plan nodes,
their corresponding PlanState nodes, Expr nodes, and their corresponding
ExprState nodes.  (Actually, there are also List nodes, which are used as
"glue" in all four kinds of tree.)


Memory Management
-----------------

A "per query" memory context is created during CreateExecutorState();
all storage allocated during an executor invocation is allocated in that
context or a child context.  This allows easy reclamation of storage
during executor shutdown --- rather than messing with retail pfree's and
probable storage leaks, we just destroy the memory context.

In particular, the plan state trees and expression state trees described
in the previous section are allocated in the per-query memory context.

To avoid intra-query memory leaks, most processing while a query runs
is done in "per tuple" memory contexts, which are so-called because they
are typically reset to empty once per tuple.  Per-tuple contexts are usually
associated with ExprContexts, and commonly each PlanState node has its own
ExprContext to evaluate its qual and targetlist expressions in.
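
The convention is sketched below (hypothetical helper, not actual node
code): a node resets its ExprContext once per input tuple, so anything
palloc'd while evaluating the previous tuple's quals or targetlist is
released in bulk, with no explicit pfree's:

	#include "postgres.h"
	#include "executor/executor.h"

	static void
	example_start_new_tuple(PlanState *node, TupleTableSlot *slot)
	{
		ExprContext *econtext = node->ps_ExprContext;

		ResetExprContext(econtext);			/* drop the previous tuple's garbage */
		econtext->ecxt_scantuple = slot;	/* expose the new tuple to expressions */

		/* ... evaluate node->qual and project the targetlist here ... */
	}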


Query Processing Control Flow
-----------------------------

This is a sketch of control flow for full query processing:

	CreateQueryDesc

	ExecutorStart
		CreateExecutorState
			creates per-query context
		switch to per-query context to run ExecInitNode
		ExecInitNode --- recursively scans plan tree
			CreateExprContext
				creates per-tuple context
			ExecInitExpr
		AfterTriggerBeginQuery

	ExecutorRun
		ExecProcNode --- recursively called in per-query context
			ExecEvalExpr --- called in per-tuple context
			ResetExprContext --- to free memory

	ExecutorFinish
		ExecPostprocessPlan --- run any unfinished ModifyTable nodes
		AfterTriggerEndQuery

	ExecutorEnd
		ExecEndNode --- recursively releases resources
		FreeExecutorState
			frees per-query context and child contexts

	FreeQueryDesc

Per above comments, it's not really critical for ExecEndNode to free any
memory; it'll all go away in FreeExecutorState anyway.  However, we do need to
be careful to close relations, drop buffer pins, etc, so we do need to scan
the plan state tree to find these sorts of resources.


The executor can also be used to evaluate simple expressions without any Plan
tree ("simple" meaning "no aggregates and no sub-selects", though such might
be hidden inside function calls).  This case has a flow of control like

	CreateExecutorState
		creates per-query context

	CreateExprContext	-- or use GetPerTupleExprContext(estate)
		creates per-tuple context

	ExecPrepareExpr
		temporarily switch to per-query context
		run the expression through expression_planner
		ExecInitExpr

	Repeatedly do:
		ExecEvalExprSwitchContext
			ExecEvalExpr --- called in per-tuple context
		ResetExprContext --- to free memory

	FreeExecutorState
		frees per-query context, as well as ExprContext
		(a separate FreeExprContext call is not necessary)
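
Concretely, that flow looks roughly like this (a minimal, hedged sketch;
the helper name is hypothetical and the caller is assumed to supply an
already-built expression tree):

	#include "postgres.h"
	#include "executor/executor.h"

	static void
	evaluate_expr_once(Expr *expr)
	{
		EState	   *estate = CreateExecutorState();	/* per-query context */
		ExprContext *econtext = GetPerTupleExprContext(estate);
		ExprState  *exprstate;
		Datum		result;
		bool		isnull;

		/* runs expression_planner and builds the ExprState in per-query memory */
		exprstate = ExecPrepareExpr(expr, estate);

		result = ExecEvalExprSwitchContext(exprstate, econtext, &isnull);

		/* use or copy "result" here; it may point into executor memory */
		(void) result;

		ResetExprContext(econtext);		/* free per-tuple allocations */
		FreeExecutorState(estate);		/* frees the per-query context too */
	}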


EvalPlanQual (READ COMMITTED Update Checking)
---------------------------------------------

For simple SELECTs, the executor need only pay attention to tuples that are
valid according to the snapshot seen by the current transaction (ie, they
were inserted by a previously committed transaction, and not deleted by any
previously committed transaction).  However, for UPDATE and DELETE it is not
cool to modify or delete a tuple that's been modified by an open or
concurrently-committed transaction.  If we are running in SERIALIZABLE
isolation level then we just raise an error when this condition is seen to
occur.  In READ COMMITTED isolation level, we must work a lot harder.
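
The SERIALIZABLE-level behavior amounts to this (paraphrased from the
UPDATE/DELETE code paths, not verbatim): when the target row turns out to
have been changed by a concurrent transaction, the snapshot-based
isolation levels simply raise an error, while READ COMMITTED falls
through to the recheck described next:

	if (IsolationUsesXactSnapshot())
		ereport(ERROR,
				(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
				 errmsg("could not serialize access due to concurrent update")));

	/* READ COMMITTED: re-evaluate the quals against the latest row version */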

The basic idea in READ COMMITTED mode is to take the modified tuple
committed by the concurrent transaction (after waiting for it to commit,
if need be) and re-evaluate the query qualifications to see if it would
still meet the quals.  If so, we regenerate the updated tuple (if we are
doing an UPDATE) from the modified tuple, and finally update/delete the
modified tuple.  SELECT FOR UPDATE/SHARE behaves similarly, except that its
action is just to lock the modified tuple and return results based on that
version of the tuple.

To implement this checking, we actually re-run the query from scratch for
each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the
relation scan nodes tweaked to return only the current tuples --- either
the original ones, or the updated (and now locked) versions of the modified
tuple(s).  If this query returns a tuple, then the modified tuple(s) pass
the quals (and the query output is the suitably modified update tuple, if
we're doing UPDATE).  If no tuple is returned, then the modified tuple(s)
fail the quals, so we ignore the current result tuple and continue the
original query.

In UPDATE/DELETE, only the target relation needs to be handled this way.
In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE,
so we obtain lock on the current tuple version in each such relation before
executing the recheck.

It is also possible that there are relations in the query that are not
to be locked (they are neither the UPDATE/DELETE target nor specified to
be locked in SELECT FOR UPDATE/SHARE).  When re-running the test query
we want to use the same rows from these relations that were joined to
the locked rows.  For ordinary relations this can be implemented relatively
cheaply by including the row TID in the join outputs and re-fetching that
TID.  (The re-fetch is expensive, but we're trying to optimize the normal
case where no re-test is needed.)  We also have to consider non-table
relations, such as a ValuesScan or FunctionScan.  For these, since there
is no equivalent of TID, the only practical solution seems to be to include
the entire row value in the join output row.

We disallow set-returning functions in the targetlist of SELECT FOR UPDATE,
so as to ensure that at most one tuple can be returned for any particular
set of scan tuples.  Otherwise we'd get duplicates due to the original
query returning the same set of scan tuples multiple times.  Likewise,
SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.