$PostgreSQL: pgsql/src/backend/executor/README,v 1.5 2005/04/28 21:47:12 tgl Exp $

The Postgres Executor
---------------------

The executor processes a tree of "plan nodes". The plan tree is essentially
a demand-pull pipeline of tuple processing operations. Each node, when
called, will produce the next tuple in its output sequence, or NULL if no
more tuples are available. If the node is not a primitive relation-scanning
node, it will have child node(s) that it calls in turn to obtain input
tuples.
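
To make the convention concrete, here is a minimal sketch (not PostgreSQL
source; all names are hypothetical) of a filtering node's per-call function:
each call pulls tuples from the child until one can be returned, and NULL
signals end of stream.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Tuple Tuple;     /* opaque tuple type, for the sketch */

    typedef struct Node
    {
        struct Node *child;                     /* input subplan, if any */
        Tuple      *(*next) (struct Node *);    /* demand-pull entry point */
        bool        (*qual) (const Tuple *);    /* filter condition */
    } Node;

    static Tuple *
    FilterNext(Node *node)
    {
        for (;;)
        {
            /* pull the next input tuple from the child node */
            Tuple  *tup = node->child->next(node->child);

            if (tup == NULL)
                return NULL;    /* child exhausted, so our output ends too */
            if (node->qual(tup))
                return tup;     /* passes the qual: produce it */
            /* otherwise discard the tuple and pull again */
        }
    }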

Refinements on this basic model include:

* Choice of scan direction (forwards or backwards). Caution: this is not
currently well-supported. It works for primitive scan nodes, but not very
well for joins, aggregates, etc.

* Rescan command to reset a node and make it generate its output sequence
over again (see the sketch just after this list).

* Parameters that can alter a node's results. After adjusting a parameter,
the rescan command must be applied to that node and all nodes above it.
There is a moderately intelligent scheme to avoid rescanning nodes
unnecessarily (for example, Sort does not rescan its input if no parameters
of the input have changed, since it can just reread its stored sorted data).
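
Continuing the sketch above, a rescan resets whatever state lets a node
remember its position, so that the next pull restarts the output sequence
from the top (again hypothetical; the real executor dispatches to
per-node-type routines via ExecReScan):

    static void
    FilterReScan(Node *node)
    {
        /* a real node would also clear its own position/buffered state */
        if (node->child != NULL)
            FilterReScan(node->child);  /* propagate the reset downward */
    }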

The plan tree concept implements SELECT directly: it is only necessary to
deliver the top-level result tuples to the client, or insert them into
another table in the case of INSERT ... SELECT. (INSERT ... VALUES is
handled similarly, but the plan tree is just a Result node with no source
tables.) For UPDATE, the plan tree selects the tuples that need to be
updated (WHERE condition) and delivers a new calculated tuple value for each
such tuple, plus a "junk" (hidden) tuple CTID identifying the target tuple.
The executor's top level then uses this information to update the correct
tuple. DELETE is similar to UPDATE except that only a CTID need be
delivered by the plan tree.
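
For instance, the top level retrieves the junk CTID from each delivered row
roughly as follows (a hedged sketch: ExecGetJunkAttribute is real, but its
exact signature has varied across versions, and the junkfilter and slot
variables are assumed to be in scope):

    Datum       datum;
    bool        isNull;

    /* "ctid" is the junk attribute added to the plan's targetlist */
    if (ExecGetJunkAttribute(junkfilter, slot, "ctid", &datum, &isNull) &&
        !isNull)
    {
        ItemPointer tupleid = (ItemPointer) DatumGetPointer(datum);

        /* ... fetch and update/delete the tuple this TID identifies ... */
    }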

XXX a great deal more documentation needs to be written here...


Plan Trees and State Trees
--------------------------

The plan tree delivered by the planner contains a tree of Plan nodes (struct
types derived from struct Plan). Each Plan node may have expression trees
associated with it, to represent its target list, qualification conditions,
etc. During executor startup we build a parallel tree of identical structure
containing executor state nodes --- every plan and expression node type has
a corresponding executor state node type. Each node in the state tree has a
pointer to its corresponding node in the plan tree, plus executor state data
as needed to implement that node type. This arrangement allows the plan
tree to be completely read-only as far as the executor is concerned: all data
that is modified during execution is in the state tree. Read-only plan trees
make life much simpler for plan caching and reuse.
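
The shape of this parallelism, abridged (see src/include/nodes/plannodes.h
and src/include/nodes/execnodes.h for the real declarations; many fields
are omitted here):

    typedef struct PlanState
    {
        NodeTag     type;           /* identifies the node type */
        Plan       *plan;           /* corresponding read-only Plan node */
        EState     *state;          /* top-level state for the whole query */
        struct PlanState *lefttree;     /* input subtree(s), mirroring */
        struct PlanState *righttree;    /* the shape of the Plan tree */
        /* ... node-type-specific execution state in derived structs ... */
    } PlanState;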

Altogether there are four classes of nodes used in these trees: Plan nodes,
their corresponding PlanState nodes, Expr nodes, and their corresponding
ExprState nodes. (Actually, there are also List nodes, which are used as
"glue" in all four kinds of tree.)


Memory Management
-----------------

A "per query" memory context is created during CreateExecutorState();
all storage allocated during an executor invocation is allocated in that
context or a child context. This allows easy reclamation of storage
during executor shutdown --- rather than messing with retail pfree's and
probable storage leaks, we just destroy the memory context.

In particular, the plan state trees and expression state trees described
in the previous section are allocated in the per-query memory context.

To avoid intra-query memory leaks, most processing while a query runs
is done in "per tuple" memory contexts, which are so-called because they
are typically reset to empty once per tuple. Per-tuple contexts are usually
associated with ExprContexts, and commonly each PlanState node has its own
ExprContext to evaluate its qual and targetlist expressions in.
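
The typical coding pattern looks roughly like this (the context APIs shown
are real; the surrounding node code is assumed):

    /* Allocations made while processing one tuple go into the per-tuple
     * context; resetting it afterward reclaims them all at once. */
    ExprContext *econtext = node->ps.ps_ExprContext;
    MemoryContext oldcxt;

    ResetExprContext(econtext);     /* discard the previous cycle's data */
    oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
    /* ... evaluate quals and targetlist, building transient data ... */
    MemoryContextSwitchTo(oldcxt);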


Query Processing Control Flow
-----------------------------

This is a sketch of control flow for full query processing:

    CreateQueryDesc

    ExecutorStart
        CreateExecutorState
            creates per-query context
        switch to per-query context to run ExecInitNode
        ExecInitNode --- recursively scans plan tree
            CreateExprContext
                creates per-tuple context
            ExecInitExpr

    ExecutorRun
        ExecProcNode --- recursively called in per-query context
            ExecEvalExpr --- called in per-tuple context
            ResetExprContext --- to free memory

    ExecutorEnd
        ExecEndNode --- recursively releases resources
        FreeExecutorState
            frees per-query context and child contexts

    FreeQueryDesc
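
In code form, a caller drives that sequence roughly as follows (a hedged
sketch: these functions are real, but their argument lists are elided):

    QueryDesc  *qdesc = CreateQueryDesc( /* plan, snapshot, dest, ... */ );

    ExecutorStart(qdesc /* , ... */);   /* builds the state trees */
    ExecutorRun(qdesc /* , direction, count */);    /* pulls the tuples */
    ExecutorEnd(qdesc);                 /* releases resources */
    FreeQueryDesc(qdesc);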

Per above comments, it's not really critical for ExecEndNode to free any
memory; it'll all go away in FreeExecutorState anyway. However, we do need to
be careful to close relations, drop buffer pins, etc, so we do need to scan
the plan state tree to find these sorts of resources.

The executor can also be used to evaluate simple expressions without any Plan
tree ("simple" meaning "no aggregates and no sub-selects", though such might
be hidden inside function calls). This case has a flow of control like

    CreateExecutorState
        creates per-query context

    CreateExprContext -- or use GetPerTupleExprContext(estate)
        creates per-tuple context

    ExecPrepareExpr
        switch to per-query context to run ExecInitExpr
        ExecInitExpr

    Repeatedly do:
        ExecEvalExprSwitchContext
            ExecEvalExpr --- called in per-tuple context
        ResetExprContext --- to free memory

    FreeExecutorState
        frees per-query context, as well as ExprContext
        (a separate FreeExprContext call is not necessary)
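
Written out as code, this looks roughly like the following (signatures are
approximate and have varied across versions; the loop condition is assumed):

    EState     *estate = CreateExecutorState();
    ExprContext *econtext = GetPerTupleExprContext(estate);
    ExprState  *exprstate = ExecPrepareExpr(expr, estate);
    bool        isNull;

    while (have_more_input)         /* hypothetical loop condition */
    {
        Datum   result = ExecEvalExprSwitchContext(exprstate, econtext,
                                                   &isNull /* , ... */);

        /* ... use result ... */
        ResetExprContext(econtext); /* free this cycle's per-tuple memory */
    }

    FreeExecutorState(estate);      /* also frees the ExprContext */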


EvalPlanQual (READ COMMITTED update checking)
---------------------------------------------

For simple SELECTs, the executor need only pay attention to tuples that are
valid according to the snapshot seen by the current transaction (ie, they
were inserted by a previously committed transaction, and not deleted by any
previously committed transaction). However, for UPDATE and DELETE it is not
cool to modify or delete a tuple that's been modified by an open or
concurrently-committed transaction. If we are running in SERIALIZABLE
isolation level then we just raise an error when this condition is seen to
occur. In READ COMMITTED isolation level, we must work a lot harder.

The basic idea in READ COMMITTED mode is to take the modified tuple
committed by the concurrent transaction (after waiting for it to commit,
if need be) and re-evaluate the query qualifications to see if it would
still meet the quals. If so, we regenerate the updated tuple (if we are
doing an UPDATE) from the modified tuple, and finally update/delete the
modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its
action is just to lock the modified tuple.

To implement this checking, we actually re-run the entire query from scratch
for each modified tuple, but with the scan node that sourced the original
tuple set to return only the modified tuple, not the original tuple or any
of the rest of the relation. If this query returns a tuple, then the
modified tuple passes the quals (and the query output is the suitably
modified update tuple, if we're doing UPDATE). If no tuple is returned,
then the modified tuple fails the quals, so we ignore it and continue the
original query. (This is reasonably efficient for simple queries, but may
be horribly slow for joins. A better design would be nice; one thought for
future investigation is to treat the tuple substitution like a parameter,
so that we can avoid rescanning unrelated nodes.)
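
Reduced to a conceptual sketch (every name here is hypothetical; the real
logic lives in EvalPlanQual in execMain.c), one such recheck looks like:

    typedef struct RecheckQuery RecheckQuery;   /* hypothetical */
    typedef struct ScanNode ScanNode;           /* hypothetical */
    typedef struct Tuple Tuple;                 /* hypothetical */

    extern void pin_scan_to_single_tuple(ScanNode *scan, Tuple *tup);
    extern Tuple *run_query(RecheckQuery *q);

    static bool
    recheck_modified_tuple(RecheckQuery *q, ScanNode *source, Tuple *newtup)
    {
        /* make the scan that sourced the original tuple return only newtup */
        pin_scan_to_single_tuple(source, newtup);

        /* then re-run the entire query from scratch */
        Tuple  *result = run_query(q);

        /* non-NULL means the modified tuple still passes the quals (and
         * for UPDATE, result is the suitably recomputed new tuple) */
        return result != NULL;
    }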

Note a fundamental bogosity of this approach: if the relation containing
the original tuple is being used in a self-join, the other instance(s) of
the relation will be treated as still containing the original tuple, whereas
logical consistency would demand that the modified tuple appear in them too.
But we'd have to actually substitute the modified tuple for the original,
while still returning all the rest of the relation, to ensure consistent
answers. Implementing this correctly is a task for future work.

In UPDATE/DELETE, only the target relation needs to be handled this way,
so only one special recheck query needs to execute at a time. In SELECT FOR
UPDATE, there may be multiple relations flagged FOR UPDATE, so it's possible
that while we are executing a recheck query for one modified tuple, we will
hit another modified tuple in another relation. In this case we "stack up"
recheck queries: a sub-recheck query is spawned in which both the first and
second modified tuples will be returned as the only components of their
relations. (In the event of success, all these modified tuples will be
locked.) Again, this isn't necessarily quite the right thing ... but in
simple cases it works. Potentially, recheck queries could get nested to the
depth of the number of FOR UPDATE/SHARE relations in the query.

It should be noted also that UPDATE/DELETE expect at most one tuple to
result from the modified query, whereas in the FOR UPDATE case it's possible
for multiple tuples to result (since we could be dealing with a join in
which multiple tuples join to the modified tuple). We want FOR UPDATE to
lock all relevant tuples, so we pass all tuples output by all the stacked
recheck queries back to the executor toplevel for locking.