src/backend/executor/README

The Postgres Executor
=====================

The executor processes a tree of "plan nodes". The plan tree is essentially
a demand-pull pipeline of tuple processing operations. Each node, when
called, will produce the next tuple in its output sequence, or NULL if no
more tuples are available. If the node is not a primitive relation-scanning
node, it will have child node(s) that it calls in turn to obtain input
tuples.

Refinements on this basic model include:

* Choice of scan direction (forwards or backwards). Caution: this is not
currently well-supported. It works for primitive scan nodes, but not very
well for joins, aggregates, etc.

* Rescan command to reset a node and make it generate its output sequence
over again.

* Parameters that can alter a node's results. After adjusting a parameter,
the rescan command must be applied to that node and all nodes above it.
There is a moderately intelligent scheme to avoid rescanning nodes
unnecessarily (for example, Sort does not rescan its input if no parameters
of the input have changed, since it can just reread its stored sorted data).

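As a rough illustration of the demand-pull model and the rescan command,
here is a minimal, self-contained sketch in C. It is not executor code:
ToyNode, toy_scan_next, toy_filter_even_next, toy_rescan and toy_sum are
hypothetical names invented for this example, and a "tuple" is just a
pointer to an int.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy node; this is NOT the executor's PlanState. */
typedef struct ToyNode ToyNode;
struct ToyNode
{
    const int  *(*next) (ToyNode *self);    /* next "tuple", or NULL */
    ToyNode    *child;      /* input node, NULL for a primitive scan */
    const int  *rows;       /* scan only: backing array */
    size_t      nrows;      /* scan only: number of rows */
    size_t      pos;        /* scan only: current position */
};

/* Primitive scan: hands out array elements one at a time. */
const int *
toy_scan_next(ToyNode *self)
{
    if (self->pos >= self->nrows)
        return NULL;        /* no more tuples */
    return &self->rows[self->pos++];
}

/* Filter: pulls from its child until an even value shows up. */
const int *
toy_filter_even_next(ToyNode *self)
{
    const int  *tup;

    assert(self->child != NULL);
    while ((tup = self->child->next(self->child)) != NULL)
    {
        if (*tup % 2 == 0)
            return tup;
    }
    return NULL;
}

/* Rescan: reset a node (and its input) to regenerate its output. */
void
toy_rescan(ToyNode *node)
{
    node->pos = 0;
    if (node->child != NULL)
        toy_rescan(node->child);
}

/* Drain a tree from the top, summing whatever tuples it delivers. */
int
toy_sum(ToyNode *top)
{
    int         sum = 0;
    const int  *tup;

    while ((tup = top->next(top)) != NULL)
        sum += *tup;
    return sum;
}
```

The caller only ever talks to the top node; each node decides for itself
when to pull from its child. After the tree is exhausted, toy_sum returns 0
until toy_rescan is applied.
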
For a SELECT, it is only necessary to deliver the top-level result tuples
to the client. For INSERT/UPDATE/DELETE, the actual table modification
operations happen in a top-level ModifyTable plan node. If the query
includes a RETURNING clause, the ModifyTable node delivers the computed
RETURNING rows as output, otherwise it returns nothing. Handling INSERT
is pretty straightforward: the tuples returned from the plan tree below
ModifyTable are inserted into the correct result relation. For UPDATE,
the plan tree returns the computed tuples to be updated, plus a "junk"
(hidden) CTID column identifying which table row is to be replaced by each
one. For DELETE, the plan tree need only deliver a CTID column, and the
ModifyTable node visits each of those rows and marks the row deleted.

XXX a great deal more documentation needs to be written here...


Plan Trees and State Trees
--------------------------

The plan tree delivered by the planner contains a tree of Plan nodes (struct
types derived from struct Plan). During executor startup we build a parallel
tree of identical structure containing executor state nodes --- every plan
node type has a corresponding executor state node type. Each node in the
state tree has a pointer to its corresponding node in the plan tree, plus
executor state data as needed to implement that node type. This arrangement
allows the plan tree to be completely read-only so far as the executor is
concerned: all data that is modified during execution is in the state tree.
Read-only plan trees make life much simpler for plan caching and reuse.

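The relationship between the two trees can be sketched as follows. This is
a simplified model under stated assumptions, not the real struct Plan or
PlanState: ToyPlan, ToyPlanState, toy_init_node and toy_end_node are
hypothetical names, and each node has at most one child.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical read-only plan node, as a planner might produce it. */
typedef struct ToyPlan
{
    int         limit;              /* some plan-time constant */
    const struct ToyPlan *left;     /* child plan, or NULL */
} ToyPlan;

/* Hypothetical state node; one is built per ToyPlan at startup. */
typedef struct ToyPlanState
{
    const ToyPlan *plan;            /* back-pointer into the plan tree */
    struct ToyPlanState *left;      /* mirrors plan->left */
    int         emitted;            /* run-time data lives only here */
} ToyPlanState;

/* Build a state tree with the same shape as the given plan tree. */
ToyPlanState *
toy_init_node(const ToyPlan *plan)
{
    ToyPlanState *state;

    if (plan == NULL)
        return NULL;
    state = calloc(1, sizeof(ToyPlanState));
    assert(state != NULL);
    state->plan = plan;
    state->left = toy_init_node(plan->left);
    return state;
}

/* Release a state tree; the plan tree is left untouched. */
void
toy_end_node(ToyPlanState *state)
{
    if (state == NULL)
        return;
    toy_end_node(state->left);
    free(state);
}
```

Because the plan is only ever reached through a const pointer, two state
trees can be built from the same plan and run independently, which is the
property that plan caching relies on.
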
Each Plan node may have expression trees associated with it, to represent
its target list, qualification conditions, etc. These trees are also
read-only to the executor, but the executor state for expression evaluation
does not mirror the Plan expression's tree shape, as explained below.
Rather, there's just one ExprState node per expression tree, although this
may have sub-nodes for some complex expression node types.

Altogether there are four classes of nodes used in these trees: Plan nodes,
their corresponding PlanState nodes, Expr nodes, and ExprState nodes.
(Actually, there are also List nodes, which are used as "glue" in all
three tree-based representations.)


Expression Trees and ExprState nodes
------------------------------------

Expression trees, in contrast to Plan trees, are not mirrored into a
corresponding tree of state nodes. Instead each separately executable
expression tree (e.g. a Plan's qual or targetlist) is represented by one
ExprState node. The ExprState node contains the information needed to
evaluate the expression in a compact, linear form. That compact form is
stored as a flat array in ExprState->steps[] (an array of ExprEvalStep,
not ExprEvalStep *).

The reasons for choosing such a representation include:
- commonly the amount of work needed to evaluate one Expr-type node is
  small enough that the overhead of having to perform a tree-walk
  during evaluation is significant.
- the flat representation can be evaluated non-recursively within a single
  function, reducing stack depth and function call overhead.
- such a representation is usable both for fast interpreted execution,
  and for compiling into native code.

The Plan-tree representation of an expression is compiled into an
ExprState node by ExecInitExpr(). As much complexity as possible should
be handled by ExecInitExpr() (and helpers), instead of execution time
where both interpreted and compiled versions would need to deal with the
complexity. Besides duplicating effort between execution approaches,
runtime initialization checks also have a small but noticeable cost every
time the expression is evaluated. Therefore, we allow ExecInitExpr() to
precompute information that we do not expect to vary across execution of a
single query, for example the set of CHECK constraint expressions to be
applied to a domain type. This could not be done at plan time without
greatly increasing the number of events that require plan invalidation.
(Previously, some information of this kind was rechecked on each
expression evaluation, but that seems like unnecessary overhead.)


Expression Initialization
-------------------------

During ExecInitExpr() and similar routines, Expr trees are converted
into the flat representation. Each Expr node might be represented by
zero, one, or more ExprEvalSteps.

Each ExprEvalStep's work is determined by its opcode (of enum ExprEvalOp)
and it stores the result of its work into the Datum variable and boolean
null flag variable pointed to by ExprEvalStep->resvalue/resnull.
Complex expressions are performed by chaining together several steps.
For example, "a + b" (one OpExpr, with two Var expressions) would be
represented as two steps to fetch the Var values, and one step for the
evaluation of the function underlying the + operator. The steps for the
Vars would have their resvalue/resnull pointing directly to the appropriate
arg[] and argnull[] array elements in the FunctionCallInfoData struct that
is used by the function evaluation step, thus avoiding extra work to copy
the result values around.

The last entry in a completed ExprState->steps array is always an
EEOP_DONE step; this removes the need to test for end-of-array while
iterating. Also, if the expression contains any variable references (to
user columns of the ExprContext's INNER, OUTER, or SCAN tuples), the steps
array begins with EEOP_*_FETCHSOME steps that ensure that the relevant
tuples have been deconstructed to make the required columns directly
available (cf. slot_getsomeattrs()). This allows individual Var-fetching
steps to be little more than an array lookup.

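To make the "a + b" example concrete, here is a small, self-contained
model of a flat step array in C. It is only a sketch: ToyDatum, ToyStep,
ToyExpr, the TOY_* opcodes, toy_build_a_plus_b and toy_eval are
hypothetical names and do not match the real ExprEvalStep layout. The
sketch assumes the input tuple is already deconstructed into arrays, so it
has no counterpart of the EEOP_*_FETCHSOME steps, and it assumes the
addition is strict (NULL input gives NULL output).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef long ToyDatum;

typedef enum ToyOp
{
    TOY_FETCH_VAR,              /* copy one input column to resvalue */
    TOY_ADD,                    /* add args[0] and args[1] */
    TOY_DONE                    /* terminator: return the result */
} ToyOp;

typedef struct ToyStep
{
    ToyOp       opcode;
    ToyDatum   *resvalue;       /* where this step stores its result */
    bool       *resnull;
    int         attnum;         /* TOY_FETCH_VAR: column index */
    ToyDatum   *args;           /* TOY_ADD: the two inputs */
    bool       *argnull;
} ToyStep;

typedef struct ToyExpr
{
    ToyStep     steps[4];
    ToyDatum    args[2];        /* stands in for the function call args */
    bool        argnull[2];
    ToyDatum    resvalue;       /* overall expression result */
    bool        resnull;
} ToyExpr;

/* "Compile" a_col + b_col; the Var steps write straight into args[]. */
void
toy_build_a_plus_b(ToyExpr *e, int a_col, int b_col)
{
    assert(a_col >= 0 && b_col >= 0);
    e->steps[0] = (ToyStep) {TOY_FETCH_VAR, &e->args[0], &e->argnull[0], a_col, NULL, NULL};
    e->steps[1] = (ToyStep) {TOY_FETCH_VAR, &e->args[1], &e->argnull[1], b_col, NULL, NULL};
    e->steps[2] = (ToyStep) {TOY_ADD, &e->resvalue, &e->resnull, 0, e->args, e->argnull};
    e->steps[3] = (ToyStep) {TOY_DONE, NULL, NULL, 0, NULL, NULL};
}

/* Non-recursive evaluation: walk the array until the DONE step. */
ToyDatum
toy_eval(ToyExpr *e, const ToyDatum *values, const bool *isnull,
         bool *result_null)
{
    for (const ToyStep *op = e->steps;; op++)
    {
        switch (op->opcode)
        {
            case TOY_FETCH_VAR:
                *op->resvalue = values[op->attnum];
                *op->resnull = isnull[op->attnum];
                break;
            case TOY_ADD:
                *op->resnull = op->argnull[0] || op->argnull[1];
                *op->resvalue = *op->resnull ? 0 : op->args[0] + op->args[1];
                break;
            case TOY_DONE:
                *result_null = e->resnull;
                return e->resvalue;
        }
    }
}
```

Note that the loop has no end-of-array test; the trailing TOY_DONE step is
what stops it. The ToyExpr must be built in place, since its steps hold
pointers into the same struct.
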
Most of ExecInitExpr()'s work is done by the recursive function
ExecInitExprRec() and its subroutines. ExecInitExprRec() maps one Expr
node into the steps required for execution, recursing as needed for
sub-expressions.

Each ExecInitExprRec() call has to specify where that subexpression's
results are to be stored (via the resv/resnull parameters). This allows
the above scenario of evaluating a (sub-)expression directly into
fcinfo->arg/argnull, but also requires some care: target Datum/isnull
variables may not be shared with another ExecInitExprRec() unless the
results are only needed by steps executing before further usages of those
target Datum/isnull variables. Due to the non-recursiveness of the
ExprEvalStep representation that's usually easy to guarantee.

ExecInitExprRec() pushes new operations into the ExprState->steps array
using ExprEvalPushStep(). To keep the steps as a consecutively laid out
array, ExprEvalPushStep() has to repalloc the entire array when there's
not enough space. Because of that it is *not* allowed to point directly
into any of the steps during expression initialization. Therefore, the
resv/resnull for a subexpression usually point to some storage that is
palloc'd separately from the steps array. For instance, the
FunctionCallInfoData for a function call step is separately allocated
rather than being part of the ExprEvalStep array. The overall result
of a complete expression is typically returned into the resvalue/resnull
fields of the ExprState node itself.

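The reason for that rule can be seen in a minimal model of a growable step
array. This sketch uses plain realloc() in place of repalloc, and
ToyStep, ToyExprState and toy_push_step are hypothetical names, not the
real ExprEvalPushStep() interface. The point is that the push routine
returns an index, which stays valid when the array moves, whereas a
pointer to an element would not.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ToyStep
{
    int         opcode;
    int         jumpdone;       /* unused here; shows steps carry data */
} ToyStep;

typedef struct ToyExprState
{
    ToyStep    *steps;          /* consecutively laid out array */
    int         steps_len;      /* entries in use */
    int         steps_alloc;    /* entries allocated */
} ToyExprState;

/*
 * Append a copy of *s and return its index.  The array may be moved by
 * realloc(), so any ToyStep pointer obtained earlier must be treated as
 * invalid after this call; indexes remain usable.
 */
int
toy_push_step(ToyExprState *es, const ToyStep *s)
{
    if (es->steps_len == es->steps_alloc)
    {
        int         newalloc = es->steps_alloc ? es->steps_alloc * 2 : 4;
        ToyStep    *p = realloc(es->steps, newalloc * sizeof(ToyStep));

        assert(p != NULL);
        es->steps = p;
        es->steps_alloc = newalloc;
    }
    es->steps[es->steps_len] = *s;
    return es->steps_len++;
}
```

A caller starts from a zeroed ToyExprState, pushes steps, and refers back
to earlier ones as es.steps[index] only after all pushes that might
intervene have been done.
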
Some steps, e.g. boolean expressions, allow skipping evaluation of
certain subexpressions. In the flat representation this amounts to
jumping to some later step rather than just continuing consecutively
with the next step. The target for such a jump is represented by
the integer index in the ExprState->steps array of the step to execute
next. (Compare the EEO_NEXT and EEO_JUMP macros in execExprInterp.c.)

Typically, ExecInitExprRec() has to push a jumping step into the steps
array, then recursively generate steps for the subexpression that might
get skipped over, then go back and fix up the jump target index using
the now-known length of the subexpression's steps. This is handled by
adjust_jumps lists in execExpr.c.

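The push-then-fix-up pattern can be sketched with a toy compiler for a
short-circuiting AND over boolean columns. This is an illustration under
stated assumptions, not the code in execExpr.c: ToyStep, ToyExpr, the
TOY_* opcodes, toy_compile_and and toy_eval are hypothetical names, the
step array is fixed-size for brevity, NULLs are ignored, and the "pending"
array plays the role that the adjust_jumps lists are described as playing.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_STEPS 32

typedef enum ToyOp
{
    TOY_LOAD_BOOL,              /* result = cols[attnum] */
    TOY_JUMP_IF_FALSE,          /* if !result, continue at jumpdone */
    TOY_DONE
} ToyOp;

typedef struct ToyStep
{
    ToyOp       opcode;
    int         attnum;         /* TOY_LOAD_BOOL: input column */
    int         jumpdone;       /* TOY_JUMP_IF_FALSE: index of next step */
} ToyStep;

typedef struct ToyExpr
{
    ToyStep     steps[TOY_MAX_STEPS];
    int         nsteps;
    bool        result;
} ToyExpr;

static int
toy_push(ToyExpr *e, ToyStep s)
{
    assert(e->nsteps < TOY_MAX_STEPS);
    e->steps[e->nsteps] = s;
    return e->nsteps++;
}

/* Compile "col[0] AND col[1] AND ... AND col[ncols-1]". */
void
toy_compile_and(ToyExpr *e, int ncols)
{
    int         pending[TOY_MAX_STEPS]; /* jumps awaiting a target */
    int         npending = 0;
    int         done;

    e->nsteps = 0;
    for (int i = 0; i < ncols; i++)
    {
        toy_push(e, (ToyStep) {TOY_LOAD_BOOL, i, -1});
        /* target not known yet: remember the step's index */
        pending[npending++] = toy_push(e, (ToyStep) {TOY_JUMP_IF_FALSE, 0, -1});
    }
    done = toy_push(e, (ToyStep) {TOY_DONE, 0, -1});

    /* now the target is known, so go back and fix up every jump */
    for (int i = 0; i < npending; i++)
        e->steps[pending[i]].jumpdone = done;
}

/* Returns the AND result; *executed counts the steps that ran. */
bool
toy_eval(ToyExpr *e, const bool *cols, int *executed)
{
    int         pc = 0;

    e->result = true;           /* AND of zero inputs */
    *executed = 0;
    for (;;)
    {
        const ToyStep *op = &e->steps[pc];

        (*executed)++;
        switch (op->opcode)
        {
            case TOY_LOAD_BOOL:
                e->result = cols[op->attnum];
                pc++;
                break;
            case TOY_JUMP_IF_FALSE:
                pc = e->result ? pc + 1 : op->jumpdone;
                break;
            case TOY_DONE:
                return e->result;
        }
    }
}
```

With three inputs the compiled array has seven steps. When the second
input is false, evaluation runs only five of them, because the jump skips
the steps for the third input.
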
The last step in constructing an ExprState is to apply ExecReadyExpr(),
which readies it for execution using whichever execution method has been
selected.


Expression Evaluation
---------------------

To allow for different methods of expression evaluation, and for
better branch/jump target prediction, expressions are evaluated by
calling ExprState->evalfunc (via ExecEvalExpr() and friends).

ExecReadyExpr() can choose the method of interpretation by setting
evalfunc to an appropriate function. The default execution function,
ExecInterpExpr, is implemented in execExprInterp.c; see its header
comment for details. Special-case evalfuncs are used for certain
especially-simple expressions.

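The indirection through evalfunc can be modelled with an ordinary function
pointer that is chosen once, when the expression is readied. The sketch
below is hypothetical throughout: ToyExpr, toy_ready_expr, toy_eval_expr
and the two evaluation functions are invented names, and the "general"
path is a stand-in for a real interpreter loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyExpr ToyExpr;
typedef long (*ToyEvalFunc) (ToyExpr *expr, const long *row);

struct ToyExpr
{
    ToyEvalFunc evalfunc;       /* chosen by toy_ready_expr() */
    bool        is_simple_var;  /* expression is just one column? */
    int         attnum;         /* column to read */
    long        addend;         /* general case: value added to it */
};

/* Special-case path for an expression that is only a column fetch. */
static long
toy_eval_just_var(ToyExpr *expr, const long *row)
{
    return row[expr->attnum];
}

/* Stand-in for the general interpreter. */
static long
toy_eval_general(ToyExpr *expr, const long *row)
{
    return row[expr->attnum] + expr->addend;
}

/* Pick the evaluation method once, before the first evaluation. */
void
toy_ready_expr(ToyExpr *expr)
{
    expr->evalfunc = expr->is_simple_var ? toy_eval_just_var
                                         : toy_eval_general;
}

/* Callers never care which method was chosen. */
long
toy_eval_expr(ToyExpr *expr, const long *row)
{
    assert(expr->evalfunc != NULL);
    return expr->evalfunc(expr, row);
}
```

The decision is made once per expression rather than once per evaluation,
so the per-tuple path contains a single indirect call and no method test.
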
Note that a lot of the more complex expression evaluation steps, which are
less performance-critical than the simpler ones, are implemented as
separate functions outside the fast-path of expression execution, allowing
their implementation to be shared between interpreted and compiled
expression evaluation. This means that these helper functions are not
allowed to perform expression step dispatch themselves, as the method of
dispatch will vary based on the caller. The helpers therefore cannot call
for the execution of subexpressions; all subexpression results they need
must be computed by earlier steps. And dispatch to the following
expression step must be performed after returning from the helper.


Targetlist Evaluation
---------------------

ExecBuildProjectionInfo builds an ExprState that has the effect of
evaluating a targetlist into ExprState->resultslot. A generic targetlist
expression is executed by evaluating it as discussed above (storing the
result into the ExprState's resvalue/resnull fields) and then using an
EEOP_ASSIGN_TMP step to move the result into the appropriate tts_values[]
and tts_isnull[] array elements of the result slot. There are special
fast-path step types (EEOP_ASSIGN_*_VAR) to handle targetlist entries that
are simple Vars using only one step instead of two.


Memory Management
-----------------

A "per query" memory context is created during CreateExecutorState();
all storage allocated during an executor invocation is allocated in that
context or a child context. This allows easy reclamation of storage
during executor shutdown --- rather than messing with retail pfree's and
probable storage leaks, we just destroy the memory context.

In particular, the plan state trees and expression state trees described
in the previous sections are allocated in the per-query memory context.

To avoid intra-query memory leaks, most processing while a query runs
is done in "per tuple" memory contexts, which are so-called because they
are typically reset to empty once per tuple. Per-tuple contexts are usually
associated with ExprContexts, and commonly each PlanState node has its own
ExprContext to evaluate its qual and targetlist expressions in.

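The effect of resetting a per-tuple context can be illustrated with a toy
bump allocator. This is only a sketch of the idea and is far simpler than
a real memory context: ToyContext, ToyAlign, TOY_SLOTS, toy_alloc,
toy_reset and toy_process are hypothetical names, and the arena has a
fixed size.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 128

/* One slot is large and aligned enough for any basic type. */
typedef union ToyAlign
{
    long long   ll;
    long double ld;
    void       *p;
} ToyAlign;

typedef struct ToyContext
{
    ToyAlign    slots[TOY_SLOTS];
    size_t      used;           /* slots handed out since last reset */
} ToyContext;

/* Carve the next chunk out of the arena; NULL if it does not fit. */
void *
toy_alloc(ToyContext *cxt, size_t size)
{
    size_t      nslots = (size + sizeof(ToyAlign) - 1) / sizeof(ToyAlign);
    void       *p;

    if (nslots > TOY_SLOTS - cxt->used)
        return NULL;
    p = &cxt->slots[cxt->used];
    cxt->used += nslots;
    return p;
}

/* Free everything at once; no per-chunk bookkeeping is needed. */
void
toy_reset(ToyContext *cxt)
{
    cxt->used = 0;
}

/*
 * Process ntuples tuples, making one scratch allocation per tuple and
 * resetting afterwards.  Returns the peak arena usage, in slots.
 */
size_t
toy_process(ToyContext *per_tuple, int ntuples, size_t scratch_bytes)
{
    size_t      peak = 0;

    for (int i = 0; i < ntuples; i++)
    {
        void       *scratch = toy_alloc(per_tuple, scratch_bytes);

        assert(scratch != NULL);
        if (per_tuple->used > peak)
            peak = per_tuple->used;
        toy_reset(per_tuple);   /* once per tuple */
    }
    return peak;
}
```

Peak usage stays at the size of one tuple's scratch space no matter how
many tuples are processed, which is the property the per-tuple contexts
are meant to provide.
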

Query Processing Control Flow
-----------------------------

This is a sketch of control flow for full query processing:

    CreateQueryDesc

    ExecutorStart
        CreateExecutorState
            creates per-query context
        switch to per-query context to run ExecInitNode
        AfterTriggerBeginQuery
        ExecInitNode --- recursively scans plan tree
            CreateExprContext
                creates per-tuple context
            ExecInitExpr

    ExecutorRun
        ExecProcNode --- recursively called in per-query context
            ExecEvalExpr --- called in per-tuple context
            ResetExprContext --- to free memory

    ExecutorFinish
        ExecPostprocessPlan --- run any unfinished ModifyTable nodes
        AfterTriggerEndQuery

    ExecutorEnd
        ExecEndNode --- recursively releases resources
        FreeExecutorState
            frees per-query context and child contexts

    FreeQueryDesc

Per above comments, it's not really critical for ExecEndNode to free any
memory; it'll all go away in FreeExecutorState anyway. However, we do need to
be careful to close relations, drop buffer pins, etc, so we do need to scan
the plan state tree to find these sorts of resources.


The executor can also be used to evaluate simple expressions without any Plan
tree ("simple" meaning "no aggregates and no sub-selects", though such might
be hidden inside function calls). This case has a flow of control like

    CreateExecutorState
        creates per-query context

    CreateExprContext -- or use GetPerTupleExprContext(estate)
        creates per-tuple context

    ExecPrepareExpr
        temporarily switch to per-query context
        run the expression through expression_planner
        ExecInitExpr

    Repeatedly do:
        ExecEvalExprSwitchContext
            ExecEvalExpr --- called in per-tuple context
        ResetExprContext --- to free memory

    FreeExecutorState
        frees per-query context, as well as ExprContext
        (a separate FreeExprContext call is not necessary)


EvalPlanQual (READ COMMITTED Update Checking)
---------------------------------------------

For simple SELECTs, the executor need only pay attention to tuples that are
valid according to the snapshot seen by the current transaction (ie, they
were inserted by a previously committed transaction, and not deleted by any
previously committed transaction). However, for UPDATE and DELETE it is not
cool to modify or delete a tuple that's been modified by an open or
concurrently-committed transaction. If we are running in SERIALIZABLE
isolation level then we just raise an error when this condition is seen to
occur. In READ COMMITTED isolation level, we must work a lot harder.

The basic idea in READ COMMITTED mode is to take the modified tuple
committed by the concurrent transaction (after waiting for it to commit,
if need be) and re-evaluate the query qualifications to see if it would
still meet the quals. If so, we regenerate the updated tuple (if we are
doing an UPDATE) from the modified tuple, and finally update/delete the
modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its
action is just to lock the modified tuple and return results based on that
version of the tuple.

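The basic idea can be reduced to a few lines of C for a single-table
update. This is a deliberately simplified sketch, not the EvalPlanQual
machinery: ToyRow, toy_qual, toy_project and toy_epq_recheck are
hypothetical names, the statement being modelled is assumed to be
"UPDATE t SET balance = balance - 100 WHERE balance >= 100", and waiting
for the concurrent transaction and locking the row are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyRow
{
    int         id;
    int         balance;
} ToyRow;

/* The query's qualification: WHERE balance >= 100. */
bool
toy_qual(const ToyRow *row)
{
    return row->balance >= 100;
}

/* The query's targetlist: SET balance = balance - 100. */
ToyRow
toy_project(const ToyRow *row)
{
    ToyRow      result = *row;

    result.balance -= 100;
    return result;
}

/*
 * Recheck against the version committed by the concurrent transaction.
 * If that version still passes the quals, regenerate the updated tuple
 * from it and return true; otherwise return false, meaning the caller
 * should ignore this row and continue the original query.
 */
bool
toy_epq_recheck(const ToyRow *latest, ToyRow *newrow)
{
    assert(latest != NULL && newrow != NULL);
    if (!toy_qual(latest))
        return false;
    *newrow = toy_project(latest);
    return true;
}
```

The new tuple is computed from the concurrently modified version, not from
the version the original snapshot saw, so a concurrent credit is preserved
and a concurrent debit can cause the row to be skipped.
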
To implement this checking, we actually re-run the query from scratch for
each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the
relation scan nodes tweaked to return only the current tuples --- either
the original ones, or the updated (and now locked) versions of the modified
tuple(s). If this query returns a tuple, then the modified tuple(s) pass
the quals (and the query output is the suitably modified update tuple, if
we're doing UPDATE). If no tuple is returned, then the modified tuple(s)
fail the quals, so we ignore the current result tuple and continue the
original query.

In UPDATE/DELETE, only the target relation needs to be handled this way.
In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE,
so we obtain lock on the current tuple version in each such relation before
executing the recheck.

It is also possible that there are relations in the query that are not
to be locked (they are neither the UPDATE/DELETE target nor specified to
be locked in SELECT FOR UPDATE/SHARE). When re-running the test query
we want to use the same rows from these relations that were joined to
the locked rows. For ordinary relations this can be implemented relatively
cheaply by including the row TID in the join outputs and re-fetching that
TID. (The re-fetch is expensive, but we're trying to optimize the normal
case where no re-test is needed.) We also have to consider non-table
relations, such as a ValuesScan or FunctionScan. For these, since there
is no equivalent of TID, the only practical solution seems to be to include
the entire row value in the join output row.

We disallow set-returning functions in the targetlist of SELECT FOR UPDATE,
so as to ensure that at most one tuple can be returned for any particular
set of scan tuples. Otherwise we'd get duplicates due to the original
query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.