src/backend/executor/README
The Postgres Executor
=====================
The executor processes a tree of "plan nodes". The plan tree is essentially
a demand-pull pipeline of tuple processing operations. Each node, when
called, will produce the next tuple in its output sequence, or NULL if no
more tuples are available. If the node is not a primitive relation-scanning
node, it will have child node(s) that it calls in turn to obtain input
tuples.
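
As an illustration only, the demand-pull model can be sketched with toy
types. Everything named Toy* or toy_* below is hypothetical and is not the
real PlanState API; tuples are reduced to plain ints. A filter node pulls
from its child scan node until it finds a qualifying tuple, and each node
returns NULL once its output is exhausted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy node type; the real executor uses PlanState nodes. */
typedef struct ToyNode ToyNode;
struct ToyNode
{
	/* Returns the next tuple (an int here), or NULL when exhausted. */
	const int  *(*next) (ToyNode *self);
	ToyNode    *child;		/* input node; NULL for a primitive scan */
	const int  *rows;		/* scan only: the "relation" being scanned */
	size_t		nrows;		/* scan only: number of rows */
	size_t		pos;		/* scan only: current position */
	int			threshold;	/* filter only: pass tuples > threshold */
};

/* Primitive relation-scanning node: no child, reads its own storage. */
static const int *
toy_scan_next(ToyNode *self)
{
	if (self->pos >= self->nrows)
		return NULL;
	return &self->rows[self->pos++];
}

/* Non-primitive node: calls its child in turn to obtain input tuples. */
static const int *
toy_filter_next(ToyNode *self)
{
	const int  *tup;

	assert(self->child != NULL);
	while ((tup = self->child->next(self->child)) != NULL)
	{
		if (*tup > self->threshold)
			return tup;
	}
	return NULL;
}

/* Pull tuples from the top of the tree until it reports end of data. */
static int
toy_run(void)
{
	static const int data[] = {1, 5, 2, 8, 3};
	ToyNode		scan = {toy_scan_next, NULL, data, 5, 0, 0};
	ToyNode		filter = {toy_filter_next, &scan, NULL, 0, 0, 2};
	const int  *tup;
	int			sum = 0;

	while ((tup = filter.next(&filter)) != NULL)
		sum += *tup;
	return sum;
}
```

Only the caller at the top of the tree drives execution; the scan node does
no work until the filter asks it for a tuple.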

Refinements on this basic model include:

* Choice of scan direction (forwards or backwards).  Caution: this is not
  currently well-supported.  It works for primitive scan nodes, but not very
  well for joins, aggregates, etc.

* Rescan command to reset a node and make it generate its output sequence
  over again.

* Parameters that can alter a node's results.  After adjusting a parameter,
  the rescan command must be applied to that node and all nodes above it.
  There is a moderately intelligent scheme to avoid rescanning nodes
  unnecessarily (for example, Sort does not rescan its input if no parameters
  of the input have changed, since it can just reread its stored sorted data).

For a SELECT, it is only necessary to deliver the top-level result tuples
to the client. For INSERT/UPDATE/DELETE, the actual table modification
operations happen in a top-level ModifyTable plan node. If the query
includes a RETURNING clause, the ModifyTable node delivers the computed
RETURNING rows as output, otherwise it returns nothing. Handling INSERT
is pretty straightforward: the tuples returned from the plan tree below
ModifyTable are inserted into the correct result relation. For UPDATE,
the plan tree returns the new values of the updated columns, plus "junk"
(hidden) column(s) identifying which table row is to be updated. The
ModifyTable node must fetch that row to extract values for the unchanged
columns, combine the values into a new row, and apply the update. (For a
heap table, the row-identity junk column is a CTID, but other things may
be used for other table types.) For DELETE, the plan tree need only deliver
junk row-identity column(s), and the ModifyTable node visits each of those
rows and marks the row deleted.
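
The UPDATE flow described above can be sketched roughly as follows. This is
a simplified illustration, not ModifyTable's actual code: the ToyRow and
ToyUpdateTuple types and toy_apply_update() are hypothetical, an array index
stands in for the CTID junk column, and the row is overwritten in place,
whereas a real heap table would store a new row version.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NCOLS 3

/* Hypothetical stored row with a fixed number of int columns. */
typedef struct ToyRow
{
	int			cols[TOY_NCOLS];
} ToyRow;

/*
 * Hypothetical shape of what the plan tree hands to the top-level node for
 * UPDATE: new values for the changed columns only, plus a "junk" row
 * identity (an array index here, standing in for a CTID).
 */
typedef struct ToyUpdateTuple
{
	size_t		row_id;		/* junk column: which row to update */
	int			nchanged;	/* number of columns being changed */
	int			changed_col[TOY_NCOLS]; /* which columns change */
	int			new_val[TOY_NCOLS];		/* their new values */
} ToyUpdateTuple;

static void
toy_apply_update(ToyRow *table, size_t nrows, const ToyUpdateTuple *upd)
{
	ToyRow		newrow;

	assert(upd->row_id < nrows);

	/* Re-fetch the old row to obtain values of the unchanged columns. */
	newrow = table[upd->row_id];

	/* Merge in the new values of the changed columns. */
	for (int i = 0; i < upd->nchanged; i++)
		newrow.cols[upd->changed_col[i]] = upd->new_val[i];

	/* Apply the update (in place, in this toy only). */
	table[upd->row_id] = newrow;
}
```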

XXX a great deal more documentation needs to be written here...


Plan Trees and State Trees
--------------------------

The plan tree delivered by the planner contains a tree of Plan nodes (struct
types derived from struct Plan). During executor startup we build a parallel
tree of identical structure containing executor state nodes --- generally,
every plan node type has a corresponding executor state node type. Each node
in the state tree has a pointer to its corresponding node in the plan tree,
plus executor state data as needed to implement that node type. This
arrangement allows the plan tree to be completely read-only so far as the
executor is concerned: all data that is modified during execution is in the
state tree. Read-only plan trees make life much simpler for plan caching and
reuse.
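
A minimal sketch of this arrangement, using hypothetical ToyPlan and
ToyPlanState types rather than the real Plan and PlanState structs: startup
walks the read-only plan tree and builds a state tree of the same shape,
with each state node pointing back at its plan node. All mutable data (here
just a tuple counter) lives in the state tree.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical read-only plan node; never modified during execution. */
typedef struct ToyPlan
{
	const char *label;
	const struct ToyPlan *lefttree;
	const struct ToyPlan *righttree;
} ToyPlan;

/* Hypothetical state node holding everything that changes at run time. */
typedef struct ToyPlanState
{
	const ToyPlan *plan;	/* link back to the corresponding plan node */
	struct ToyPlanState *lefttree;
	struct ToyPlanState *righttree;
	long		ntuples;	/* example of mutable per-execution data */
} ToyPlanState;

/* Build a state tree that parallels the given plan tree. */
static ToyPlanState *
toy_init_node(const ToyPlan *plan)
{
	ToyPlanState *ps;

	if (plan == NULL)
		return NULL;
	ps = calloc(1, sizeof(ToyPlanState));
	assert(ps != NULL);
	ps->plan = plan;
	ps->lefttree = toy_init_node(plan->lefttree);
	ps->righttree = toy_init_node(plan->righttree);
	return ps;
}

/* Count the nodes in a state tree. */
static int
toy_count_nodes(const ToyPlanState *ps)
{
	if (ps == NULL)
		return 0;
	return 1 + toy_count_nodes(ps->lefttree) + toy_count_nodes(ps->righttree);
}

/* Release the state tree; the plan tree is left untouched for reuse. */
static void
toy_end_node(ToyPlanState *ps)
{
	if (ps == NULL)
		return;
	toy_end_node(ps->lefttree);
	toy_end_node(ps->righttree);
	free(ps);
}
```

Because the toy plan nodes are declared const, the same plan tree could be
handed to toy_init_node() repeatedly, which is the property that makes plan
caching simple.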
An executor state node is not necessarily created for every plan node. During
executor startup, execution-time partition pruning may determine that an
entire subplan cannot produce any matching rows, in which case no state node
is built for it. This currently only occurs for Append and MergeAppend nodes.
The pruned subplans are skipped, so the entries of the executor state's
subnode array no longer line up one-to-one with the plan's subplan list.
Each Plan node may have expression trees associated with it, to represent
its target list, qualification conditions, etc. These trees are also
read-only to the executor, but the executor state for expression evaluation
does not mirror the Plan expression's tree shape, as explained below.
Rather, there's just one ExprState node per expression tree, although this
may have sub-nodes for some complex expression node types.
Altogether there are four classes of nodes used in these trees: Plan nodes,
their corresponding PlanState nodes, Expr nodes, and ExprState nodes.
(Actually, there are also List nodes, which are used as "glue" in all
three tree-based representations.)


Expression Trees and ExprState nodes
------------------------------------

Expression trees, in contrast to Plan trees, are not mirrored into a
corresponding tree of state nodes. Instead each separately executable
expression tree (e.g. a Plan's qual or targetlist) is represented by one
ExprState node. The ExprState node contains the information needed to
evaluate the expression in a compact, linear form. That compact form is
stored as a flat array in ExprState->steps[] (an array of ExprEvalStep,
not ExprEvalStep *).

The reasons for choosing such a representation include:

- commonly the amount of work needed to evaluate one Expr-type node is
  small enough that the overhead of having to perform a tree-walk
  during evaluation is significant.
- the flat representation can be evaluated non-recursively within a single
  function, reducing stack depth and function call overhead.
- such a representation is usable both for fast interpreted execution,
  and for compiling into native code.

The Plan-tree representation of an expression is compiled into an
ExprState node by ExecInitExpr(). As much complexity as possible should
be handled by ExecInitExpr() (and helpers), instead of execution time
where both interpreted and compiled versions would need to deal with the
complexity. Besides duplicating effort between execution approaches,
runtime initialization checks also have a small but noticeable cost every
time the expression is evaluated. Therefore, we allow ExecInitExpr() to
precompute information that we do not expect to vary across execution of a
single query, for example the set of CHECK constraint expressions to be
applied to a domain type. This could not be done at plan time without
greatly increasing the number of events that require plan invalidation.
(Previously, some information of this kind was rechecked on each
expression evaluation, but that seems like unnecessary overhead.)


Expression Initialization
-------------------------

During ExecInitExpr() and similar routines, Expr trees are converted
into the flat representation. Each Expr node might be represented by
zero, one, or more ExprEvalSteps.
Each ExprEvalStep's work is determined by its opcode (of enum ExprEvalOp)
and it stores the result of its work into the Datum variable and boolean
null flag variable pointed to by ExprEvalStep->resvalue/resnull.
Complex expressions are performed by chaining together several steps.
For example, "a + b" (one OpExpr, with two Var expressions) would be
represented as two steps to fetch the Var values, and one step for the
evaluation of the function underlying the + operator. The steps for the
Vars would have their resvalue/resnull pointing directly to the appropriate
args[].value and args[].isnull elements in the FunctionCallInfoBaseData struct that
is used by the function evaluation step, thus avoiding extra work to copy
the result values around.
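
The "a + b" example can be sketched with a toy interpreter. The ToyStep,
ToyArg and ToyOp names are hypothetical stand-ins and much simpler than the
real ExprEvalStep, NullableDatum and ExprEvalOp; the null handling and the
terminating step are likewise assumptions made for illustration. The point
is that the two variable-fetch steps write straight into the argument slots
that the addition step reads, so nothing is copied in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef long ToyDatum;

/* Hypothetical opcodes; the real enum is ExprEvalOp. */
typedef enum ToyOp
{
	TOY_VAR,				/* fetch a column value */
	TOY_ADD,				/* add the two values in args[] */
	TOY_DONE				/* end of the step array */
} ToyOp;

/* Value plus null flag, loosely modeled on the args[] entries. */
typedef struct ToyArg
{
	ToyDatum	value;
	bool		isnull;
} ToyArg;

typedef struct ToyStep
{
	ToyOp		op;
	ToyDatum   *resvalue;	/* where this step stores its result */
	bool	   *resnull;	/* where this step stores its null flag */
	int			attnum;		/* TOY_VAR: which column to fetch */
	ToyArg	   *args;		/* TOY_ADD: the two input slots */
} ToyStep;

/* Evaluate a flat step array non-recursively, in a single function. */
static void
toy_run_steps(const ToyStep *steps, const ToyDatum *cols)
{
	for (const ToyStep *s = steps;; s++)
	{
		switch (s->op)
		{
			case TOY_VAR:
				*s->resvalue = cols[s->attnum];
				*s->resnull = false;
				break;
			case TOY_ADD:
				*s->resnull = s->args[0].isnull || s->args[1].isnull;
				*s->resvalue = *s->resnull ? 0 :
					s->args[0].value + s->args[1].value;
				break;
			case TOY_DONE:
				return;
		}
	}
}

/* Build and run the three steps (plus terminator) for "a + b". */
static ToyDatum
toy_eval_a_plus_b(ToyDatum a, ToyDatum b)
{
	ToyDatum	cols[2] = {a, b};
	ToyArg		args[2];
	ToyDatum	result = 0;
	bool		resnull = true;
	ToyStep		steps[] = {
		/* Var steps store directly into the function's argument slots */
		{TOY_VAR, &args[0].value, &args[0].isnull, 0, NULL},
		{TOY_VAR, &args[1].value, &args[1].isnull, 1, NULL},
		{TOY_ADD, &result, &resnull, 0, args},
		{TOY_DONE, NULL, NULL, 0, NULL},
	};

	toy_run_steps(steps, cols);
	assert(!resnull);
	return result;
}
```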
The last entry in a completed ExprState->steps array is always an
EEOP_DONE step; this removes the need to test for end-of-array while
iterating. Also, if the expression contains any variable references (to
user columns of the ExprContext's INNER, OUTER, or SCAN tuples), the steps
array begins with EEOP_*_FETCHSOME steps that ensure that the relevant
tuples have been deconstructed to make the required columns directly
available (cf. slot_getsomeattrs()). This allows individual Var-fetching
steps to be little more than an array lookup.
Most of ExecInitExpr()'s work is done by the recursive function
ExecInitExprRec() and its subroutines. ExecInitExprRec() maps one Expr
node into the steps required for execution, recursing as needed for
sub-expressions.
Each ExecInitExprRec() call has to specify where that subexpression's
results are to be stored (via the resv/resnull parameters). This allows
the above scenario of evaluating a (sub-)expression directly into
fcinfo->args[].value/isnull, but also requires some care: target Datum/isnull
variables may not be shared with another ExecInitExprRec() unless the
results are only needed by steps executing before further usages of those
target Datum/isnull variables. Due to the non-recursiveness of the
ExprEvalStep representation that's usually easy to guarantee.
ExecInitExprRec() pushes new operations into the ExprState->steps array
using ExprEvalPushStep(). To keep the steps as a consecutively laid out
array, ExprEvalPushStep() has to repalloc the entire array when there's
not enough space. Because of that it is *not* allowed to point directly
into any of the steps during expression initialization. Therefore, the
resv/resnull for a subexpression usually point to some storage that is
palloc'd separately from the steps array. For instance, the
FunctionCallInfoBaseData for a function call step is separately allocated
rather than being part of the ExprEvalStep array. The overall result
of a complete expression is typically returned into the resvalue/resnull
fields of the ExprState node itself.
Some steps, e.g. boolean expressions, allow skipping evaluation of
certain subexpressions. In the flat representation this amounts to
jumping to some later step rather than just continuing consecutively
with the next step. The target for such a jump is represented by
the integer index in the ExprState->steps array of the step to execute
next. (Compare the EEO_NEXT and EEO_JUMP macros in execExprInterp.c.)
Typically, ExecInitExprRec() has to push a jumping step into the steps
array, then recursively generate steps for the subexpression that might
get skipped over, then go back and fix up the jump target index using
the now-known length of the subexpression's steps. This is handled by
adjust_jumps lists in execExpr.c.
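The jump fixup described above can be illustrated with a small stand-alone sketch. This is not the code in execExpr.c: the ToyStep and ToyExpr types, the TOY_* opcodes, and the build_and() and toy_run() functions are hypothetical stand-ins invented for this example. It only shows the pattern of pushing a jump step with an unknown target, emitting the skippable subexpression, and then patching the target once the length is known.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcodes; the real step types are the EEOP_* values. */
typedef enum { TOY_CONST, TOY_JUMP_IF_FALSE, TOY_DONE } ToyOp;

typedef struct ToyStep
{
    ToyOp   op;
    bool    constval;   /* TOY_CONST: value stored into the result */
    int     jumpdone;   /* TOY_JUMP_IF_FALSE: index of the step to run next */
} ToyStep;

typedef struct ToyExpr
{
    ToyStep steps[16];
    int     nsteps;
    int     evaluated;  /* number of TOY_CONST steps actually executed */
} ToyExpr;

/* Append a step and return its index in the steps array. */
int
push_step(ToyExpr *e, ToyStep s)
{
    assert(e->nsteps < 16);
    e->steps[e->nsteps] = s;
    return e->nsteps++;
}

/* Emit steps for "a AND b", where b must be skipped if a is false. */
void
build_and(ToyExpr *e, bool a, bool b)
{
    int     jump;

    e->nsteps = 0;
    e->evaluated = 0;
    push_step(e, (ToyStep) {TOY_CONST, a, -1});
    /* The jump target is not known yet, so push a placeholder. */
    jump = push_step(e, (ToyStep) {TOY_JUMP_IF_FALSE, false, -1});
    /* Steps for the subexpression that might get skipped over. */
    push_step(e, (ToyStep) {TOY_CONST, b, -1});
    /* Go back and fix up the jump, now that the length is known. */
    e->steps[jump].jumpdone = e->nsteps;
    push_step(e, (ToyStep) {TOY_DONE, false, -1});
}

/* Execute the flat step array, following jumps by index. */
bool
toy_run(ToyExpr *e)
{
    bool    res = false;
    int     pc = 0;

    for (;;)
    {
        ToyStep *s = &e->steps[pc];

        switch (s->op)
        {
            case TOY_CONST:
                res = s->constval;
                e->evaluated++;
                pc++;               /* continue with the next step */
                break;
            case TOY_JUMP_IF_FALSE:
                pc = res ? pc + 1 : s->jumpdone;
                break;
            case TOY_DONE:
                return res;
        }
    }
}
```

With a = false, the second TOY_CONST step is never executed, because the jump step transfers control directly to the index recorded during fixup.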
The last step in constructing an ExprState is to apply ExecReadyExpr(),
which readies it for execution using whichever execution method has been
selected.
Expression Evaluation
---------------------
To allow for different methods of expression evaluation, and for
better branch/jump target prediction, expressions are evaluated by
calling ExprState->evalfunc (via ExecEvalExpr() and friends).
ExecReadyExpr() can choose the method of interpretation by setting
evalfunc to an appropriate function. The default execution function,
ExecInterpExpr, is implemented in execExprInterp.c; see its header
comment for details. Special-case evalfuncs are used for certain
especially-simple expressions.
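As a rough illustration of the evalfunc indirection, the following self-contained sketch picks an evaluation function once, when the expression is readied, and then always calls through the stored pointer. The ToyExprState type and the toy_* functions are hypothetical and greatly simplified; in particular, the "expression" here is just a list of integers to be summed, which has nothing to do with the real step representation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyExprState ToyExprState;
typedef int (*ToyEvalFunc) (ToyExprState *state);

struct ToyExprState
{
    ToyEvalFunc evalfunc;   /* chosen once, when the expression is readied */
    const int  *consts;     /* toy "steps": constants to be summed */
    int         nconsts;
};

/* General-purpose method: loops over every step. */
int
toy_interp(ToyExprState *state)
{
    int     sum = 0;
    int     i;

    for (i = 0; i < state->nconsts; i++)
        sum += state->consts[i];
    return sum;
}

/* Special-case method for an especially simple expression. */
int
toy_just_const(ToyExprState *state)
{
    return state->consts[0];
}

/* Analogous in spirit to readying an expression: select the method. */
void
toy_ready_expr(ToyExprState *state)
{
    if (state->nconsts == 1)
        state->evalfunc = toy_just_const;
    else
        state->evalfunc = toy_interp;
}

/* Callers never need to know which method was selected. */
int
toy_eval_expr(ToyExprState *state)
{
    return state->evalfunc(state);
}
```

The point of the design is that the choice is made once per expression rather than once per evaluation, and that a different method (for example, compiled code) can be substituted without changing callers.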
Note that a lot of the more complex expression evaluation steps, which are
less performance-critical than the simpler ones, are implemented as
separate functions outside the fast-path of expression execution, allowing
their implementation to be shared between interpreted and compiled
expression evaluation. This means that these helper functions are not
allowed to perform expression step dispatch themselves, as the method of
dispatch will vary based on the caller. The helpers therefore cannot call
for the execution of subexpressions; all subexpression results they need
must be computed by earlier steps. And dispatch to the following
expression step must be performed after returning from the helper.
Targetlist Evaluation
---------------------
ExecBuildProjectionInfo builds an ExprState that has the effect of
evaluating a targetlist into ExprState->resultslot. A generic targetlist
expression is executed by evaluating it as discussed above (storing the
result into the ExprState's resvalue/resnull fields) and then using an
EEOP_ASSIGN_TMP step to move the result into the appropriate tts_values[]
and tts_isnull[] array elements of the result slot. There are special
fast-path step types (EEOP_ASSIGN_*_VAR) to handle targetlist entries that
are simple Vars using only one step instead of two.
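The difference between the two-step and one-step paths can be sketched as follows. This is a toy model, not the real interpreter: ToySlot, ToyStep, ToyProjection, the TOY_* opcodes, and toy_project() are all hypothetical names, and the only "generic expression" supported is adding a constant to an input column.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NATTS 4

typedef struct ToySlot
{
    long    values[TOY_NATTS];
    bool    isnull[TOY_NATTS];
} ToySlot;

typedef enum { TOY_ADD_CONST, TOY_ASSIGN_TMP, TOY_ASSIGN_VAR } ToyOp;

typedef struct ToyStep
{
    ToyOp   op;
    int     attnum;     /* input column (TOY_ADD_CONST, TOY_ASSIGN_VAR) */
    int     resultnum;  /* output column (TOY_ASSIGN_TMP, TOY_ASSIGN_VAR) */
    long    addend;     /* TOY_ADD_CONST only */
} ToyStep;

typedef struct ToyProjection
{
    long     resvalue;  /* scratch result of the last generic expression */
    bool     resnull;
    ToySlot *resultslot;
} ToyProjection;

void
toy_project(ToyProjection *p, const ToyStep *steps, int nsteps,
            const ToySlot *in)
{
    int     i;

    for (i = 0; i < nsteps; i++)
    {
        const ToyStep *s = &steps[i];

        switch (s->op)
        {
            case TOY_ADD_CONST:
                /* generic expression: result goes to resvalue/resnull */
                p->resnull = in->isnull[s->attnum];
                p->resvalue = p->resnull ? 0 :
                    in->values[s->attnum] + s->addend;
                break;
            case TOY_ASSIGN_TMP:
                /* second step: move the scratch result into the slot */
                p->resultslot->values[s->resultnum] = p->resvalue;
                p->resultslot->isnull[s->resultnum] = p->resnull;
                break;
            case TOY_ASSIGN_VAR:
                /* fast path: copy a plain column in a single step */
                p->resultslot->values[s->resultnum] = in->values[s->attnum];
                p->resultslot->isnull[s->resultnum] = in->isnull[s->attnum];
                break;
        }
    }
}
```

A targetlist equivalent to "b, a + 10" would then need three steps: one TOY_ASSIGN_VAR for the plain column, and a TOY_ADD_CONST followed by a TOY_ASSIGN_TMP for the computed one.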
Memory Management
-----------------
A "per query" memory context is created during CreateExecutorState();
all storage allocated during an executor invocation is allocated in that
context or a child context. This allows easy reclamation of storage
during executor shutdown --- rather than messing with retail pfree's and
probable storage leaks, we just destroy the memory context.
In particular, the plan state trees and expression state trees described
in the previous section are allocated in the per-query memory context.
To avoid intra-query memory leaks, most processing while a query runs
is done in "per tuple" memory contexts, which are so-called because they
are typically reset to empty once per tuple. Per-tuple contexts are usually
associated with ExprContexts, and commonly each PlanState node has its own
ExprContext to evaluate its qual and targetlist expressions in.
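The reason per-tuple contexts prevent intra-query leaks can be seen in a minimal model. The sketch below is not the PostgreSQL memory context API; ToyContext, toy_alloc(), toy_reset(), and toy_run_query() are hypothetical, and the "context" is just a fixed-size bump allocator. It shows query-lifespan state living in one context while per-tuple scratch space is reclaimed wholesale after every tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CONTEXT_SIZE 1024

typedef struct ToyContext
{
    char    buf[TOY_CONTEXT_SIZE];
    size_t  used;           /* bytes currently allocated */
    size_t  high_water;     /* largest value "used" has ever reached */
} ToyContext;

/* Bump-allocate from the context; returns NULL when it is full. */
void *
toy_alloc(ToyContext *cxt, size_t size)
{
    void   *p;

    if (size > TOY_CONTEXT_SIZE - cxt->used)
        return NULL;
    p = cxt->buf + cxt->used;
    cxt->used += size;
    if (cxt->used > cxt->high_water)
        cxt->high_water = cxt->used;
    return p;
}

/* Release everything in the context at once; no retail frees needed. */
void
toy_reset(ToyContext *cxt)
{
    cxt->used = 0;
}

/* Process ntuples tuples; returns how many were processed. */
int
toy_run_query(ToyContext *per_query, ToyContext *per_tuple, int ntuples)
{
    char   *state = toy_alloc(per_query, 32);   /* lives for the whole query */
    int     processed = 0;
    int     i;

    if (state == NULL)
        return 0;
    memset(state, 0, 32);

    for (i = 0; i < ntuples; i++)
    {
        /* scratch space that is only needed while handling this tuple */
        char   *scratch = toy_alloc(per_tuple, 64);

        if (scratch == NULL)
            break;              /* would happen quickly without the reset */
        memset(scratch, i & 0xff, 64);
        state[0] = scratch[0];
        processed++;
        toy_reset(per_tuple);   /* reset to empty once per tuple */
    }
    return processed;
}
```

Without the reset, the 1024-byte per-tuple context would be exhausted after 16 tuples; with it, the context never holds more than one tuple's worth of scratch data.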
Query Processing Control Flow
-----------------------------
This is a sketch of control flow for full query processing:
    CreateQueryDesc

    ExecutorStart
        CreateExecutorState
            creates per-query context
        switch to per-query context to run ExecInitNode
        AfterTriggerBeginQuery
        ExecInitNode --- recursively scans plan tree
            ExecInitNode
                recurse into subsidiary nodes
            CreateExprContext
                creates per-tuple context
            ExecInitExpr

    ExecutorRun
        ExecProcNode --- recursively called in per-query context
            ExecEvalExpr --- called in per-tuple context
            ResetExprContext --- to free memory

    ExecutorFinish
        ExecPostprocessPlan --- run any unfinished ModifyTable nodes
        AfterTriggerEndQuery

    ExecutorEnd
        ExecEndNode --- recursively releases resources
        FreeExecutorState
            frees per-query context and child contexts

    FreeQueryDesc
Per above comments, it's not really critical for ExecEndNode to free any
memory; it'll all go away in FreeExecutorState anyway. However, we do need to
be careful to close relations, drop buffer pins, etc, so we do need to scan
the plan state tree to find these sorts of resources.
The executor can also be used to evaluate simple expressions without any Plan
tree ("simple" meaning "no aggregates and no sub-selects", though such might
be hidden inside function calls). This case has a flow of control like
    CreateExecutorState
        creates per-query context
    CreateExprContext -- or use GetPerTupleExprContext(estate)
        creates per-tuple context
    ExecPrepareExpr
        temporarily switch to per-query context
        run the expression through expression_planner
        ExecInitExpr
    Repeatedly do:
        ExecEvalExprSwitchContext
            ExecEvalExpr --- called in per-tuple context
        ResetExprContext --- to free memory
    FreeExecutorState
        frees per-query context, as well as ExprContext
        (a separate FreeExprContext call is not necessary)
EvalPlanQual (READ COMMITTED Update Checking)
---------------------------------------------
For simple SELECTs, the executor need only pay attention to tuples that are
valid according to the snapshot seen by the current transaction (ie, they
were inserted by a previously committed transaction, and not deleted by any
previously committed transaction). However, for UPDATE and DELETE it is not
cool to modify or delete a tuple that's been modified by an open or
concurrently-committed transaction. If we are running in REPEATABLE READ or
SERIALIZABLE isolation level then we just raise an error when this condition
is seen to occur. In READ COMMITTED isolation level, we must work a lot harder.
The basic idea in READ COMMITTED mode is to take the modified tuple
committed by the concurrent transaction (after waiting for it to commit,
if need be) and re-evaluate the query qualifications to see if it would
still meet the quals. If so, we regenerate the updated tuple (if we are
doing an UPDATE) from the modified tuple, and finally update/delete the
modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its
action is just to lock the modified tuple and return results based on that
version of the tuple.
To implement this checking, we actually re-run the query from scratch for
each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the
relation scan nodes tweaked to return only the current tuples --- either
the original ones, or the updated (and now locked) versions of the modified
tuple(s). If this query returns a tuple, then the modified tuple(s) pass
the quals (and the query output is the suitably modified update tuple, if
we're doing UPDATE). If no tuple is returned, then the modified tuple(s)
fail the quals, so we ignore the current result tuple and continue the
original query.
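The recheck logic can be summarized with a deliberately simplified sketch. None of this is executor code: ToyTuple, ToyQual, toy_recheck(), and toy_withdraw() are hypothetical, a "concurrent update" is modelled as a linked newer version that is assumed to be already committed, and the update is applied in place rather than by creating a new tuple version. The sketch only shows the decision: find the latest version, re-evaluate the quals against it, and compute the new values from that version rather than from the one the snapshot saw.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyTuple
{
    int     balance;
    struct ToyTuple *newer; /* version committed by a concurrent updater */
} ToyTuple;

typedef bool (*ToyQual) (const ToyTuple *tup);

/* Stand-in for the query's WHERE clause: balance >= 100. */
bool
toy_balance_at_least_100(const ToyTuple *tup)
{
    return tup->balance >= 100;
}

/*
 * Return the tuple version the update should act on, or NULL if the row
 * should be skipped.
 */
ToyTuple *
toy_recheck(ToyTuple *snapshot_version, ToyQual qual)
{
    ToyTuple *latest = snapshot_version;

    if (!qual(snapshot_version))
        return NULL;            /* original query never selected this row */
    if (latest->newer == NULL)
        return latest;          /* normal case: no concurrent modification */

    /* Chase the update chain to the modified tuple. */
    while (latest->newer != NULL)
        latest = latest->newer;

    /* Re-evaluate the quals against the modified tuple. */
    return qual(latest) ? latest : NULL;
}

/* Toy "UPDATE ... SET balance = balance - 100 WHERE balance >= 100". */
bool
toy_withdraw(ToyTuple *snapshot_version)
{
    ToyTuple *target = toy_recheck(snapshot_version,
                                   toy_balance_at_least_100);

    if (target == NULL)
        return false;           /* modified tuple fails the quals; ignore it */
    target->balance -= 100;     /* new value is based on the modified tuple */
    return true;
}
```

If the snapshot saw a balance of 150 but a concurrent transaction committed 50, the row is skipped; if the concurrent transaction committed 300, the result is 200 rather than 50.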
In UPDATE/DELETE, only the target relation needs to be handled this way.
In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE,
so we obtain lock on the current tuple version in each such relation before
executing the recheck.
It is also possible that there are relations in the query that are not
to be locked (they are neither the UPDATE/DELETE target nor specified to
be locked in SELECT FOR UPDATE/SHARE). When re-running the test query
we want to use the same rows from these relations that were joined to
the locked rows. For ordinary relations this can be implemented relatively
cheaply by including the row TID in the join outputs and re-fetching that
TID. (The re-fetch is expensive, but we're trying to optimize the normal
case where no re-test is needed.) We also have to consider non-table
relations, such as a ValuesScan or FunctionScan. For these, since there
is no equivalent of TID, the only practical solution seems to be to include
the entire row value in the join output row.
We disallow set-returning functions in the targetlist of SELECT FOR UPDATE,
so as to ensure that at most one tuple can be returned for any particular
set of scan tuples. Otherwise we'd get duplicates due to the original
query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
Asynchronous Execution
----------------------
In cases where a node is waiting on an event external to the database system,
such as a ForeignScan awaiting network I/O, it's desirable for the node to
indicate that it cannot return any tuple immediately but may be able to do so
at a later time. A process which discovers this type of situation can always
handle it simply by blocking, but this may waste time that could be spent
executing some other part of the plan tree where progress could be made
immediately. This is particularly likely to occur when the plan tree contains
an Append node. Asynchronous execution runs multiple parts of an Append node
concurrently rather than serially to improve performance.
For asynchronous execution, an Append node must first request a tuple from an
async-capable child node using ExecAsyncRequest. Next, it must execute the
asynchronous event loop using ExecAppendAsyncEventWait. Eventually, when a
child node to which an asynchronous request has been made produces a tuple,
the Append node will receive it from the event loop via ExecAsyncResponse. In
the current implementation of asynchronous execution, the only node type that
requests tuples from an async-capable child node is an Append, while the only
node type that might be async-capable is a ForeignScan.
Typically, the ExecAsyncResponse callback is the only one required for nodes
that wish to request tuples asynchronously. On the other hand, async-capable
nodes generally need to implement three methods:
1. When an asynchronous request is made, the node's ExecAsyncRequest callback
   will be invoked; it should use ExecAsyncRequestPending to indicate that the
   request is pending for a callback described below. Alternatively, it can
   use ExecAsyncRequestDone if a result is available immediately.

2. When the event loop wishes to wait or poll for file descriptor events, the
   node's ExecAsyncConfigureWait callback will be invoked to configure the
   file descriptor event for which the node wishes to wait.

3. When the file descriptor becomes ready, the node's ExecAsyncNotify callback
   will be invoked; like #1, it should use ExecAsyncRequestPending for another
   callback or ExecAsyncRequestDone to return a result immediately.
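The interplay of the three callbacks can be modelled with a toy event loop. The sketch below does not use the executor's data structures or any real file descriptors: ToyAsyncRequest and all toy_* functions are hypothetical, and remote latency is simulated by a tick counter that the configure-wait stand-in decrements. It is meant only to show how results arrive in completion order rather than in child order.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyAsyncRequest
{
    bool    callback_pending;   /* waiting for a later notify callback */
    bool    request_complete;   /* result field is valid */
    int     result;
    int     ticks_until_ready;  /* stand-in for remote latency */
    int     payload;            /* the "tuple" the child will produce */
} ToyAsyncRequest;

void
toy_request_pending(ToyAsyncRequest *r)
{
    r->callback_pending = true;
    r->request_complete = false;
}

void
toy_request_done(ToyAsyncRequest *r, int result)
{
    r->callback_pending = false;
    r->request_complete = true;
    r->result = result;
}

/* Callback 1: answer immediately if possible, else mark as pending. */
void
toy_async_request(ToyAsyncRequest *r)
{
    if (r->ticks_until_ready == 0)
        toy_request_done(r, r->payload);
    else
        toy_request_pending(r);
}

/*
 * Callback 2 stand-in: a real node would register a file descriptor event
 * here.  This toy advances simulated time by one tick and reports whether
 * the simulated descriptor has become ready.
 */
bool
toy_async_configure_wait(ToyAsyncRequest *r)
{
    if (r->ticks_until_ready > 0)
        r->ticks_until_ready--;
    return r->ticks_until_ready == 0;
}

/* Callback 3: the descriptor is ready, so deliver the result. */
void
toy_async_notify(ToyAsyncRequest *r)
{
    toy_request_done(r, r->payload);
}

/* Requestor side: returns the number of results, in completion order. */
int
toy_append(ToyAsyncRequest *reqs, int n, int *order)
{
    int     ndone = 0;
    int     i;

    for (i = 0; i < n; i++)
    {
        toy_async_request(&reqs[i]);
        if (reqs[i].request_complete)
            order[ndone++] = i;
    }
    while (ndone < n)           /* the event loop */
    {
        for (i = 0; i < n; i++)
        {
            if (!reqs[i].callback_pending)
                continue;
            if (toy_async_configure_wait(&reqs[i]))
            {
                toy_async_notify(&reqs[i]);
                if (reqs[i].request_complete)
                    order[ndone++] = i;
            }
        }
    }
    return ndone;
}
```

With three children whose simulated latencies are 2, 0, and 1 ticks, the requestor receives results from the second child first, then the third, then the first, without ever blocking on the slowest child while others are ready.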