$Header: /cvsroot/pgsql/src/backend/executor/README,v 1.2 2002/12/05 15:50:30 tgl Exp $
The Postgres Executor
---------------------

The executor processes a tree of "plan nodes".  The plan tree is essentially
a demand-pull pipeline of tuple processing operations.  Each node, when
called, will produce the next tuple in its output sequence, or NULL if no
more tuples are available.  If the node is not a primitive relation-scanning
node, it will have child node(s) that it calls in turn to obtain input
tuples.

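The demand-pull protocol can be illustrated with a toy node type.  This is
only an illustrative sketch with invented names and int "tuples"; it is not
the executor's actual node interface (the real entry point is ExecProcNode):

```c
#include <stddef.h>

/* Toy tuple is an int; a NULL pointer means no more tuples. */
typedef struct Node Node;
struct Node
{
	const int *(*next) (Node *self);	/* pull one tuple, or NULL when done */
	Node	   *child;			/* input node, if any */
	/* per-node state */
	int			pos,
				limit;			/* scan node: next value / end of range */
	int			filter_mod;		/* filter node: keep multiples of this */
	int			current;		/* holds the tuple being returned */
};

/* Leaf scan node: emits 0 .. limit-1 in sequence. */
static const int *
scan_next(Node *self)
{
	if (self->pos >= self->limit)
		return NULL;
	self->current = self->pos++;
	return &self->current;
}

/* Filter node: pulls tuples from its child, keeps multiples of filter_mod. */
static const int *
filter_next(Node *self)
{
	const int  *t;

	while ((t = self->child->next(self->child)) != NULL)
	{
		if (*t % self->filter_mod == 0)
		{
			self->current = *t;
			return &self->current;
		}
	}
	return NULL;				/* child exhausted */
}
```

The top level simply calls next() on the root node until it returns NULL;
each node pulls from its child only on demand, so no intermediate result
sets are materialized.
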
Refinements on this basic model include:

* Choice of scan direction (forwards or backwards).  Caution: this is not
currently well-supported.  It works for primitive scan nodes, but not very
well for joins, aggregates, etc.

* Rescan command to reset a node and make it generate its output sequence
over again.

* Parameters that can alter a node's results.  After adjusting a parameter,
the rescan command must be applied to that node and all nodes above it.
There is a moderately intelligent scheme to avoid rescanning nodes
unnecessarily (for example, Sort does not rescan its input if no parameters
of the input have changed, since it can just reread its stored sorted data).

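The rescan refinement amounts to a second entry point that resets a node's
private state and recurses to its input.  A minimal sketch, again with
invented names (the real signature is ExecReScan's):

```c
#include <stddef.h>

/* Minimal node with both a "next" and a "rescan" operation. */
typedef struct RNode RNode;
struct RNode
{
	RNode	   *child;			/* input node, if any */
	int			pos,			/* scan state: next value to emit */
				limit;			/* end of output range */
};

/* Return the next value in sequence, or -1 when exhausted. */
static int
rnode_next(RNode *n)
{
	if (n->pos >= n->limit)
		return -1;
	return n->pos++;
}

/*
 * Reset this node and everything below it, so that the output
 * sequence can be generated over again from the beginning.
 */
static void
rnode_rescan(RNode *n)
{
	n->pos = 0;
	if (n->child)
		rnode_rescan(n->child);
}
```

A smarter node (such as Sort) would override the recursive part: if no
parameters below it have changed, it can skip rescanning its child and
replay its own stored output instead.
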
The plan tree concept implements SELECT directly: it is only necessary to
deliver the top-level result tuples to the client, or insert them into
another table in the case of INSERT ... SELECT.  (INSERT ... VALUES is
handled similarly, but the plan tree is just a Result node with no source
tables.)  For UPDATE, the plan tree selects the tuples that need to be
updated (WHERE condition) and delivers a new calculated tuple value for each
such tuple, plus a "junk" (hidden) tuple CTID identifying the target tuple.
The executor's top level then uses this information to update the correct
tuple.  DELETE is similar to UPDATE except that only a CTID need be
delivered by the plan tree.

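The division of labor for UPDATE can be modeled very roughly as follows: the
plan tree computes (new value, target row id) pairs, and the top level
applies them.  This is a toy model with hypothetical names, reducing a
relation to an int array and a CTID to an array index:

```c
/* Toy model: a "relation" is an int array, a "CTID" is an index into it. */
typedef struct
{
	int			new_value;		/* computed replacement value */
	int			ctid;			/* junk column: which row to overwrite */
} UpdateResult;

/*
 * The executor's top level: the plan tree delivered both the new tuple
 * value and the junk CTID, so applying the update is just a write to
 * the identified row.  DELETE would need only the ctid field.
 */
static void
apply_update(int *table, const UpdateResult *r)
{
	table[r->ctid] = r->new_value;
}
```
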
XXX a great deal more documentation needs to be written here...

Plan Trees and State Trees
--------------------------

The plan tree delivered by the planner contains a tree of Plan nodes (struct
types derived from struct Plan).  Each Plan node may have expression trees
associated with it, to represent its target list, qualification conditions,
etc.  During executor startup we build a parallel tree of identical structure
containing executor state nodes --- every plan and expression node type has
a corresponding executor state node type.  Each node in the state tree has a
pointer to its corresponding node in the plan tree, plus executor state data
as needed to implement that node type.  This arrangement allows the plan
tree to be completely read-only as far as the executor is concerned: all data
that is modified during execution is in the state tree.  Read-only plan trees
make life much simpler for plan caching and reuse.

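The parallel-tree arrangement can be sketched as below.  The field names are
simplified stand-ins for the real Plan/PlanState declarations, but the shape
of the recursion is the essential point: initialization mirrors the plan
tree's structure while keeping the plan nodes themselves untouched.

```c
#include <stdlib.h>

/* Read-only plan node, as built by the planner (grossly simplified). */
typedef struct Plan
{
	struct Plan *lefttree;
	struct Plan *righttree;
} Plan;

/* Per-execution state node, mirroring the plan tree's shape. */
typedef struct PlanState
{
	const Plan *plan;			/* back-link to the read-only plan node */
	struct PlanState *lefttree;
	struct PlanState *righttree;
	long		tuples_emitted;	/* example of mutable run-time state */
} PlanState;

/*
 * Build a state tree with the same structure as the plan tree.
 * All mutable execution data lives in the PlanState nodes, so the
 * Plan tree can be shared and reused across executions.
 */
static PlanState *
exec_init_node(const Plan *plan)
{
	PlanState  *ps;

	if (plan == NULL)
		return NULL;
	ps = calloc(1, sizeof(PlanState));
	ps->plan = plan;
	ps->lefttree = exec_init_node(plan->lefttree);
	ps->righttree = exec_init_node(plan->righttree);
	return ps;
}
```
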
Altogether there are four classes of nodes used in these trees: Plan nodes,
their corresponding PlanState nodes, Expr nodes, and their corresponding
ExprState nodes.  (Actually, there are also List nodes, which are used as
"glue" in all four kinds of tree.)

EvalPlanQual (READ COMMITTED update checking)
---------------------------------------------

For simple SELECTs, the executor need only pay attention to tuples that are
valid according to the snapshot seen by the current transaction (ie, they
were inserted by a previously committed transaction, and not deleted by any
previously committed transaction).  However, for UPDATE and DELETE it is
not acceptable to modify or delete a tuple that's been modified by an open
or concurrently-committed transaction.  If we are running in SERIALIZABLE
isolation level then we just raise an error when this condition is seen to
occur.  In READ COMMITTED isolation level, we must work a lot harder.

The basic idea in READ COMMITTED mode is to take the modified tuple
committed by the concurrent transaction (after waiting for it to commit,
if need be) and re-evaluate the query qualifications to see if it would
still meet the quals.  If so, we regenerate the updated tuple (if we are
doing an UPDATE) from the modified tuple, and finally update/delete the
modified tuple.  SELECT FOR UPDATE behaves similarly, except that its action
is just to mark the modified tuple for update by the current transaction.

To implement this checking, we actually re-run the entire query from scratch
for each modified tuple, but with the scan node that sourced the original
tuple set to return only the modified tuple, not the original tuple or any
of the rest of the relation.  If this query returns a tuple, then the
modified tuple passes the quals (and the query output is the suitably
modified update tuple, if we're doing UPDATE).  If no tuple is returned,
then the modified tuple fails the quals, so we ignore it and continue the
original query.  (This is reasonably efficient for simple queries, but may
be horribly slow for joins.  A better design would be nice; one thought for
future investigation is to treat the tuple substitution like a parameter,
so that we can avoid rescanning unrelated nodes.)

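The scan-node substitution can be sketched as a mode flag on the scan: in
recheck mode the scan returns exactly one tuple, the substituted (modified)
one, instead of walking the relation.  A toy model with invented names:

```c
#include <stddef.h>

/* Toy scan node over an int-array "relation". */
typedef struct
{
	const int  *rows;			/* the relation's contents */
	int			nrows;
	int			pos;			/* normal-mode scan position */
	const int  *test_tuple;		/* non-NULL => recheck mode */
	int			returned_test;	/* already emitted the test tuple? */
} ScanNode;

/*
 * In recheck mode, return only the substituted tuple (once); otherwise
 * scan the relation normally.  NULL signals end of output.
 */
static const int *
scan_getnext(ScanNode *s)
{
	if (s->test_tuple)
	{
		if (s->returned_test)
			return NULL;
		s->returned_test = 1;
		return s->test_tuple;	/* the modified tuple, nothing else */
	}
	if (s->pos >= s->nrows)
		return NULL;
	return &s->rows[s->pos++];
}
```

Rerunning the whole query with the scan in this mode is what lets the
unmodified upper plan nodes re-apply all the quals to the substituted tuple.
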
Note a fundamental bogosity of this approach: if the relation containing
the original tuple is being used in a self-join, the other instance(s) of
the relation will be treated as still containing the original tuple, whereas
logical consistency would demand that the modified tuple appear in them too.
But we'd have to actually substitute the modified tuple for the original,
while still returning all the rest of the relation, to ensure consistent
answers.  Implementing this correctly is a task for future work.

In UPDATE/DELETE, only the target relation needs to be handled this way,
so only one special recheck query needs to execute at a time.  In SELECT FOR
UPDATE, there may be multiple relations flagged FOR UPDATE, so it's possible
that while we are executing a recheck query for one modified tuple, we will
hit another modified tuple in another relation.  In this case we "stack up"
recheck queries: a sub-recheck query is spawned in which both the first and
second modified tuples will be returned as the only components of their
relations.  (In event of success, all these modified tuples will be marked
for update.)  Again, this isn't necessarily quite the right thing ... but in
simple cases it works.  Potentially, recheck queries could get nested to the
depth of the number of FOR UPDATE relations in the query.

It should be noted also that UPDATE/DELETE expect at most one tuple to
result from the modified query, whereas in the FOR UPDATE case it's possible
for multiple tuples to result (since we could be dealing with a join in
which multiple tuples join to the modified tuple).  We want FOR UPDATE to
mark all relevant tuples, so we pass all tuples output by all the stacked
recheck queries back to the executor toplevel for marking.