Provide a bit more high-level documentation for the GEQO planner.

Per request from Luca Ferrari.
Tom Lane 2007-07-21 04:02:41 +00:00
parent 7abe764f17
commit ddb93cac24
2 changed files with 85 additions and 21 deletions


@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.29 2007/01/31 20:56:16 momjian Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.30 2007/07/21 04:02:41 tgl Exp $ -->
<chapter id="overview">
<title>Overview of PostgreSQL Internals</title>
@ -345,9 +345,10 @@
can be executed would take an excessive amount of time and memory
space. In particular, this occurs when executing queries
involving large numbers of join operations. In order to determine
a reasonable (not optimal) query plan in a reasonable amount of
time, <productname>PostgreSQL</productname> uses a <xref
linkend="geqo" endterm="geqo-title">.
a reasonable (not necessarily optimal) query plan in a reasonable amount
of time, <productname>PostgreSQL</productname> uses a <xref
linkend="geqo" endterm="geqo-title"> when the number of joins
exceeds a threshold (see <xref linkend="guc-geqo-threshold">).
</para>
</note>
@ -380,20 +381,17 @@
the index's <firstterm>operator class</>, another plan is created using
the B-tree index to scan the relation. If there are further indexes
present and the restrictions in the query happen to match a key of an
index further plans will be considered.
index, further plans will be considered. Index scan plans are also
generated for indexes that have a sort ordering that can match the
query's <literal>ORDER BY</> clause (if any), or a sort ordering that
might be useful for merge joining (see below).
</para>
<para>
After all feasible plans have been found for scanning single relations,
plans for joining relations are created. The planner/optimizer
preferentially considers joins between any two relations for which there
exist a corresponding join clause in the <literal>WHERE</literal> qualification (i.e. for
which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
exists). Join pairs with no join clause are considered only when there
is no other choice, that is, a particular relation has no available
join clauses to any other relation. All possible plans are generated for
every join pair considered
by the planner/optimizer. The three possible join strategies are:
If the query requires joining two or more relations,
plans for joining relations are considered
after all feasible plans have been found for scanning single relations.
The three available join strategies are:
<itemizedlist>
<listitem>
@ -439,6 +437,26 @@
cheapest one.
</para>
<para>
If the query uses fewer than <xref linkend="guc-geqo-threshold">
relations, a near-exhaustive search is conducted to find the best
join sequence. The planner preferentially considers joins between any
two relations for which there exists a corresponding join clause in the
<literal>WHERE</literal> qualification (i.e. for
which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
exists). Join pairs with no join clause are considered only when there
is no other choice, that is, a particular relation has no available
join clauses to any other relation. All possible plans are generated for
every join pair considered by the planner, and the one that is
(estimated to be) the cheapest is chosen.
</para>
<para>
When the number of relations reaches or exceeds
<varname>geqo_threshold</varname>, the join
sequences considered are determined by heuristics, as described
in <xref linkend="geqo">. Otherwise the process is the same.
</para>
<para>
The finished plan tree consists of sequential or index scans of
the base relations, plus nested-loop, merge, or hash join nodes as


@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.39 2007/02/16 03:50:29 momjian Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.40 2007/07/21 04:02:41 tgl Exp $ -->
<chapter id="geqo">
<chapterinfo>
@ -186,11 +186,6 @@
<productname>PostgreSQL</productname> optimizer.
</para>
<para>
Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's Genitor
algorithm.
</para>
<para>
Specific characteristics of the <acronym>GEQO</acronym>
implementation in <productname>PostgreSQL</productname>
@ -224,6 +219,11 @@
</itemizedlist>
</para>
<para>
Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
Genitor algorithm.
</para>
<para>
The <acronym>GEQO</acronym> module allows
the <productname>PostgreSQL</productname> query optimizer to
@ -231,6 +231,42 @@
non-exhaustive search.
</para>
<sect2>
<title>Generating Possible Plans with <acronym>GEQO</acronym></title>
<para>
The <acronym>GEQO</acronym> planning process uses the standard planner
code to generate plans for scans of individual relations. Then join
plans are developed using the genetic approach. As shown above, each
candidate join plan is represented by a sequence in which to join
the base relations. In the initial stage, the <acronym>GEQO</acronym>
code simply generates some possible join sequences at random. For each
join sequence considered, the standard planner code is invoked to
estimate the cost of performing the query using that join sequence.
(For each step of the join sequence, all three possible join strategies
are considered; and all the initially-determined relation scan plans
are available. The estimated cost is the cheapest of these
possibilities.) Join sequences with lower estimated cost are considered
<quote>more fit</> than those with higher cost. The genetic algorithm
discards the least fit candidates. Then new candidates are generated
by combining genes of more-fit candidates &mdash; that is, by using
randomly-chosen portions of known low-cost join sequences to create
new sequences for consideration. This process is repeated until a
preset number of join sequences have been considered; then the best
one found at any time during the search is used to generate the finished
plan.
</para>
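The loop described in the paragraph above can be sketched roughly as follows. This is an illustrative Python toy, not the PostgreSQL implementation: `sequence_cost`, `recombine`, and `geqo_sketch` are hypothetical names, the cost model (sum of intermediate result sizes for a left-deep join order) merely stands in for the standard planner's cost estimation, and the slice-based recombination is a simplified substitute for whatever operator the real module uses.

```python
import random


def sequence_cost(seq, rows, selectivity):
    """Toy stand-in for the planner's cost estimate (an assumption, not
    PostgreSQL's model): sum of intermediate result sizes when the
    relations are joined left to right in the order given by seq."""
    joined = [seq[0]]
    size = float(rows[seq[0]])
    cost = 0.0
    for rel in seq[1:]:
        sel = 1.0
        for prev in joined:
            # Pairs without a join clause get selectivity 1.0 (cross product).
            sel *= selectivity.get(frozenset((prev, rel)), 1.0)
        size = size * rows[rel] * sel
        cost += size
        joined.append(rel)
    return cost


def recombine(a, b, rng):
    """Keep a randomly chosen slice of parent a in place and fill the
    remaining positions with the other relations in parent b's order."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    kept = a[i:j]
    rest = [r for r in b if r not in kept]
    return rest[:i] + kept + rest[i:]


def geqo_sketch(rels, cost_fn, pool_size=16, generations=200, seed=0):
    """Steady-state genetic search over join sequences (hypothetical sketch)."""
    rng = random.Random(seed)
    # Initial stage: random join sequences, ordered from most to least fit.
    pool = [rng.sample(rels, len(rels)) for _ in range(pool_size)]
    pool.sort(key=cost_fn)
    for _ in range(generations):
        # Parents come from the fitter half of the pool.
        a, b = rng.sample(pool[: pool_size // 2], 2)
        # The child replaces the least fit candidate.
        pool[-1] = recombine(a, b, rng)
        pool.sort(key=cost_fn)
    # Only the worst candidate is ever discarded, so pool[0] is the
    # cheapest sequence seen at any point during the search.
    return pool[0]


rows = {"a": 10, "b": 100, "c": 1000, "d": 50}
selectivity = {frozenset(("a", "b")): 0.1, frozenset(("b", "c")): 0.01}
best = geqo_sketch(list(rows), lambda s: sequence_cost(s, rows, selectivity))
```

A fixed `seed` makes this sketch repeatable; seeding from an unpredictable source would reproduce the run-to-run variation in the chosen sequence.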
<para>
This process is inherently nondeterministic, because of the randomized
choices made during both the initial population selection and subsequent
<quote>mutation</> of the best candidates. Hence different plans may
be selected from one run to the next, resulting in varying run time
and varying output row order.
</para>
</sect2>
<sect2 id="geqo-future">
<title>Future Implementation Tasks for
<productname>PostgreSQL</> <acronym>GEQO</acronym></title>
@ -257,6 +293,16 @@
</itemizedlist>
</para>
<para>
In the current implementation, the fitness of each candidate join
sequence is estimated by running the standard planner's join selection
and cost estimation code from scratch. To the extent that different
candidates use similar sub-sequences of joins, a great deal of work
will be repeated. This could be made significantly faster by retaining
cost estimates for sub-joins. The problem is to avoid expending
unreasonable amounts of memory on retaining that state.
</para>
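One way to picture the idea of retaining sub-join cost estimates while bounding memory is a size-limited cache keyed by join-sequence prefix. The following Python sketch is only an illustration under stated assumptions: `make_cached_cost` and `prefix_state` are hypothetical names, the cost model is the same kind of toy (sum of intermediate result sizes) rather than the planner's real estimates, and an LRU eviction policy is just one possible answer to the memory concern.

```python
from functools import lru_cache


def make_cached_cost(rows, selectivity, max_entries=1024):
    """Build a cost function that reuses results for shared join prefixes.
    max_entries caps how much state is retained (hypothetical parameter)."""

    @lru_cache(maxsize=max_entries)
    def prefix_state(prefix):
        # Returns (result_size, accumulated_cost) for joining the
        # relations in the tuple prefix from left to right.
        if len(prefix) == 1:
            return float(rows[prefix[0]]), 0.0
        size, cost = prefix_state(prefix[:-1])  # reused when already cached
        rel = prefix[-1]
        sel = 1.0
        for prev in prefix[:-1]:
            sel *= selectivity.get(frozenset((prev, rel)), 1.0)
        size = size * rows[rel] * sel
        return size, cost + size

    def cost(seq):
        return prefix_state(tuple(seq))[1]

    # Expose cache statistics so the amount of reuse can be inspected.
    cost.cache_info = prefix_state.cache_info
    return cost


rows = {"a": 10, "b": 100, "c": 1000, "d": 50}
selectivity = {frozenset(("a", "b")): 0.1, frozenset(("b", "c")): 0.01}
cached_cost = make_cached_cost(rows, selectivity)
first = cached_cost(["a", "b", "c"])
second = cached_cost(["a", "b", "d"])  # reuses the cached ("a", "b") prefix
```

Keying on prefixes only helps candidates that begin with the same joins; sharing work between sub-sequences that appear at different positions would need a different key, such as the set of relations joined so far.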
<para>
At a more basic level, it is not clear that solving query optimization
with a GA algorithm designed for TSP is appropriate. In the TSP case,