postgresql/doc/src/sgml/geqo.sgml

<!--
$Header: /cvsroot/pgsql/doc/src/sgml/geqo.sgml,v 1.8 1999/03/30 15:25:56 thomas Exp $
Genetic Optimizer

$Log: geqo.sgml,v $
Revision 1.8  1999/03/30 15:25:56  thomas
Fix up small markup problems. Force omit-tags to nil so we have tag
 completion as required by the newest DocBook conventions.

Revision 1.7  1999/02/19 01:57:08  thomas
Fix SGML markup from last content changes.

Revision 1.6  1999/02/18 05:26:17  momjian
Enable bushy plans by default.

Revision 1.5  1998/12/29 02:24:15  thomas
Clean up to ensure tag completion as required by the newest versions
 of Norm's Modular Style Sheets and jade/docbook.
From Vince Vielhaber <vev@michvhf.com>.

Revision 1.4  1998/08/15 06:55:05  thomas
Change Id field in chapter tag to change html output file name.

-->

<Chapter Id="geqo">
<DocInfo>
<Author>
<FirstName>Martin</FirstName>
<SurName>Utesch</SurName>
<Affiliation>
<Orgname>
University of Mining and Technology
</Orgname>
<Orgdiv>
Institute of Automatic Control
</Orgdiv>
<Address>
<City>
Freiberg
</City>
<Country>
Germany
</Country>
</Address>
</Affiliation>
</Author>
<Date>1997-10-02</Date>
</DocInfo>

<Title>Genetic Query Optimization in Database Systems</Title>

<Para>
<Note>
<Title>Author</Title>
<Para>
Written by <ULink url="utesch@aut.tu-freiberg.de">Martin Utesch</ULink>
for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
</Para>
</Note>
</para>

<Sect1>
<Title>Query Handling as a Complex Optimization Problem</Title>

<Para>
   Among all relational operators the most difficult one to process and
optimize is the <FirstTerm>join</FirstTerm>. The number of alternative plans to answer a query
grows exponentially with the number of <Command>join</Command>s included in it. Further
optimization effort is caused by the support of a variety of <FirstTerm>join methods</FirstTerm>
 (e.g., nested loop, index scan, merge join in <ProductName>Postgres</ProductName>) to
process individual <Command>join</Command>s and a diversity of <FirstTerm>indices</FirstTerm> (e.g., r-tree,
b-tree, hash in <ProductName>Postgres</ProductName>) as access paths for relations.
</para>

<Para>
   The current <ProductName>Postgres</ProductName> optimizer implementation performs a <FirstTerm>near-
exhaustive search</FirstTerm> over the space of alternative strategies. This query
optimization technique is inadequate to support database application
domains that involve the need for extensive queries, such as artificial
intelligence.
</para>

<Para>
   The Institute of Automatic Control at the University of Mining and
Technology, in Freiberg, Germany, encountered the described problems as its
folks wanted to take the <ProductName>Postgres</ProductName> DBMS as the backend for a decision
support knowledge based system for the maintenance of an electrical
power grid. The DBMS needed to handle large <Command>join</Command> queries for the
inference machine of the knowledge based system.
</para>

<Para>
   Performance difficulties within exploring the space of possible query
plans arose the demand for a new optimization technique being developed.
</para>

<Para>
   In the following we propose the implementation of a <FirstTerm>Genetic Algorithm</FirstTerm>
 as an option for the database query optimization problem.
</para>
</sect1>

<Sect1>
<Title>Genetic Algorithms (<Acronym>GA</Acronym>)</Title>

<Para>
   The <Acronym>GA</Acronym> is a heuristic optimization method which operates through
determined, randomized search. The set of possible solutions for the
optimization problem is considered as a <FirstTerm>population</FirstTerm> of <FirstTerm>individuals</FirstTerm>.
The degree of adaption of an individual to its environment is specified
by its <FirstTerm>fitness</FirstTerm>.
</para>

<Para>
   The coordinates of an individual in the search space are represented
by <FirstTerm>chromosomes</FirstTerm>, in essence a set of character strings. A <FirstTerm>gene</FirstTerm> is a
subsection of a chromosome which encodes the value of a single parameter
being optimized. Typical encodings for a gene could be <FirstTerm>binary</FirstTerm> or
<FirstTerm>integer</FirstTerm>.
</para>

<Para>
   Through simulation of the evolutionary operations <FirstTerm>recombination</FirstTerm>,
<FirstTerm>mutation</FirstTerm>, and <FirstTerm>selection</FirstTerm> new generations of search points are found
that show a higher average fitness than their ancestors.
</para>

<Para>
   According to the "comp.ai.genetic" <Acronym>FAQ</Acronym> it cannot be stressed too
strongly that a <Acronym>GA</Acronym> is not a pure random search for a solution to a
problem. A <Acronym>GA</Acronym> uses stochastic processes, but the result is distinctly
non-random (better than random).

<ProgramListing>
Structured Diagram of a <Acronym>GA</Acronym>:
---------------------------

P(t)    generation of ancestors at a time t
P''(t)  generation of descendants at a time t

+=========================================+
|>>>>>>>>>>>  Algorithm GA  <<<<<<<<<<<<<<|
+=========================================+
| INITIALIZE t := 0                       |
+=========================================+
| INITIALIZE P(t)                         |
+=========================================+
| evalute FITNESS of P(t)                 |
+=========================================+
| while not STOPPING CRITERION do         |
|   +-------------------------------------+
|   | P'(t)  := RECOMBINATION{P(t)}       |
|   +-------------------------------------+
|   | P''(t) := MUTATION{P'(t)}           |
|   +-------------------------------------+
|   | P(t+1) := SELECTION{P''(t) + P(t)}  |
|   +-------------------------------------+
|   | evalute FITNESS of P''(t)           |
|   +-------------------------------------+
|   | t := t + 1                          |
+===+=====================================+
</ProgramListing>
</para>
</sect1>

<Sect1>
<Title>Genetic Query Optimization (<Acronym>GEQO</Acronym>) in Postgres</Title>

<Para>
   The <Acronym>GEQO</Acronym> module is intended for the solution of the query
optimization problem similar to a traveling salesman problem (<Acronym>TSP</Acronym>).
Possible query plans are encoded as integer strings. Each string
represents the <Command>join</Command> order from one relation of the query to the next.
E. g., the query tree
<ProgramListing>
       /\
      /\ 2
     /\ 3
    4  1
</ProgramListing>
is encoded by the integer string '4-1-3-2',
which means, first join relation '4' and '1', then '3', and
then '2', where 1, 2, 3, 4 are relids in <ProductName>Postgres</ProductName>.
</para>

<Para>
   Parts of the <Acronym>GEQO</Acronym> module are adapted from D. Whitley's Genitor
algorithm.
</para>

<Para>
   Specific characteristics of the <Acronym>GEQO</Acronym> implementation in <ProductName>Postgres</ProductName>
are:

<ItemizedList Mark="bullet" Spacing="compact">
<ListItem>
<Para>
Usage of a <FirstTerm>steady state</FirstTerm> <Acronym>GA</Acronym> (replacement of the least fit
   individuals in a population, not whole-generational replacement)
   allows fast convergence towards improved query plans. This is
   essential for query handling with reasonable time;
</Para>
</ListItem>

<ListItem>
<Para>
Usage of <FirstTerm>edge recombination crossover</FirstTerm> which is especially suited
   to keep edge losses low for the solution of the <Acronym>TSP</Acronym> by means of a <Acronym>GA</Acronym>;
</Para>
</ListItem>

<ListItem>
<Para>
Mutation as genetic operator is deprecated so that no repair
   mechanisms are needed to generate legal <Acronym>TSP</Acronym> tours.
</Para>
</ListItem>
</ItemizedList>
</para>

<Para>
   The <Acronym>GEQO</Acronym> module gives the following benefits to the <ProductName>Postgres</ProductName> DBMS
compared to the <ProductName>Postgres</ProductName> query optimizer implementation:

<ItemizedList Mark="bullet" Spacing="compact">
<ListItem>
<Para>
Handling of large <Command>join</Command> queries through non-exhaustive search;
</Para>
</ListItem>

<ListItem>
<Para>
Improved cost size approximation of query plans since no longer
   plan merging is needed (the <Acronym>GEQO</Acronym> module evaluates the cost for a
   query plan as an individual).
</Para>
</ListItem>
</ItemizedList>
</para>

</Sect1>

<Sect1>
<Title>Future Implementation Tasks for <ProductName>Postgres</ProductName> <Acronym>GEQO</Acronym></Title>

<Sect2>
<Title>Basic Improvements</Title>

<Sect3>
<Title>Improve freeing of memory when query is already processed</Title>

<Para>
With large <Command>join</Command> queries the computing time spent for the genetic query
optimization seems to be a mere <Emphasis>fraction</Emphasis> of the time
 <ProductName>Postgres</ProductName>
needs for freeing memory via routine <Function>MemoryContextFree</Function>,
file <FileName>backend/utils/mmgr/mcxt.c</FileName>.
Debugging showed that it get stucked in a loop of routine
<Function>OrderedElemPop</Function>, file <FileName>backend/utils/mmgr/oset.c</FileName>.
The same problems arise with long queries when using the normal
<ProductName>Postgres</ProductName> query optimization algorithm.
</para>
</sect3>

<Sect3>
<Title>Improve genetic algorithm parameter settings</Title>

<Para>
In file <FileName>backend/optimizer/geqo/geqo_params.c</FileName>, routines
<Function>gimme_pool_size</Function> and <Function>gimme_number_generations</Function>,
we have to find a compromise for the parameter settings
to satisfy two competing demands:
<ItemizedList Spacing="compact">
<ListItem>
<Para>
Optimality of the query plan
</Para>
</ListItem>
<ListItem>
<Para>
Computing time
</Para>
</ListItem>
</ItemizedList>
</para>
</sect3>

<Sect3>
<Title>Find better solution for integer overflow</Title>

<Para>
In file <FileName>backend/optimizer/geqo/geqo_eval.c</FileName>, routine
<Function>geqo_joinrel_size</Function>,
the present hack for MAXINT overflow is to set the <ProductName>Postgres</ProductName> integer
value of <StructField>rel->size</StructField> to its logarithm.
Modifications of <StructName>Rel</StructName> in <FileName>backend/nodes/relation.h</FileName> will
surely have severe impacts on the whole <ProductName>Postgres</ProductName> implementation.
</para>
</sect3>

<Sect3>
<Title>Find solution for exhausted memory</Title>

<Para>
Memory exhaustion may occur with more than 10 relations involved in a query.
In file <FileName>backend/optimizer/geqo/geqo_eval.c</FileName>, routine
<Function>gimme_tree</Function> is recursively called.
Maybe I forgot something to be freed correctly, but I dunno what.
Of course the <StructName>rel</StructName> data structure of the <Command>join</Command> keeps growing and
growing the more relations are packed into it.
Suggestions are welcome :-(
</para>
</sect3>
</sect2>


<BIBLIOGRAPHY Id="geqo-biblio">
<TITLE>
References
</TITLE>
<PARA>Reference information for <Acronym>GEQ</Acronym> algorithms.
</PARA>
<BIBLIOENTRY>

<BOOKBIBLIO>
<TITLE>
The Hitch-Hiker's Guide to Evolutionary Computation
</TITLE>
<AUTHORGROUP>
<AUTHOR>
<FIRSTNAME>J&ouml;rg</FIRSTNAME>
<SURNAME>Heitk&ouml;tter</SURNAME>
</AUTHOR>
<AUTHOR>
<FIRSTNAME>David</FIRSTNAME>
<SURNAME>Beasley</SURNAME>
</AUTHOR>
</AUTHORGROUP>
<PUBLISHER>
<PUBLISHERNAME>
InterNet resource
</PUBLISHERNAME>
</PUBLISHER>
<ABSTRACT>
<Para>
FAQ in <ULink url="news://comp.ai.genetic">comp.ai.genetic</ULink>
is available at <ULink url="ftp://ftp.Germany.EU.net/pub/research/softcomp/EC/Welcome.html">Encore</ULink>.
</Para>
</ABSTRACT>
</BOOKBIBLIO>

<BOOKBIBLIO>
<TITLE>
The Design and Implementation of the Postgres Query Optimizer
</TITLE>
<AUTHORGROUP>
<AUTHOR>
<FIRSTNAME>Z.</FIRSTNAME>
<SURNAME>Fong</SURNAME>
</AUTHOR>
</AUTHORGROUP>
<PUBLISHER>
<PUBLISHERNAME>
University of California, Berkeley Computer Science Department
</PUBLISHERNAME>
</PUBLISHER>
<ABSTRACT>
<Para>
File <FileName>planner/Report.ps</FileName> in the 'postgres-papers' distribution.
</Para>
</ABSTRACT>
</BOOKBIBLIO>

<BOOKBIBLIO>
<TITLE>
Fundamentals of Database Systems
</TITLE>
<AUTHORGROUP>
<AUTHOR>
<FIRSTNAME>R.</FIRSTNAME>
<SURNAME>Elmasri</SURNAME>
</AUTHOR>
<AUTHOR>
<FIRSTNAME>S.</FIRSTNAME>
<SURNAME>Navathe</SURNAME>
</AUTHOR>
</AUTHORGROUP>
<PUBLISHER>
<PUBLISHERNAME>
The Benjamin/Cummings Pub., Inc.
</PUBLISHERNAME>
</PUBLISHER>
</BOOKBIBLIO>

</BIBLIOENTRY>
</BIBLIOGRAPHY>

</sect1>
</Chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode: sgml
sgml-omittag:nil
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-default-dtd-file:"./reference.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:"/usr/lib/sgml/catalog"
sgml-local-ecat-files:nil
End:
-->