From ddb93cac24cfe810e9c94df7f03facc1d07725fd Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 21 Jul 2007 04:02:41 +0000
Subject: [PATCH] Provide a bit more high-level documentation for the GEQO
 planner. Per request from Luca Ferrari.

---
 doc/src/sgml/arch-dev.sgml | 48 +++++++++++++++++++++----------
 doc/src/sgml/geqo.sgml     | 58 ++++++++++++++++++++++++++++++++++----
 2 files changed, 85 insertions(+), 21 deletions(-)
diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml
index c861a656e90..7ee1ba357f0 100644
--- a/doc/src/sgml/arch-dev.sgml
+++ b/doc/src/sgml/arch-dev.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.29 2007/01/31 20:56:16 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.30 2007/07/21 04:02:41 tgl Exp $ -->
 
  <chapter id="overview">
   <title>Overview of PostgreSQL Internals</title>
@@ -345,9 +345,10 @@
      can be executed would take an excessive amount of time and memory
      space. In particular, this occurs when executing queries
      involving large numbers of join operations. In order to determine
-     a reasonable (not optimal) query plan in a reasonable amount of
-     time, <productname>PostgreSQL</productname> uses a <xref
-     linkend="geqo" endterm="geqo-title">.
+     a reasonable (not necessarily optimal) query plan in a reasonable amount
+     of time, <productname>PostgreSQL</productname> uses a <xref
+     linkend="geqo" endterm="geqo-title"> when the number of joins
+     exceeds a threshold (see <xref linkend="guc-geqo-threshold">).
     </para>
    </note>
 
@@ -380,20 +381,17 @@
      the index's <firstterm>operator class</>, another plan is created using
      the B-tree index to scan the relation. If there are further indexes
      present and the restrictions in the query happen to match a key of an
-     index further plans will be considered.
+     index, further plans will be considered.  Index scan plans are also
+     generated for indexes that have a sort ordering that can match the
+     query's <literal>ORDER BY</> clause (if any), or a sort ordering that
+     might be useful for merge joining (see below).
     </para>
 
     <para>
-     After all feasible plans have been found for scanning single relations,
-     plans for joining relations are created. The planner/optimizer
-     preferentially considers joins between any two relations for which there
-     exist a corresponding join clause in the <literal>WHERE</literal> qualification (i.e. for
-     which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
-     exists). Join pairs with no join clause are considered only when there
-     is no other choice, that is, a particular relation has no available
-     join clauses to any other relation. All possible plans are generated for
-     every join pair considered
-     by the planner/optimizer. The three possible join strategies are:
+     If the query requires joining two or more relations,
+     plans for joining relations are considered
+     after all feasible plans have been found for scanning single relations.
+     The three available join strategies are:
 
      <itemizedlist>
       <listitem>
@@ -439,6 +437,26 @@
      cheapest one.
     </para>
 
+    <para>
+     If the query uses fewer than <xref linkend="guc-geqo-threshold">
+     relations, a near-exhaustive search is conducted to find the best
+     join sequence.  The planner preferentially considers joins between any
+     two relations for which there exists a corresponding join clause in the
+     <literal>WHERE</literal> qualification (i.e. for
+     which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
+     exists). Join pairs with no join clause are considered only when there
+     is no other choice, that is, a particular relation has no available
+     join clauses to any other relation. All possible plans are generated for
+     every join pair considered by the planner, and the one that is
+     (estimated to be) the cheapest is chosen.
+    </para>
+
+    <para>
+     When <varname>geqo_threshold</varname> is reached or exceeded, the join
+     sequences considered are determined by heuristics, as described
+     in <xref linkend="geqo">.  Otherwise the process is the same.
+    </para>
+
     <para>
      The finished plan tree consists of sequential or index scans of
      the base relations, plus nested-loop, merge, or hash join nodes as
diff --git a/doc/src/sgml/geqo.sgml b/doc/src/sgml/geqo.sgml
index 6225dc4c321..2f680762c13 100644
--- a/doc/src/sgml/geqo.sgml
+++ b/doc/src/sgml/geqo.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.39 2007/02/16 03:50:29 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.40 2007/07/21 04:02:41 tgl Exp $ -->
 
  <chapter id="geqo">
   <chapterinfo>
@@ -186,11 +186,6 @@
     <productname>PostgreSQL</productname> optimizer.
    </para>
 
-   <para>
-    Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's Genitor
-    algorithm.
-   </para>
-
    <para>
     Specific characteristics of the <acronym>GEQO</acronym>
     implementation in <productname>PostgreSQL</productname>
@@ -224,6 +219,11 @@
     </itemizedlist>
    </para>
 
+   <para>
+    Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
+    Genitor algorithm.
+   </para>
+
    <para>
     The <acronym>GEQO</acronym> module allows
     the <productname>PostgreSQL</productname> query optimizer to
@@ -231,6 +231,42 @@
     non-exhaustive search.
    </para>
 
+  <sect2>
+   <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
+
+   <para>
+    The <acronym>GEQO</acronym> planning process uses the standard planner
+    code to generate plans for scans of individual relations.  Then join
+    plans are developed using the genetic approach.  As shown above, each
+    candidate join plan is represented by a sequence in which to join
+    the base relations.  In the initial stage, the <acronym>GEQO</acronym>
+    code simply generates some possible join sequences at random.  For each
+    join sequence considered, the standard planner code is invoked to
+    estimate the cost of performing the query using that join sequence.
+    (For each step of the join sequence, all three possible join strategies
+    are considered, and all the initially determined relation scan plans
+    are available.  The estimated cost is the cheapest of these
+    possibilities.)  Join sequences with lower estimated cost are considered
+    <quote>more fit</> than those with higher cost.  The genetic algorithm
+    discards the least fit candidates.  Then new candidates are generated
+    by combining genes of more-fit candidates &mdash; that is, by using
+    randomly-chosen portions of known low-cost join sequences to create
+    new sequences for consideration.  This process is repeated until a
+    preset number of join sequences have been considered; then the best
+    one found at any time during the search is used to generate the finished
+    plan.
+   </para>
+
+   <para>
+    This process is inherently nondeterministic, because of the randomized
+    choices made during both the initial population selection and subsequent
+    <quote>mutation</> of the best candidates.  Hence different plans may
+    be selected from one run to the next, resulting in varying run time
+    and varying output row order.
+   </para>
+
+  </sect2>
+
   <sect2 id="geqo-future">
    <title>Future Implementation Tasks for
     <productname>PostgreSQL</> <acronym>GEQO</acronym></title>
@@ -257,6 +293,16 @@
       </itemizedlist>
      </para>
 
+     <para>
+      In the current implementation, the fitness of each candidate join
+      sequence is estimated by running the standard planner's join selection
+      and cost estimation code from scratch.  To the extent that different
+      candidates use similar sub-sequences of joins, a great deal of work
+      will be repeated.  Planning could be made significantly faster by retaining
+      cost estimates for sub-joins.  The problem is to avoid expending
+      unreasonable amounts of memory on retaining that state.
+     </para>
+
      <para>
       At a more basic level, it is not clear that solving query optimization
       with a GA algorithm designed for TSP is appropriate.  In the TSP case,