From ddb93cac24cfe810e9c94df7f03facc1d07725fd Mon Sep 17 00:00:00 2001
From: Tom Lane
Date: Sat, 21 Jul 2007 04:02:41 +0000
Subject: [PATCH] Provide a bit more high-level documentation for the GEQO planner. Per request from Luca Ferrari.

---
 doc/src/sgml/arch-dev.sgml | 48 +++++++++++++++++++++----------
 doc/src/sgml/geqo.sgml     | 58 ++++++++++++++++++++++++++++++++++----
 2 files changed, 85 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml
index c861a656e90..7ee1ba357f0 100644
--- a/doc/src/sgml/arch-dev.sgml
+++ b/doc/src/sgml/arch-dev.sgml
@@ -1,4 +1,4 @@
-
+
  Overview of PostgreSQL Internals
@@ -345,9 +345,10 @@
  can be executed would take an excessive amount of time and memory space. In particular, this occurs when executing queries involving large numbers of join operations. In order to determine
- a reasonable (not optimal) query plan in a reasonable amount of
- time, PostgreSQL uses a .
+ a reasonable (not necessarily optimal) query plan in a reasonable amount
+ of time, PostgreSQL uses a when the number of joins
+ exceeds a threshold (see ).
@@ -380,20 +381,17 @@
  the index's operator class, another plan is created using the B-tree index to scan the relation. If there are further indexes present and the restrictions in the query happen to match a key of an
- index further plans will be considered.
+ index, further plans will be considered. Index scan plans are also
+ generated for indexes that have a sort ordering that can match the
+ query's ORDER BY clause (if any), or a sort ordering that
+ might be useful for merge joining (see below).
- After all feasible plans have been found for scanning single relations,
- plans for joining relations are created. The planner/optimizer
- preferentially considers joins between any two relations for which there
- exist a corresponding join clause in the WHERE qualification (i.e. for
- which a restriction like where rel1.attr1=rel2.attr2
- exists). Join pairs with no join clause are considered only when there
- is no other choice, that is, a particular relation has no available
- join clauses to any other relation. All possible plans are generated for
- every join pair considered
- by the planner/optimizer. The three possible join strategies are:
+ If the query requires joining two or more relations,
+ plans for joining relations are considered
+ after all feasible plans have been found for scanning single relations.
+ The three available join strategies are:
@@ -439,6 +437,26 @@
  cheapest one.
+
+ If the query uses fewer than
+ relations, a near-exhaustive search is conducted to find the best
+ join sequence. The planner preferentially considers joins between any
+ two relations for which there exist a corresponding join clause in the
+ WHERE qualification (i.e. for
+ which a restriction like where rel1.attr1=rel2.attr2
+ exists). Join pairs with no join clause are considered only when there
+ is no other choice, that is, a particular relation has no available
+ join clauses to any other relation. All possible plans are generated for
+ every join pair considered by the planner, and the one that is
+ (estimated to be) the cheapest is chosen.
+
+
+
+ When geqo_threshold is exceeded, the join
+ sequences considered are determined by heuristics, as described
+ in . Otherwise the process is the same.
+
+
  The finished plan tree consists of sequential or index scans of the base relations, plus nested-loop, merge, or hash join nodes as
diff --git a/doc/src/sgml/geqo.sgml b/doc/src/sgml/geqo.sgml
index 6225dc4c321..2f680762c13 100644
--- a/doc/src/sgml/geqo.sgml
+++ b/doc/src/sgml/geqo.sgml
@@ -1,4 +1,4 @@
-
+
@@ -186,11 +186,6 @@
  PostgreSQL optimizer.
-
- Parts of the GEQO module are adapted from D. Whitley's Genitor
- algorithm.
-
-
  Specific characteristics of the GEQO implementation in PostgreSQL
@@ -224,6 +219,11 @@
+
+ Parts of the GEQO module are adapted from D. Whitley's
+ Genitor algorithm.
+
+
  The GEQO module allows the PostgreSQL query optimizer to
@@ -231,6 +231,42 @@
  non-exhaustive search.
+
+ Generating Possible Plans with <acronym>GEQO</acronym>
+
+
+ The GEQO planning process uses the standard planner
+ code to generate plans for scans of individual relations. Then join
+ plans are developed using the genetic approach. As shown above, each
+ candidate join plan is represented by a sequence in which to join
+ the base relations. In the initial stage, the GEQO
+ code simply generates some possible join sequences at random. For each
+ join sequence considered, the standard planner code is invoked to
+ estimate the cost of performing the query using that join sequence.
+ (For each step of the join sequence, all three possible join strategies
+ are considered; and all the initially-determined relation scan plans
+ are available. The estimated cost is the cheapest of these
+ possibilities.) Join sequences with lower estimated cost are considered
+ more fit than those with higher cost. The genetic algorithm
+ discards the least fit candidates. Then new candidates are generated
+ by combining genes of more-fit candidates — that is, by using
+ randomly-chosen portions of known low-cost join sequences to create
+ new sequences for consideration. This process is repeated until a
+ preset number of join sequences have been considered; then the best
+ one found at any time during the search is used to generate the finished
+ plan.
+
+
+
+ This process is inherently nondeterministic, because of the randomized
+ choices made during both the initial population selection and subsequent
+ mutation of the best candidates. Hence different plans may
+ be selected from one run to the next, resulting in varying run time
+ and varying output row order.
+
+
+
+
  Future Implementation Tasks for <productname>PostgreSQL</> <acronym>GEQO</acronym>
@@ -257,6 +293,16 @@
+
+ In the current implementation, the fitness of each candidate join
+ sequence is estimated by running the standard planner's join selection
+ and cost estimation code from scratch. To the extent that different
+ candidates use similar sub-sequences of joins, a great deal of work
+ will be repeated. This could be made significantly faster by retaining
+ cost estimates for sub-joins. The problem is to avoid expending
+ unreasonable amounts of memory on retaining that state.
+
+
  At a more basic level, it is not clear that solving query optimization with a GA algorithm designed for TSP is appropriate. In the TSP case,