From 76e386d5e4d7150932e30336a559cbe3c3658ec3 Mon Sep 17 00:00:00 2001 From: Bruce Momjian Date: Sat, 24 May 2003 03:59:06 +0000 Subject: [PATCH] Add cost estimate discussion to TODO.detail. --- doc/TODO.detail/optimizer | 403 +++++++++++++++++++++++++++++++++++++- 1 file changed, 402 insertions(+), 1 deletion(-) diff --git a/doc/TODO.detail/optimizer b/doc/TODO.detail/optimizer index 194ca349d3..fd6324f367 100644 --- a/doc/TODO.detail/optimizer +++ b/doc/TODO.detail/optimizer @@ -1059,7 +1059,7 @@ From owner-pgsql-hackers@hub.org Thu Jan 20 18:45:32 2000 Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672 for ; Thu, 20 Jan 2000 19:45:30 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.19 $) with ESMTP id TAA01989 for ; Thu, 20 Jan 2000 19:39:15 -0500 (EST) +Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.20 $) with ESMTP id TAA01989 for ; Thu, 20 Jan 2000 19:39:15 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.9.3/8.9.3) with SMTP id TAA00957; Thu, 20 Jan 2000 19:35:19 -0500 (EST) @@ -2003,3 +2003,404 @@ your stats be out-of-date or otherwise misleading. regards, tom lane +From pgsql-hackers-owner+M29943@postgresql.org Thu Oct 3 18:18:27 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MIOU23771 + for ; Thu, 3 Oct 2002 18:18:25 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP + id B9F51476570; Thu, 3 Oct 2002 18:18:21 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id E083B4761B0; Thu, 3 Oct 2002 18:18:19 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP id 13ADC476063 + for ; Thu, 3 Oct 2002 18:18:17 -0400 (EDT) +Received: from acorn.he.net (acorn.he.net [64.71.137.130]) + by postgresql.org (Postfix) with ESMTP id 3AEC8475FFF + for ; Thu, 3 Oct 2002 18:18:16 -0400 (EDT) +Received: from CurtisVaio ([63.164.0.47] (may be forged)) by acorn.he.net (8.8.6/8.8.2) with SMTP id PAA19215; Thu, 3 Oct 2002 15:18:14 -0700 +From: "Curtis Faith" +To: "Tom Lane" +cc: "Pgsql-Hackers" +Subject: Re: [HACKERS] Advice: Where could I be of help? +Date: Thu, 3 Oct 2002 18:17:55 -0400 +Message-ID: +MIME-Version: 1.0 +Content-Type: text/plain; + charset="iso-8859-1" +Content-Transfer-Encoding: 7bit +X-Priority: 3 (Normal) +X-MSMail-Priority: Normal +X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) +In-Reply-To: <13379.1033675158@sss.pgh.pa.us> +X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700 +Importance: Normal +X-Virus-Scanned: by AMaViS new-20020517 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Virus-Scanned: by AMaViS new-20020517 +Status: OR + +tom lane wrote: +> But more globally, I think that our worst problems these days have to do +> with planner misestimations leading to bad plans. The planner is +> usually *capable* of generating a good plan, but all too often it picks +> the wrong one. We need work on improving the cost modeling equations +> to be closer to reality. If that's at all close to your sphere of +> interest then I think it should be #1 priority --- it's both localized, +> which I think is important for a first project, and potentially a +> considerable win. + +This seems like a very interesting problem. 
One of the ways I thought would be interesting, and would solve the problem of
+trying to figure out the right numbers, is to make initial guesses for the
+actual values based on statistics gathered during vacuum and general running,
+and then have the planner run the "best" plan.
+
+Then, during execution, if the planner turned out to be VERY wrong about
+certain assumptions, the execution system could update the stats that led to
+those wrong assumptions. That way the system would seek the correct values
+automatically. We could also gather the stats that the system produces for
+certain actual databases and then use those to make smarter initial guesses.
+
+I've found that I can never predict costs. I always end up testing
+empirically and find myself surprised at the results.
+
+We should be able to make the executor smart enough to keep count of actual
+costs (or a statistical approximation) without introducing any significant
+overhead.
+
+tom lane also wrote:
+> There is no "cache flushing". We have a shared buffer cache management
+> algorithm that's straight LRU across all buffers. There's been some
+> interest in smarter cache-replacement code; I believe Neil Conway is
+> messing around with an LRU-2 implementation right now. If you've got
+> better ideas we're all ears.
+
+Hmmm, this is the area that I think could lead to huge performance gains.
+
+Consider a simple system with a table tbl_master that gets read by each
+process many times but sees very infrequent inserts and contains about
+3,000 rows. The single but heavily used index for this table is contained in
+a btree of depth three, with 20 8K pages in the first two levels of
+the btree.
+
+Another table, tbl_detail, has 10 indices and gets very frequent inserts.
+It contains over 300,000 rows. Some queries result in index scans over the
+approximately 5,000 8K pages in the index.
+
+There is a 40M shared cache for this system.
+
+Every time a query which requires the index scan runs, it will blow out the
+entire cache, since the scan will load more blocks than the cache holds. Only
+blocks that are accessed while the scan is running will survive. LRU is bad,
+bad, bad!
+
+LRU-2 might be better, but it seems like it still won't give enough priority
+to the most frequently used blocks. I don't see how it would do better for
+the above case.
+
+I once implemented a modified cache algorithm that was based on the clock
+algorithm for VM page caches. VM paging is similar to databases in that
+there is definite locality of reference and certain pages are MUCH more
+likely to be requested.
+
+The basic idea was to have a flag in each block that represented the access
+time in clock intervals. Imagine a clock hand sweeping across a clock; every
+access is like a tiny movement of the clock hand. Blocks that are not
+accessed during a sweep are candidates for removal.
+
+My modification was to use access counts to increase the durability of the
+more accessed blocks. Each time a block is accessed, its flag is shifted
+left (up to a maximum number of shifts, ShiftN) and 1 is added to it.
+Every so many cache accesses (and synchronously when the cache is full), a
+pass is made over each block, right-shifting the flags (a clock sweep). This
+can also be done one block at a time on each access, so the clock is directly
+linked to the cache access rate. Any blocks with a flag of 0 are placed into
+a doubly linked list of candidates for removal. New cache blocks are
+allocated from the list of candidates, and accessing a block on the
+candidate list simply removes it from the list.
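+
+[A rough, self-contained C sketch of the flag-shifting clock scheme described
+above. All of the names here --- CacheBlock, SHIFT_N, block_accessed, and so
+on --- are invented for illustration; this is not taken from PostgreSQL or
+from the implementation described in the message.]
+
+    #include <stdbool.h>
+    #include <stddef.h>
+
+    #define SHIFT_N   5                       /* maximum number of left shifts */
+    #define FLAG_MAX  ((1 << SHIFT_N) - 1)
+
+    typedef struct CacheBlock
+    {
+        int  flag;                            /* shifted access count */
+        bool on_free_list;                    /* true while on candidate list */
+        struct CacheBlock *next, *prev;       /* doubly linked candidate list */
+    } CacheBlock;
+
+    /* Dummy head of the circular candidate list. */
+    static CacheBlock candidates = { 0, false, &candidates, &candidates };
+
+    static void
+    list_remove(CacheBlock *blk)
+    {
+        blk->prev->next = blk->next;
+        blk->next->prev = blk->prev;
+        blk->on_free_list = false;
+    }
+
+    static void
+    list_append(CacheBlock *blk)
+    {
+        blk->prev = candidates.prev;
+        blk->next = &candidates;
+        candidates.prev->next = blk;
+        candidates.prev = blk;
+        blk->on_free_list = true;
+    }
+
+    /* On every cache hit: shift the flag left (capped at SHIFT_N bits)
+     * and add one, so frequently hit blocks survive more sweeps. */
+    static void
+    block_accessed(CacheBlock *blk)
+    {
+        if (blk->on_free_list)
+            list_remove(blk);
+        blk->flag = ((blk->flag << 1) | 1) & FLAG_MAX;
+    }
+
+    /* One clock-sweep step: decay the flag; a block that reaches zero
+     * becomes a candidate for replacement. */
+    static void
+    sweep_block(CacheBlock *blk)
+    {
+        blk->flag >>= 1;
+        if (blk->flag == 0 && !blk->on_free_list)
+            list_append(blk);
+    }
+
+    /* Replacement victims come from the head of the candidate list;
+     * if it is empty the caller must force another sweep. */
+    static CacheBlock *
+    get_victim(void)
+    {
+        CacheBlock *blk;
+
+        if (candidates.next == &candidates)
+            return NULL;
+        blk = candidates.next;
+        list_remove(blk);
+        return blk;
+    }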
+
+An index root node page would likely be accessed frequently enough that all
+its bits would stay set, so it would take ShiftN clock sweeps before it could
+become a candidate for removal.
+
+This algorithm increased the cache hit ratio from 40% to about 90% for the
+cases I tested, when compared to a simple LRU mechanism. The paging ratio is
+greatly dependent on the ratio of the actual database size to the cache
+size.
+
+The bottom line is that it is very important to keep frequently accessed
+blocks in the cache. The top levels of large btrees are accessed many
+hundreds (actually a power of the number of keys in each page) of times more
+frequently than the leaf pages. LRU can be the worst possible algorithm for
+something like an index or table scan of a large table, since it flushes a
+large number of potentially frequently accessed blocks in favor of ones that
+are very unlikely to be retrieved again.
+
+tom lane also wrote:
+> This is an interesting area. Keep in mind though that Postgres is a
+> portable DB that tries to be agnostic about what kernel and filesystem
+> it's sitting on top of --- and in any case it does not run as root, so
+> has very limited ability to affect what the kernel/filesystem do.
+> I'm not sure how much can be done without losing those portability
+> advantages.
+
+The kinds of things I was thinking about should be very portable. I found
+that simply writing the cache out in order of file system offset results in
+greatly improved performance, since it lets the head seek in smaller
+increments and much more smoothly, especially with modern disks. Most of the
+time the file system will lay a file out as large sequential runs of bytes
+on the physical disk, in order. It might be in a few chunks, but those chunks
+will be sequential and fairly large. (A rough sketch of this kind of write
+ordering appears at the end of this message.)
+
+tom lane also wrote:
+> Well, not really all that isolated. The bottom-level index code doesn't
+> know whether you're doing INSERT or UPDATE, and would have no easy
+> access to the original tuple if it did know. The original theory about
+> this was that the planner could detect the situation where the index(es)
+> don't overlap the set of columns being changed by the UPDATE, which
+> would be nice since there'd be zero runtime overhead. Unfortunately
+> that breaks down if any BEFORE UPDATE triggers are fired that modify the
+> tuple being stored. So all in all it turns out to be a tad messy to fit
+> this in :-(. I am unconvinced that the impact would be huge anyway,
+> especially as of 7.3 which has a shortcut path for dead index entries.
+
+Well, this probably is not the right place to start then.
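+
+[A rough C sketch of the write-ordering idea mentioned above: collect the
+dirty blocks, sort them by file and block offset, and issue the writes in
+that order. The structure and names (DirtyBuf, write_block, and so on) are
+invented for illustration and do not correspond to the actual PostgreSQL
+buffer manager.]
+
+    #include <stdlib.h>
+
+    typedef struct DirtyBuf
+    {
+        int   file_id;           /* which relation file the block lives in */
+        long  block_num;         /* block offset within that file */
+        char *data;              /* the 8K page image */
+    } DirtyBuf;
+
+    /* Hypothetical stand-in for the real low-level write routine. */
+    extern void write_block(int file_id, long block_num, const char *data);
+
+    /* Order by file, then by physical offset within the file. */
+    static int
+    cmp_dirty(const void *a, const void *b)
+    {
+        const DirtyBuf *x = (const DirtyBuf *) a;
+        const DirtyBuf *y = (const DirtyBuf *) b;
+
+        if (x->file_id != y->file_id)
+            return (x->file_id < y->file_id) ? -1 : 1;
+        if (x->block_num != y->block_num)
+            return (x->block_num < y->block_num) ? -1 : 1;
+        return 0;
+    }
+
+    /* Flush dirty buffers in ascending file-offset order so the disk
+     * head sweeps in one direction instead of seeking back and forth. */
+    static void
+    flush_sorted(DirtyBuf *bufs, size_t nbufs)
+    {
+        size_t i;
+
+        qsort(bufs, nbufs, sizeof(DirtyBuf), cmp_dirty);
+        for (i = 0; i < nbufs; i++)
+            write_block(bufs[i].file_id, bufs[i].block_num, bufs[i].data);
+    }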
+ +- Curtis + + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster + +From pgsql-hackers-owner+M29945@postgresql.org Thu Oct 3 18:47:34 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MlWU26068 + for ; Thu, 3 Oct 2002 18:47:32 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP + id F2AAE476306; Thu, 3 Oct 2002 18:47:27 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id E7B5247604F; Thu, 3 Oct 2002 18:47:24 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP id 9ADCC4761A1 + for ; Thu, 3 Oct 2002 18:47:18 -0400 (EDT) +Received: from sss.pgh.pa.us (unknown [192.204.191.242]) + by postgresql.org (Postfix) with ESMTP id DDB0B476187 + for ; Thu, 3 Oct 2002 18:47:17 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.12.5/8.12.5) with ESMTP id g93MlIhR015091; + Thu, 3 Oct 2002 18:47:18 -0400 (EDT) +To: "Curtis Faith" +cc: "Pgsql-Hackers" +Subject: Re: [HACKERS] Advice: Where could I be of help? +In-Reply-To: +References: +Comments: In-reply-to "Curtis Faith" + message dated "Thu, 03 Oct 2002 18:17:55 -0400" +Date: Thu, 03 Oct 2002 18:47:17 -0400 +Message-ID: <15090.1033685237@sss.pgh.pa.us> +From: Tom Lane +X-Virus-Scanned: by AMaViS new-20020517 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Virus-Scanned: by AMaViS new-20020517 +Status: OR + +"Curtis Faith" writes: +> Then during execution if the planner turned out to be VERY wrong about +> certain assumptions the execution system could update the stats that led to +> those wrong assumptions. That way the system would seek the correct values +> automatically. + +That has been suggested before, but I'm unsure how to make it work. +There are a lot of parameters involved in any planning decision and it's +not obvious which ones to tweak, or in which direction, if the plan +turns out to be bad. But if you can come up with some ideas, go to +it! + +> Everytime a query which requires the index scan runs it will blow out the +> entire cache since the scan will load more blocks than the cache +> holds. + +Right, that's the scenario that kills simple LRU ... + +> LRU-2 might be better but it seems like it still won't give enough priority +> to the most frequently used blocks. + +Blocks touched more than once per query (like the upper-level index +blocks) will survive under LRU-2. Blocks touched once per query won't. +Seems to me that it should be a win. + +> My modification was to use access counts to increase the durability of the +> more accessed blocks. + +You could do it that way too, but I'm unsure whether the extra +complexity will buy anything. Ultimately, I think an LRU-anything +algorithm is equivalent to a clock sweep for those pages that only get +touched once per some-long-interval: the single-touch guys get recycled +in order of last use, which seems just like a clock sweep around the +cache. The guys with some amount of preference get excluded from the +once-around sweep. To determine whether LRU-2 is better or worse than +some other preference algorithm requires a finer grain of analysis than +this. I'm not a fan of "more complex must be better", so I'd want to see +why it's better before buying into it ... 
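+
+[For reference, the core of the LRU-2 policy under discussion, as a small
+illustrative C fragment --- not Neil Conway's implementation, just the basic
+rule: each buffer remembers its last two access times, and the victim is the
+buffer whose second-most-recent access is oldest, so blocks touched only once
+by a big scan are recycled before the repeatedly touched index pages.]
+
+    #include <stddef.h>
+
+    #define NEVER 0                 /* "no second-to-last access yet" */
+
+    typedef struct Buf
+    {
+        unsigned long last_access;  /* most recent access time */
+        unsigned long prev_access;  /* second-most-recent access, or NEVER */
+    } Buf;
+
+    static unsigned long clock_tick = 0;
+
+    /* Record an access: the previous "last" becomes the penultimate. */
+    static void
+    lru2_touch(Buf *b)
+    {
+        b->prev_access = b->last_access;
+        b->last_access = ++clock_tick;
+    }
+
+    /* Pick the buffer with the oldest second-to-last access time,
+     * breaking ties by the last access time. Single-touch buffers
+     * (prev_access == NEVER) sort first, so they get recycled in
+     * order of last use --- the "once-around sweep" described above. */
+    static Buf *
+    lru2_victim(Buf *bufs, size_t nbufs)
+    {
+        Buf   *victim = &bufs[0];
+        size_t i;
+
+        for (i = 1; i < nbufs; i++)
+        {
+            if (bufs[i].prev_access < victim->prev_access ||
+                (bufs[i].prev_access == victim->prev_access &&
+                 bufs[i].last_access < victim->last_access))
+                victim = &bufs[i];
+        }
+        return victim;
+    }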
+ +> The kinds of things I was thinking about should be very portable. I found +> that simply writing the cache in order of the file system offset results in +> very greatly improved performance since it lets the head seek in smaller +> increments and much more smoothly, especially with modern disks. + +Shouldn't the OS be responsible for scheduling those writes +appropriately? Ye good olde elevator algorithm ought to handle this; +and it's at least one layer closer to the actual disk layout than we +are, thus more likely to issue the writes in a good order. It's worth +experimenting with, perhaps, but I'm pretty dubious about it. + +BTW, one other thing that Vadim kept saying we should do is alter the +cache management strategy to retain dirty blocks in memory (ie, give +some amount of preference to as-yet-unwritten dirty pages compared to +clean pages). There is no reliability cost here since the WAL will let +us reconstruct any dirty pages if we crash before they get written; and +the periodic checkpoints will ensure that we eventually write a dirty +block and thus it will become available for recycling. This seems like +a promising line of thought that's orthogonal to the basic +LRU-vs-whatever issue. Nobody's got round to looking at it yet though. +I've got no idea how much preference should be given to a dirty block +--- not infinite, probably, but some. + + regards, tom lane + +---------------------------(end of broadcast)--------------------------- +TIP 5: Have you checked our extensive FAQ? + +http://www.postgresql.org/users-lounge/docs/faq.html + +From pgsql-hackers-owner+M29974@postgresql.org Fri Oct 4 01:28:54 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g945SpU13476 + for ; Fri, 4 Oct 2002 01:28:52 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP + id 63999476BB2; Fri, 4 Oct 2002 01:26:56 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id BB7CA476B85; Fri, 4 Oct 2002 01:26:54 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP id 5FD7E476759 + for ; Fri, 4 Oct 2002 01:26:52 -0400 (EDT) +Received: from mclean.mail.mindspring.net (mclean.mail.mindspring.net [207.69.200.57]) + by postgresql.org (Postfix) with ESMTP id 1F4A14766D8 + for ; Fri, 4 Oct 2002 01:26:51 -0400 (EDT) +Received: from 1cust163.tnt1.st-thomas.vi.da.uu.net ([200.58.4.163] helo=CurtisVaio) + by mclean.mail.mindspring.net with smtp (Exim 3.33 #1) + id 17xKzB-0000yK-00; Fri, 04 Oct 2002 01:26:49 -0400 +From: "Curtis Faith" +To: "Tom Lane" +cc: "Pgsql-Hackers" +Subject: Re: [HACKERS] Advice: Where could I be of help? +Date: Fri, 4 Oct 2002 01:26:36 -0400 +Message-ID: +MIME-Version: 1.0 +Content-Type: text/plain; + charset="iso-8859-1" +Content-Transfer-Encoding: 7bit +X-Priority: 3 (Normal) +X-MSMail-Priority: Normal +X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) +In-Reply-To: <15090.1033685237@sss.pgh.pa.us> +X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700 +Importance: Normal +X-Virus-Scanned: by AMaViS new-20020517 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Virus-Scanned: by AMaViS new-20020517 +Status: OR + +I wrote: + +> > My modification was to use access counts to increase the +> durability of the +> > more accessed blocks. 
+>
+
+tom lane replies:
+> You could do it that way too, but I'm unsure whether the extra
+> complexity will buy anything. Ultimately, I think an LRU-anything
+> algorithm is equivalent to a clock sweep for those pages that only get
+> touched once per some-long-interval: the single-touch guys get recycled
+> in order of last use, which seems just like a clock sweep around the
+> cache. The guys with some amount of preference get excluded from the
+> once-around sweep. To determine whether LRU-2 is better or worse than
+> some other preference algorithm requires a finer grain of analysis than
+> this. I'm not a fan of "more complex must be better", so I'd want to see
+> why it's better before buying into it ...
+
+I'm definitely not a fan of "more complex must be better" either. In fact,
+it's surprising how often the real performance problems have simple, easy
+fixes, while many person-years are spent solving the issue everyone "knows"
+must be causing the performance problems, only to find little gain.
+
+The key here is empirical testing. If the cache hit ratio for LRU-2 is
+much better, then there may be no need here. OTOH, it took less than
+30 lines or so of code to do what I described, so I don't consider it
+too, too "more complex" :=} We should run a test that includes index
+scans over indexes (or is "indices" the PostgreSQL convention?) that are
+three or more times the size of the cache, to see how well LRU-2 works.
+Is there any cache performance reporting built into pgsql? (A sketch of
+the kind of counters I have in mind follows at the end of this message.)
+
+tom lane wrote:
+> Shouldn't the OS be responsible for scheduling those writes
+> appropriately? Ye good olde elevator algorithm ought to handle this;
+> and it's at least one layer closer to the actual disk layout than we
+> are, thus more likely to issue the writes in a good order. It's worth
+> experimenting with, perhaps, but I'm pretty dubious about it.
+
+I wasn't proposing anything other than changing the order of the writes,
+not actually ensuring that they get written that way at the level you
+describe above. This will help a lot on brain-dead file systems that
+can't do this ordering themselves, and probably also in cases where the
+number of blocks in the cache is very large.
+
+On a related note, while looking at the code, it seems to me that we
+are writing out the buffer cache synchronously, so there won't be
+any possibility of the file system reordering anyway. This appears to be
+a huge performance problem. I've read claims in the archives that
+the buffers are written asynchronously, but my reading of the
+code says otherwise. Can someone point out my error?
+
+I only see calls that ultimately call FileWrite or write(2), which will
+block without an O_NONBLOCK open. I thought one of the main reasons
+for having a WAL is so that you can write out the buffers asynchronously.
+
+What am I missing?
+
+I wrote:
+> > Then during execution if the planner turned out to be VERY wrong about
+> > certain assumptions the execution system could update the stats
+> > that led to those wrong assumptions. That way the system would seek
+> > the correct values automatically.
+
+tom lane replied:
+> That has been suggested before, but I'm unsure how to make it work.
+> There are a lot of parameters involved in any planning decision and it's
+> not obvious which ones to tweak, or in which direction, if the plan
+> turns out to be bad. But if you can come up with some ideas, go to
+> it!
+
+I'll have to look at the current planner before I can suggest
+anything concrete.
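+
+[On the cache-performance-reporting question above: the instrumentation needed
+for such a test is tiny. A hypothetical C sketch --- the counter names and the
+buf_lookup() routine are invented, not existing PostgreSQL code:]
+
+    #include <stdio.h>
+
+    /* Hypothetical stand-in for the routine that looks a block up in the
+     * shared buffer cache, returning NULL on a miss. */
+    extern void *buf_lookup(int file_id, long block_num);
+
+    static unsigned long cache_hits = 0;
+    static unsigned long cache_misses = 0;
+
+    /* Wrap the cache lookup so every hit and miss is counted. */
+    static void *
+    counted_lookup(int file_id, long block_num)
+    {
+        void *buf = buf_lookup(file_id, block_num);
+
+        if (buf != NULL)
+            cache_hits++;
+        else
+            cache_misses++;
+        return buf;
+    }
+
+    /* Dump the hit ratio, e.g. at backend exit or on demand. */
+    static void
+    report_hit_ratio(void)
+    {
+        unsigned long total = cache_hits + cache_misses;
+
+        if (total > 0)
+            fprintf(stderr, "buffer cache hit ratio: %.1f%% (%lu of %lu)\n",
+                    100.0 * cache_hits / total, cache_hits, total);
+    }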
+ +- Curtis + + +---------------------------(end of broadcast)--------------------------- +TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org +