mirror of https://git.postgresql.org/git/postgresql.git
synced 2024-09-30 16:11:29 +02:00

Add cost estimate discussion to TODO.detail.

This commit is contained in:
parent 07d89f6f81
commit 76e386d5e4
@@ -1059,7 +1059,7 @@ From owner-pgsql-hackers@hub.org Thu Jan 20 18:45:32 2000
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672
	for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:45:30 -0500 (EST)
-Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.19 $) with ESMTP id TAA01989 for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:39:15 -0500 (EST)
+Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.20 $) with ESMTP id TAA01989 for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:39:15 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id TAA00957;
	Thu, 20 Jan 2000 19:35:19 -0500 (EST)
@@ -2003,3 +2003,404 @@ your stats be out-of-date or otherwise misleading.
			regards, tom lane

From pgsql-hackers-owner+M29943@postgresql.org Thu Oct 3 18:18:27 2002
Return-path: <pgsql-hackers-owner+M29943@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MIOU23771
	for <pgman@candle.pha.pa.us>; Thu, 3 Oct 2002 18:18:25 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP
	id B9F51476570; Thu, 3 Oct 2002 18:18:21 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with SMTP
	id E083B4761B0; Thu, 3 Oct 2002 18:18:19 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP id 13ADC476063
	for <pgsql-hackers@postgresql.org>; Thu, 3 Oct 2002 18:18:17 -0400 (EDT)
Received: from acorn.he.net (acorn.he.net [64.71.137.130])
	by postgresql.org (Postfix) with ESMTP id 3AEC8475FFF
	for <pgsql-hackers@postgresql.org>; Thu, 3 Oct 2002 18:18:16 -0400 (EDT)
Received: from CurtisVaio ([63.164.0.47] (may be forged)) by acorn.he.net (8.8.6/8.8.2) with SMTP id PAA19215; Thu, 3 Oct 2002 15:18:14 -0700
From: "Curtis Faith" <curtis@galtair.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
cc: "Pgsql-Hackers" <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Advice: Where could I be of help?
Date: Thu, 3 Oct 2002 18:17:55 -0400
Message-ID: <DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
In-Reply-To: <13379.1033675158@sss.pgh.pa.us>
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
Importance: Normal
X-Virus-Scanned: by AMaViS new-20020517
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Virus-Scanned: by AMaViS new-20020517
Status: OR

tom lane wrote:
> But more globally, I think that our worst problems these days have to do
> with planner misestimations leading to bad plans.  The planner is
> usually *capable* of generating a good plan, but all too often it picks
> the wrong one.  We need work on improving the cost modeling equations
> to be closer to reality.  If that's at all close to your sphere of
> interest then I think it should be #1 priority --- it's both localized,
> which I think is important for a first project, and potentially a
> considerable win.

This seems like a very interesting problem. One approach I thought would
be interesting, and would solve the problem of trying to figure out the
right numbers, is to seed the estimates with guesses based on statistics
gathered during vacuum and general running, and then have the planner run
the "best" plan.

Then, during execution, if the planner turned out to be VERY wrong about
certain assumptions, the execution system could update the stats that led
to those wrong assumptions. That way the system would converge on the
correct values automatically. We could also gather the stats the system
produces for certain real databases and use those to make smarter initial
guesses.

I've found that I can never predict costs. I always end up testing
empirically and find myself surprised at the results.

We should be able to make the executor smart enough to keep counts of
actual costs (or a statistical approximation) without introducing any
significant overhead.

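[Editor's note: the feedback loop proposed above can be sketched in a few lines. This is an illustrative toy, not PostgreSQL code; the predicate key, the error threshold, and the smoothing constant are all assumptions made for the example.]

```python
def update_selectivity(stats, predicate, est_rows, actual_rows,
                       alpha=0.2, factor=10.0):
    """Nudge a stored selectivity toward what the executor observed,
    but only when the planner's row estimate was off by more than
    `factor`.  Exponential smoothing keeps one bad run from swinging
    the statistic too far (hypothetical feedback step)."""
    if est_rows <= 0 or actual_rows <= 0:
        return
    error = max(est_rows, actual_rows) / min(est_rows, actual_rows)
    if error > factor:
        # Scale the stored selectivity by the observed/estimated ratio,
        # then blend it with the old value.
        observed = stats[predicate] * (actual_rows / est_rows)
        stats[predicate] = (1 - alpha) * stats[predicate] + alpha * observed
```

A plan whose estimate is only mildly off leaves the statistics alone; only gross misestimates (here, a 10x error) trigger an update, which matches the "VERY wrong" wording above.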
tom lane also wrote:
> There is no "cache flushing".  We have a shared buffer cache management
> algorithm that's straight LRU across all buffers.  There's been some
> interest in smarter cache-replacement code; I believe Neil Conway is
> messing around with an LRU-2 implementation right now.  If you've got
> better ideas we're all ears.

Hmmm, this is the area that I think could lead to huge performance gains.

Consider a simple system with a table tbl_master that gets read by each
process many times but has very infrequent inserts and contains about
3,000 rows. The single but heavily used index for this table is a btree
of depth three, with 20 8K pages in its first two levels.

Another table, tbl_detail, has 10 indexes and gets very frequent inserts.
It holds over 300,000 rows, and some queries result in index scans over
the approximately 5,000 8K pages in the index.

There is a 40MB shared cache for this system.

Every time a query requiring the index scan runs, it will blow out the
entire cache, since the scan loads more blocks than the cache holds. Only
blocks that are accessed while the scan is going will survive. LRU is
bad, bad, bad!

LRU-2 might be better, but it seems like it still won't give enough
priority to the most frequently used blocks, and I don't see how it would
do better in the above case.

I once implemented a modified cache algorithm based on the clock
algorithm used for VM page caches. VM paging is similar to databases in
that there is definite locality of reference and certain pages are MUCH
more likely to be requested.

The basic idea was to have a flag in each block that represented the
access time in clock intervals. Imagine a clock hand sweeping across a
clock face: every access is like a tiny movement of the hand, and blocks
that are not accessed during a full sweep are candidates for removal.

My modification was to use access counts to increase the durability of
the more frequently accessed blocks. Each time a block is accessed, its
flag is shifted left (up to a maximum number of shifts, ShiftN) and 1 is
added to it. Every so many cache accesses (and synchronously when the
cache is full), a pass is made over each block, right-shifting the flags
(a clock sweep). This can also be done one block at a time on each
access, so the clock is directly linked to the cache access rate. Any
blocks whose flag reaches 0 are placed on a doubly linked list of
candidates for removal, and new cache blocks are allocated from that
list. Accessing a block on the candidate list simply removes it from the
list.

An index root node page would likely be accessed frequently enough that
all its bits would be set, so it would survive ShiftN clock sweeps.

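[Editor's note: the shift-register clock scheme described above can be sketched as follows. This is a minimal illustration of the idea, not the original implementation; the eviction loop sweeps on demand rather than on a fixed access interval.]

```python
class ClockCache:
    """Toy cache illustrating the shift-register clock scheme: each
    block carries a small counter ("flag").  On access the flag is
    shifted left (capped after ShiftN shifts) and 1 is added; a sweep
    right-shifts every flag, and blocks that decay to 0 become
    eviction candidates."""

    def __init__(self, capacity, shift_n=4):
        self.capacity = capacity
        self.max_flag = (1 << shift_n) - 1   # value after ShiftN shifts
        self.flags = {}                      # block id -> flag value

    def access(self, block):
        if block not in self.flags and len(self.flags) >= self.capacity:
            self._evict()
        flag = self.flags.get(block, 0)
        self.flags[block] = min((flag << 1) | 1, self.max_flag)

    def _sweep(self):
        # One clock sweep: right-shift every block's flag.
        for b in self.flags:
            self.flags[b] >>= 1

    def _evict(self):
        # Sweep until at least one block decays to 0, then drop a victim.
        while True:
            victims = [b for b, f in self.flags.items() if f == 0]
            if victims:
                del self.flags[victims[0]]
                return
            self._sweep()
```

A saturated block (flag = 2^ShiftN - 1) needs ShiftN sweeps to decay, so a hot index root outlives a stream of once-touched scan blocks, which is exactly the behavior claimed for the root node above.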
This algorithm increased the cache hit ratio from 40% to about 90%,
compared to a simple LRU mechanism, for the cases I tested. The paging
ratio is greatly dependent on the ratio of the actual database size to
the cache size.

The bottom line is that it is very important to keep frequently accessed
blocks in the cache. The top levels of large btrees are accessed many
hundreds of times (actually a power of the number of keys per page) more
frequently than the leaf pages. LRU can be the worst possible algorithm
for something like an index or table scan of a large table, since it
flushes a large number of potentially frequently accessed blocks in favor
of ones that are very unlikely to be retrieved again.

tom lane also wrote:
> This is an interesting area.  Keep in mind though that Postgres is a
> portable DB that tries to be agnostic about what kernel and filesystem
> it's sitting on top of --- and in any case it does not run as root, so
> has very limited ability to affect what the kernel/filesystem do.
> I'm not sure how much can be done without losing those portability
> advantages.

The kinds of things I was thinking about should be very portable. I found
that simply writing the cache in order of file system offset results in
greatly improved performance, since it lets the head seek in smaller
increments and much more smoothly, especially with modern disks. Most of
the time the file system will lay a file out as large sequential runs of
bytes on the physical disk, in order. The file might be in a few chunks,
but those chunks will be sequential and fairly large.

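[Editor's note: the ordering idea above amounts to one sort before issuing the writes. A minimal sketch, with assumed names; `write` stands in for whatever routine performs the actual I/O.]

```python
def flush_dirty_buffers(dirty, write):
    """Issue writes in (file, offset) order so the disk head moves in
    one direction instead of seeking back and forth.  `dirty` maps
    (file_id, block_offset) -> page bytes."""
    for file_id, offset in sorted(dirty):
        write(file_id, offset, dirty[(file_id, offset)])
```

Whatever order the buffer pool hands the pages over in, the device sees a single ascending pass per file.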
tom lane also wrote:
> Well, not really all that isolated.  The bottom-level index code doesn't
> know whether you're doing INSERT or UPDATE, and would have no easy
> access to the original tuple if it did know.  The original theory about
> this was that the planner could detect the situation where the index(es)
> don't overlap the set of columns being changed by the UPDATE, which
> would be nice since there'd be zero runtime overhead.  Unfortunately
> that breaks down if any BEFORE UPDATE triggers are fired that modify the
> tuple being stored.  So all in all it turns out to be a tad messy to fit
> this in :-(.  I am unconvinced that the impact would be huge anyway,
> especially as of 7.3 which has a shortcut path for dead index entries.

Well, this probably is not the right place to start then.

- Curtis


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M29945@postgresql.org Thu Oct 3 18:47:34 2002
Return-path: <pgsql-hackers-owner+M29945@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MlWU26068
	for <pgman@candle.pha.pa.us>; Thu, 3 Oct 2002 18:47:32 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP
	id F2AAE476306; Thu, 3 Oct 2002 18:47:27 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with SMTP
	id E7B5247604F; Thu, 3 Oct 2002 18:47:24 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP id 9ADCC4761A1
	for <pgsql-hackers@postgresql.org>; Thu, 3 Oct 2002 18:47:18 -0400 (EDT)
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
	by postgresql.org (Postfix) with ESMTP id DDB0B476187
	for <pgsql-hackers@postgresql.org>; Thu, 3 Oct 2002 18:47:17 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.12.5/8.12.5) with ESMTP id g93MlIhR015091;
	Thu, 3 Oct 2002 18:47:18 -0400 (EDT)
To: "Curtis Faith" <curtis@galtair.com>
cc: "Pgsql-Hackers" <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Advice: Where could I be of help?
In-Reply-To: <DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com>
References: <DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com>
Comments: In-reply-to "Curtis Faith" <curtis@galtair.com>
	message dated "Thu, 03 Oct 2002 18:17:55 -0400"
Date: Thu, 03 Oct 2002 18:47:17 -0400
Message-ID: <15090.1033685237@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Virus-Scanned: by AMaViS new-20020517
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Virus-Scanned: by AMaViS new-20020517
Status: OR

"Curtis Faith" <curtis@galtair.com> writes:
> Then, during execution, if the planner turned out to be VERY wrong about
> certain assumptions, the execution system could update the stats that led
> to those wrong assumptions. That way the system would converge on the
> correct values automatically.

That has been suggested before, but I'm unsure how to make it work.
There are a lot of parameters involved in any planning decision and it's
not obvious which ones to tweak, or in which direction, if the plan
turns out to be bad.  But if you can come up with some ideas, go to
it!

> Every time a query requiring the index scan runs, it will blow out the
> entire cache, since the scan loads more blocks than the cache holds.

Right, that's the scenario that kills simple LRU ...

> LRU-2 might be better, but it seems like it still won't give enough
> priority to the most frequently used blocks.

Blocks touched more than once per query (like the upper-level index
blocks) will survive under LRU-2.  Blocks touched once per query won't.
Seems to me that it should be a win.

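[Editor's note: the LRU-2 behavior claimed above can be demonstrated with a toy model. This is a simplified illustration of the general LRU-K idea, not any particular implementation; names and the tie-breaking rule are assumptions.]

```python
class LRU2Cache:
    """Toy LRU-2: the eviction victim is the page whose second-most-
    recent access is oldest.  Pages touched only once carry a
    second-last time of 0, so once-per-scan pages are evicted before
    pages (like upper index levels) that are touched repeatedly."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = 0                 # logical access counter
        self.hist = {}                 # page -> (last, second_last)

    def access(self, page):
        self.clock += 1
        if page not in self.hist and len(self.hist) >= self.capacity:
            # Smallest second-last time loses; break ties by last use.
            victim = min(self.hist,
                         key=lambda p: (self.hist[p][1], self.hist[p][0]))
            del self.hist[victim]
        last = self.hist.get(page, (0, 0))[0]
        self.hist[page] = (self.clock, last)
```

A page touched twice before a long scan keeps a nonzero second-last timestamp, so the scan's single-touch pages are recycled ahead of it, which is Tom's point about the upper-level index blocks surviving.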
> My modification was to use access counts to increase the durability of
> the more frequently accessed blocks.

You could do it that way too, but I'm unsure whether the extra
complexity will buy anything.  Ultimately, I think an LRU-anything
algorithm is equivalent to a clock sweep for those pages that only get
touched once per some-long-interval: the single-touch guys get recycled
in order of last use, which seems just like a clock sweep around the
cache.  The guys with some amount of preference get excluded from the
once-around sweep.  To determine whether LRU-2 is better or worse than
some other preference algorithm requires a finer grain of analysis than
this.  I'm not a fan of "more complex must be better", so I'd want to see
why it's better before buying into it ...

> The kinds of things I was thinking about should be very portable. I found
> that simply writing the cache in order of file system offset results in
> greatly improved performance, since it lets the head seek in smaller
> increments and much more smoothly, especially with modern disks.

Shouldn't the OS be responsible for scheduling those writes
appropriately?  Ye good olde elevator algorithm ought to handle this;
and it's at least one layer closer to the actual disk layout than we
are, thus more likely to issue the writes in a good order.  It's worth
experimenting with, perhaps, but I'm pretty dubious about it.

BTW, one other thing that Vadim kept saying we should do is alter the
cache management strategy to retain dirty blocks in memory (ie, give
some amount of preference to as-yet-unwritten dirty pages compared to
clean pages).  There is no reliability cost here since the WAL will let
us reconstruct any dirty pages if we crash before they get written; and
the periodic checkpoints will ensure that we eventually write a dirty
block and thus it will become available for recycling.  This seems like
a promising line of thought that's orthogonal to the basic
LRU-vs-whatever issue.  Nobody's got round to looking at it yet though.
I've got no idea how much preference should be given to a dirty block
--- not infinite, probably, but some.

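[Editor's note: the finite dirty-page preference Tom describes can be expressed as a scoring tweak on victim selection. A minimal sketch under stated assumptions: `dirty_bonus` is an invented tunable, not any actual PostgreSQL parameter.]

```python
def choose_victim(buffers, dirty_bonus=2):
    """Pick an eviction victim, giving as-yet-unwritten dirty pages a
    finite (not infinite) preference: a dirty page's recency score is
    boosted by `dirty_bonus`, so it loses ties to clean pages but can
    still be evicted if it goes unused long enough.
    `buffers` maps page id -> (last_used_tick, is_dirty)."""
    def score(page):
        last_used, dirty = buffers[page]
        return last_used + (dirty_bonus if dirty else 0)
    return min(buffers, key=score)
```

Setting `dirty_bonus=0` recovers plain least-recently-used selection, which makes the "how much preference" question above a one-parameter experiment.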
			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From pgsql-hackers-owner+M29974@postgresql.org Fri Oct 4 01:28:54 2002
Return-path: <pgsql-hackers-owner+M29974@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g945SpU13476
	for <pgman@candle.pha.pa.us>; Fri, 4 Oct 2002 01:28:52 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP
	id 63999476BB2; Fri, 4 Oct 2002 01:26:56 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with SMTP
	id BB7CA476B85; Fri, 4 Oct 2002 01:26:54 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP id 5FD7E476759
	for <pgsql-hackers@postgresql.org>; Fri, 4 Oct 2002 01:26:52 -0400 (EDT)
Received: from mclean.mail.mindspring.net (mclean.mail.mindspring.net [207.69.200.57])
	by postgresql.org (Postfix) with ESMTP id 1F4A14766D8
	for <pgsql-hackers@postgresql.org>; Fri, 4 Oct 2002 01:26:51 -0400 (EDT)
Received: from 1cust163.tnt1.st-thomas.vi.da.uu.net ([200.58.4.163] helo=CurtisVaio)
	by mclean.mail.mindspring.net with smtp (Exim 3.33 #1)
	id 17xKzB-0000yK-00; Fri, 04 Oct 2002 01:26:49 -0400
From: "Curtis Faith" <curtis@galtair.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
cc: "Pgsql-Hackers" <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Advice: Where could I be of help?
Date: Fri, 4 Oct 2002 01:26:36 -0400
Message-ID: <DMEEJMCDOJAKPPFACMPMIECECEAA.curtis@galtair.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
In-Reply-To: <15090.1033685237@sss.pgh.pa.us>
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
Importance: Normal
X-Virus-Scanned: by AMaViS new-20020517
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Virus-Scanned: by AMaViS new-20020517
Status: OR

I wrote:
> > My modification was to use access counts to increase the
> > durability of the more accessed blocks.

tom lane replied:
> You could do it that way too, but I'm unsure whether the extra
> complexity will buy anything.  Ultimately, I think an LRU-anything
> algorithm is equivalent to a clock sweep for those pages that only get
> touched once per some-long-interval: the single-touch guys get recycled
> in order of last use, which seems just like a clock sweep around the
> cache.  The guys with some amount of preference get excluded from the
> once-around sweep.  To determine whether LRU-2 is better or worse than
> some other preference algorithm requires a finer grain of analysis than
> this.  I'm not a fan of "more complex must be better", so I'd want to see
> why it's better before buying into it ...

I'm definitely not a fan of "more complex must be better" either. In
fact, it's surprising how often the real performance problems are easy
and simple to fix, while many person-years are spent solving the issue
everyone "knows" must be causing the performance problems, only to find
little gain.

The key here is empirical testing. If the cache hit ratio for LRU-2 is
much better, then there may be no need here. OTOH, it took less than 30
lines or so of code to do what I described, so I don't consider it too,
too "more complex" :=} We should run a test that includes scanning
indexes (or is "indices" the PostgreSQL convention?) that are three or
more times the size of the cache to see how well LRU-2 works. Is there
any cache performance reporting built into pgsql?

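[Editor's note: the measurement being asked for is just a pair of counters around buffer lookups. A minimal sketch with invented names, purely illustrative and not PostgreSQL's actual statistics machinery.]

```python
class CacheStats:
    """Count buffer-cache hits and misses and report the hit ratio,
    the number the LRU-vs-LRU-2 comparison above would be judged by."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```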
tom lane wrote:
> Shouldn't the OS be responsible for scheduling those writes
> appropriately?  Ye good olde elevator algorithm ought to handle this;
> and it's at least one layer closer to the actual disk layout than we
> are, thus more likely to issue the writes in a good order.  It's worth
> experimenting with, perhaps, but I'm pretty dubious about it.

I wasn't proposing anything other than changing the order of the writes,
not actually ensuring that they get written that way at the level you
describe above. This will help a lot on brain-dead file systems that
can't do this ordering, and probably also in cases where the number of
blocks in the cache is very large.

On a related note, while looking at the code, it seems to me that we are
writing out the buffer cache synchronously, so there won't be any
possibility of the file system reordering anyway. This appears to be a
huge performance problem. I've read claims in the archives that the
buffers are written asynchronously, but my read of the code says
otherwise. Can someone point out my error?

I only see calls that ultimately call FileWrite or write(2), which will
block without an O_NONBLOCK open. I thought one of the main reasons for
having a WAL is so that you can write out the buffers asynchronously.

What am I missing?

I wrote:
> > Then, during execution, if the planner turned out to be VERY wrong
> > about certain assumptions, the execution system could update the stats
> > that led to those wrong assumptions. That way the system would
> > converge on the correct values automatically.

tom lane replied:
> That has been suggested before, but I'm unsure how to make it work.
> There are a lot of parameters involved in any planning decision and it's
> not obvious which ones to tweak, or in which direction, if the plan
> turns out to be bad.  But if you can come up with some ideas, go to
> it!

I'll have to look at the current planner before I can suggest anything
concrete.

- Curtis


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
