From 76e386d5e4d7150932e30336a559cbe3c3658ec3 Mon Sep 17 00:00:00 2001 From: Bruce Momjian Date: Sat, 24 May 2003 03:59:06 +0000 Subject: [PATCH] Add cost estimate discussion to TODO.detail. --- doc/TODO.detail/optimizer | 403 +++++++++++++++++++++++++++++++++++++- 1 file changed, 402 insertions(+), 1 deletion(-) diff --git a/doc/TODO.detail/optimizer b/doc/TODO.detail/optimizer index 194ca349d3..fd6324f367 100644 --- a/doc/TODO.detail/optimizer +++ b/doc/TODO.detail/optimizer @@ -1059,7 +1059,7 @@ From owner-pgsql-hackers@hub.org Thu Jan 20 18:45:32 2000 Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672 for ; Thu, 20 Jan 2000 19:45:30 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.19 $) with ESMTP id TAA01989 for ; Thu, 20 Jan 2000 19:39:15 -0500 (EST) +Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.20 $) with ESMTP id TAA01989 for ; Thu, 20 Jan 2000 19:39:15 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.9.3/8.9.3) with SMTP id TAA00957; Thu, 20 Jan 2000 19:35:19 -0500 (EST) @@ -2003,3 +2003,404 @@ your stats be out-of-date or otherwise misleading. regards, tom lane +From pgsql-hackers-owner+M29943@postgresql.org Thu Oct 3 18:18:27 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MIOU23771 + for ; Thu, 3 Oct 2002 18:18:25 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP + id B9F51476570; Thu, 3 Oct 2002 18:18:21 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id E083B4761B0; Thu, 3 Oct 2002 18:18:19 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP id 13ADC476063 + for ; Thu, 3 Oct 2002 18:18:17 -0400 (EDT) +Received: from acorn.he.net (acorn.he.net [64.71.137.130]) + by postgresql.org (Postfix) with ESMTP id 3AEC8475FFF + for ; Thu, 3 Oct 2002 18:18:16 -0400 (EDT) +Received: from CurtisVaio ([63.164.0.47] (may be forged)) by acorn.he.net (8.8.6/8.8.2) with SMTP id PAA19215; Thu, 3 Oct 2002 15:18:14 -0700 +From: "Curtis Faith" +To: "Tom Lane" +cc: "Pgsql-Hackers" +Subject: Re: [HACKERS] Advice: Where could I be of help? +Date: Thu, 3 Oct 2002 18:17:55 -0400 +Message-ID: +MIME-Version: 1.0 +Content-Type: text/plain; + charset="iso-8859-1" +Content-Transfer-Encoding: 7bit +X-Priority: 3 (Normal) +X-MSMail-Priority: Normal +X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) +In-Reply-To: <13379.1033675158@sss.pgh.pa.us> +X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700 +Importance: Normal +X-Virus-Scanned: by AMaViS new-20020517 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Virus-Scanned: by AMaViS new-20020517 +Status: OR + +tom lane wrote: +> But more globally, I think that our worst problems these days have to do +> with planner misestimations leading to bad plans. The planner is +> usually *capable* of generating a good plan, but all too often it picks +> the wrong one. We need work on improving the cost modeling equations +> to be closer to reality. If that's at all close to your sphere of +> interest then I think it should be #1 priority --- it's both localized, +> which I think is important for a first project, and potentially a +> considerable win. + +This seems like a very interesting problem. 
One of the ways I thought would be interesting, and would solve the problem of
+trying to figure out the right numbers, is to make initial guesses for the
+actual values based on statistics gathered during vacuum and general running,
+and then have the planner run the "best" plan.
+
+Then, during execution, if the planner turned out to be VERY wrong about
+certain assumptions, the execution system could update the stats that led to
+those wrong assumptions. That way the system would seek the correct values
+automatically. We could also gather the stats that the system produces for
+certain actual databases and then use those to make smarter initial guesses.
+
+I've found that I can never predict costs. I always end up testing
+empirically and find myself surprised at the results.
+
+We should be able to make the executor smart enough to keep count of actual
+costs (or a statistical approximation) without introducing any significant
+overhead.
+
+tom lane also wrote:
+> There is no "cache flushing". We have a shared buffer cache management
+> algorithm that's straight LRU across all buffers. There's been some
+> interest in smarter cache-replacement code; I believe Neil Conway is
+> messing around with an LRU-2 implementation right now. If you've got
+> better ideas we're all ears.
+
+Hmmm, this is the area that I think could lead to huge performance gains.
+
+Consider a simple system with a table tbl_master that gets read by each
+process many times but sees very infrequent inserts and contains about
+3,000 rows. The single but heavily used index for this table is contained in
+a btree of depth three, with 20 8K pages in the first two levels of
+the btree.
+
+Another table, tbl_detail, has 10 indices and gets very frequent inserts.
+It contains over 300,000 rows. Some queries result in index scans over the
+approximately 5,000 8K pages in the index.
+
+There is a 40M shared cache for this system.
+
+Every time a query which requires the index scan runs, it will blow out the
+entire cache, since the scan will load more blocks than the cache holds. Only
+blocks that are accessed while the scan is running will survive. LRU is bad,
+bad, bad!
+
+LRU-2 might be better, but it seems like it still won't give enough priority
+to the most frequently used blocks. I don't see how it would do better for
+the above case.
+
+I once implemented a modified cache algorithm that was based on the clock
+algorithm for VM page caches. VM paging is similar to databases in that
+there is definite locality of reference and certain pages are MUCH more
+likely to be requested.
+
+The basic idea was to have a flag in each block that represented the access
+time in clock intervals. Imagine a clock hand sweeping across a clock; every
+access is like a tiny movement of the clock hand. Blocks that are not
+accessed during a sweep are candidates for removal.
+
+My modification was to use access counts to increase the durability of the
+more accessed blocks. Each time a block is accessed, its flag is shifted
+left (up to a maximum number of shifts, ShiftN) and 1 is added to it.
+Every so many cache accesses (and synchronously when the cache is full), a
+pass is made over each block, right-shifting the flags (a clock sweep). This
+can also be done one block at a time on each access, so the clock is directly
+linked to the cache access rate. Any blocks with a flag of 0 are placed into
+a doubly linked list of candidates for removal. New cache blocks are
+allocated from the list of candidates, and accessing a block on the
+candidate list simply removes it from the list.
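+
+[A rough, self-contained C sketch of the flag-shifting clock scheme described
+above. All of the names here --- CacheBlock, SHIFT_N, block_accessed, and so
+on --- are invented for illustration; this is not taken from PostgreSQL or
+from the implementation described in the message.]
+
+    #include <stdbool.h>
+    #include <stddef.h>
+
+    #define SHIFT_N   5                       /* maximum number of left shifts */
+    #define FLAG_MAX  ((1 << SHIFT_N) - 1)
+
+    typedef struct CacheBlock
+    {
+        int  flag;                            /* shifted access count */
+        bool on_free_list;                    /* true while on candidate list */
+        struct CacheBlock *next, *prev;       /* doubly linked candidate list */
+    } CacheBlock;
+
+    /* Dummy head of the circular candidate list. */
+    static CacheBlock candidates = { 0, false, &candidates, &candidates };
+
+    static void
+    list_remove(CacheBlock *blk)
+    {
+        blk->prev->next = blk->next;
+        blk->next->prev = blk->prev;
+        blk->on_free_list = false;
+    }
+
+    static void
+    list_append(CacheBlock *blk)
+    {
+        blk->prev = candidates.prev;
+        blk->next = &candidates;
+        candidates.prev->next = blk;
+        candidates.prev = blk;
+        blk->on_free_list = true;
+    }
+
+    /* On every cache hit: shift the flag left (capped at SHIFT_N bits)
+     * and add one, so frequently hit blocks survive more sweeps. */
+    static void
+    block_accessed(CacheBlock *blk)
+    {
+        if (blk->on_free_list)
+            list_remove(blk);
+        blk->flag = ((blk->flag << 1) | 1) & FLAG_MAX;
+    }
+
+    /* One clock-sweep step: decay the flag; a block that reaches zero
+     * becomes a candidate for replacement. */
+    static void
+    sweep_block(CacheBlock *blk)
+    {
+        blk->flag >>= 1;
+        if (blk->flag == 0 && !blk->on_free_list)
+            list_append(blk);
+    }
+
+    /* Replacement victims come from the head of the candidate list;
+     * if it is empty the caller must force another sweep. */
+    static CacheBlock *
+    get_victim(void)
+    {
+        CacheBlock *blk;
+
+        if (candidates.next == &candidates)
+            return NULL;
+        blk = candidates.next;
+        list_remove(blk);
+        return blk;
+    }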
+
+An index root node page would likely be accessed frequently enough that all
+its bits would stay set, so it would take ShiftN clock sweeps before it could
+become a candidate for removal.
+
+This algorithm increased the cache hit ratio from 40% to about 90% for the
+cases I tested, when compared to a simple LRU mechanism. The paging ratio is
+greatly dependent on the ratio of the actual database size to the cache
+size.
+
+The bottom line is that it is very important to keep frequently accessed
+blocks in the cache. The top levels of large btrees are accessed many
+hundreds (actually a power of the number of keys in each page) of times more
+frequently than the leaf pages. LRU can be the worst possible algorithm for
+something like an index or table scan of a large table, since it flushes a
+large number of potentially frequently accessed blocks in favor of ones that
+are very unlikely to be retrieved again.
+
+tom lane also wrote:
+> This is an interesting area. Keep in mind though that Postgres is a
+> portable DB that tries to be agnostic about what kernel and filesystem
+> it's sitting on top of --- and in any case it does not run as root, so
+> has very limited ability to affect what the kernel/filesystem do.
+> I'm not sure how much can be done without losing those portability
+> advantages.
+
+The kinds of things I was thinking about should be very portable. I found
+that simply writing the cache out in order of file system offset results in
+greatly improved performance, since it lets the head seek in smaller
+increments and much more smoothly, especially with modern disks. Most of the
+time the file system will lay a file out as large sequential runs of bytes
+on the physical disk, in order. It might be in a few chunks, but those chunks
+will be sequential and fairly large. (A rough sketch of this kind of write
+ordering appears at the end of this message.)
+
+tom lane also wrote:
+> Well, not really all that isolated. The bottom-level index code doesn't
+> know whether you're doing INSERT or UPDATE, and would have no easy
+> access to the original tuple if it did know. The original theory about
+> this was that the planner could detect the situation where the index(es)
+> don't overlap the set of columns being changed by the UPDATE, which
+> would be nice since there'd be zero runtime overhead. Unfortunately
+> that breaks down if any BEFORE UPDATE triggers are fired that modify the
+> tuple being stored. So all in all it turns out to be a tad messy to fit
+> this in :-(. I am unconvinced that the impact would be huge anyway,
+> especially as of 7.3 which has a shortcut path for dead index entries.
+
+Well, this probably is not the right place to start then.
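+
+[A rough C sketch of the write-ordering idea mentioned above: collect the
+dirty blocks, sort them by file and block offset, and issue the writes in
+that order. The structure and names (DirtyBuf, write_block, and so on) are
+invented for illustration and do not correspond to the actual PostgreSQL
+buffer manager.]
+
+    #include <stdlib.h>
+
+    typedef struct DirtyBuf
+    {
+        int   file_id;           /* which relation file the block lives in */
+        long  block_num;         /* block offset within that file */
+        char *data;              /* the 8K page image */
+    } DirtyBuf;
+
+    /* Hypothetical stand-in for the real low-level write routine. */
+    extern void write_block(int file_id, long block_num, const char *data);
+
+    /* Order by file, then by physical offset within the file. */
+    static int
+    cmp_dirty(const void *a, const void *b)
+    {
+        const DirtyBuf *x = (const DirtyBuf *) a;
+        const DirtyBuf *y = (const DirtyBuf *) b;
+
+        if (x->file_id != y->file_id)
+            return (x->file_id < y->file_id) ? -1 : 1;
+        if (x->block_num != y->block_num)
+            return (x->block_num < y->block_num) ? -1 : 1;
+        return 0;
+    }
+
+    /* Flush dirty buffers in ascending file-offset order so the disk
+     * head sweeps in one direction instead of seeking back and forth. */
+    static void
+    flush_sorted(DirtyBuf *bufs, size_t nbufs)
+    {
+        size_t i;
+
+        qsort(bufs, nbufs, sizeof(DirtyBuf), cmp_dirty);
+        for (i = 0; i < nbufs; i++)
+            write_block(bufs[i].file_id, bufs[i].block_num, bufs[i].data);
+    }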
+ +- Curtis + + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster + +From pgsql-hackers-owner+M29945@postgresql.org Thu Oct 3 18:47:34 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MlWU26068 + for ; Thu, 3 Oct 2002 18:47:32 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP + id F2AAE476306; Thu, 3 Oct 2002 18:47:27 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id E7B5247604F; Thu, 3 Oct 2002 18:47:24 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP id 9ADCC4761A1 + for ; Thu, 3 Oct 2002 18:47:18 -0400 (EDT) +Received: from sss.pgh.pa.us (unknown [192.204.191.242]) + by postgresql.org (Postfix) with ESMTP id DDB0B476187 + for ; Thu, 3 Oct 2002 18:47:17 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.12.5/8.12.5) with ESMTP id g93MlIhR015091; + Thu, 3 Oct 2002 18:47:18 -0400 (EDT) +To: "Curtis Faith" +cc: "Pgsql-Hackers" +Subject: Re: [HACKERS] Advice: Where could I be of help? +In-Reply-To: +References: +Comments: In-reply-to "Curtis Faith" + message dated "Thu, 03 Oct 2002 18:17:55 -0400" +Date: Thu, 03 Oct 2002 18:47:17 -0400 +Message-ID: <15090.1033685237@sss.pgh.pa.us> +From: Tom Lane +X-Virus-Scanned: by AMaViS new-20020517 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Virus-Scanned: by AMaViS new-20020517 +Status: OR + +"Curtis Faith" writes: +> Then during execution if the planner turned out to be VERY wrong about +> certain assumptions the execution system could update the stats that led to +> those wrong assumptions. That way the system would seek the correct values +> automatically. + +That has been suggested before, but I'm unsure how to make it work. +There are a lot of parameters involved in any planning decision and it's +not obvious which ones to tweak, or in which direction, if the plan +turns out to be bad. But if you can come up with some ideas, go to +it! + +> Everytime a query which requires the index scan runs it will blow out the +> entire cache since the scan will load more blocks than the cache +> holds. + +Right, that's the scenario that kills simple LRU ... + +> LRU-2 might be better but it seems like it still won't give enough priority +> to the most frequently used blocks. + +Blocks touched more than once per query (like the upper-level index +blocks) will survive under LRU-2. Blocks touched once per query won't. +Seems to me that it should be a win. + +> My modification was to use access counts to increase the durability of the +> more accessed blocks. + +You could do it that way too, but I'm unsure whether the extra +complexity will buy anything. Ultimately, I think an LRU-anything +algorithm is equivalent to a clock sweep for those pages that only get +touched once per some-long-interval: the single-touch guys get recycled +in order of last use, which seems just like a clock sweep around the +cache. The guys with some amount of preference get excluded from the +once-around sweep. To determine whether LRU-2 is better or worse than +some other preference algorithm requires a finer grain of analysis than +this. I'm not a fan of "more complex must be better", so I'd want to see +why it's better before buying into it ... 
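+
+[For reference, the core of the LRU-2 policy under discussion, as a small
+illustrative C fragment --- not Neil Conway's implementation, just the basic
+rule: each buffer remembers its last two access times, and the victim is the
+buffer whose second-most-recent access is oldest, so blocks touched only once
+by a big scan are recycled before the repeatedly touched index pages.]
+
+    #include <stddef.h>
+
+    #define NEVER 0                 /* "no second-to-last access yet" */
+
+    typedef struct Buf
+    {
+        unsigned long last_access;  /* most recent access time */
+        unsigned long prev_access;  /* second-most-recent access, or NEVER */
+    } Buf;
+
+    static unsigned long clock_tick = 0;
+
+    /* Record an access: the previous "last" becomes the penultimate. */
+    static void
+    lru2_touch(Buf *b)
+    {
+        b->prev_access = b->last_access;
+        b->last_access = ++clock_tick;
+    }
+
+    /* Pick the buffer with the oldest second-to-last access time,
+     * breaking ties by the last access time. Single-touch buffers
+     * (prev_access == NEVER) sort first, so they get recycled in
+     * order of last use --- the "once-around sweep" described above. */
+    static Buf *
+    lru2_victim(Buf *bufs, size_t nbufs)
+    {
+        Buf   *victim = &bufs[0];
+        size_t i;
+
+        for (i = 1; i < nbufs; i++)
+        {
+            if (bufs[i].prev_access < victim->prev_access ||
+                (bufs[i].prev_access == victim->prev_access &&
+                 bufs[i].last_access < victim->last_access))
+                victim = &bufs[i];
+        }
+        return victim;
+    }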
+ +> The kinds of things I was thinking about should be very portable. I found +> that simply writing the cache in order of the file system offset results in +> very greatly improved performance since it lets the head seek in smaller +> increments and much more smoothly, especially with modern disks. + +Shouldn't the OS be responsible for scheduling those writes +appropriately? Ye good olde elevator algorithm ought to handle this; +and it's at least one layer closer to the actual disk layout than we +are, thus more likely to issue the writes in a good order. It's worth +experimenting with, perhaps, but I'm pretty dubious about it. + +BTW, one other thing that Vadim kept saying we should do is alter the +cache management strategy to retain dirty blocks in memory (ie, give +some amount of preference to as-yet-unwritten dirty pages compared to +clean pages). There is no reliability cost here since the WAL will let +us reconstruct any dirty pages if we crash before they get written; and +the periodic checkpoints will ensure that we eventually write a dirty +block and thus it will become available for recycling. This seems like +a promising line of thought that's orthogonal to the basic +LRU-vs-whatever issue. Nobody's got round to looking at it yet though. +I've got no idea how much preference should be given to a dirty block +--- not infinite, probably, but some. + + regards, tom lane + +---------------------------(end of broadcast)--------------------------- +TIP 5: Have you checked our extensive FAQ? + +http://www.postgresql.org/users-lounge/docs/faq.html + +From pgsql-hackers-owner+M29974@postgresql.org Fri Oct 4 01:28:54 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g945SpU13476 + for ; Fri, 4 Oct 2002 01:28:52 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP + id 63999476BB2; Fri, 4 Oct 2002 01:26:56 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id BB7CA476B85; Fri, 4 Oct 2002 01:26:54 -0400 (EDT) +Received: from localhost (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with ESMTP id 5FD7E476759 + for ; Fri, 4 Oct 2002 01:26:52 -0400 (EDT) +Received: from mclean.mail.mindspring.net (mclean.mail.mindspring.net [207.69.200.57]) + by postgresql.org (Postfix) with ESMTP id 1F4A14766D8 + for ; Fri, 4 Oct 2002 01:26:51 -0400 (EDT) +Received: from 1cust163.tnt1.st-thomas.vi.da.uu.net ([200.58.4.163] helo=CurtisVaio) + by mclean.mail.mindspring.net with smtp (Exim 3.33 #1) + id 17xKzB-0000yK-00; Fri, 04 Oct 2002 01:26:49 -0400 +From: "Curtis Faith" +To: "Tom Lane" +cc: "Pgsql-Hackers" +Subject: Re: [HACKERS] Advice: Where could I be of help? +Date: Fri, 4 Oct 2002 01:26:36 -0400 +Message-ID: +MIME-Version: 1.0 +Content-Type: text/plain; + charset="iso-8859-1" +Content-Transfer-Encoding: 7bit +X-Priority: 3 (Normal) +X-MSMail-Priority: Normal +X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) +In-Reply-To: <15090.1033685237@sss.pgh.pa.us> +X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700 +Importance: Normal +X-Virus-Scanned: by AMaViS new-20020517 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Virus-Scanned: by AMaViS new-20020517 +Status: OR + +I wrote: + +> > My modification was to use access counts to increase the +> durability of the +> > more accessed blocks. 
+>
+
+tom lane replies:
+> You could do it that way too, but I'm unsure whether the extra
+> complexity will buy anything. Ultimately, I think an LRU-anything
+> algorithm is equivalent to a clock sweep for those pages that only get
+> touched once per some-long-interval: the single-touch guys get recycled
+> in order of last use, which seems just like a clock sweep around the
+> cache. The guys with some amount of preference get excluded from the
+> once-around sweep. To determine whether LRU-2 is better or worse than
+> some other preference algorithm requires a finer grain of analysis than
+> this. I'm not a fan of "more complex must be better", so I'd want to see
+> why it's better before buying into it ...
+
+I'm definitely not a fan of "more complex must be better" either. In fact,
+it's surprising how often the real performance problems have simple, easy
+fixes, while many person-years are spent solving the issue everyone "knows"
+must be causing the performance problems, only to find little gain.
+
+The key here is empirical testing. If the cache hit ratio for LRU-2 is
+much better, then there may be no need here. OTOH, it took less than
+30 lines or so of code to do what I described, so I don't consider it
+too, too "more complex" :=} We should run a test that includes index
+scans over indexes (or is "indices" the PostgreSQL convention?) that are
+three or more times the size of the cache, to see how well LRU-2 works.
+Is there any cache performance reporting built into pgsql? (A sketch of
+the kind of counters I have in mind follows at the end of this message.)
+
+tom lane wrote:
+> Shouldn't the OS be responsible for scheduling those writes
+> appropriately? Ye good olde elevator algorithm ought to handle this;
+> and it's at least one layer closer to the actual disk layout than we
+> are, thus more likely to issue the writes in a good order. It's worth
+> experimenting with, perhaps, but I'm pretty dubious about it.
+
+I wasn't proposing anything other than changing the order of the writes,
+not actually ensuring that they get written that way at the level you
+describe above. This will help a lot on brain-dead file systems that
+can't do this ordering themselves, and probably also in cases where the
+number of blocks in the cache is very large.
+
+On a related note, while looking at the code, it seems to me that we
+are writing out the buffer cache synchronously, so there won't be
+any possibility of the file system reordering anyway. This appears to be
+a huge performance problem. I've read claims in the archives that
+the buffers are written asynchronously, but my reading of the
+code says otherwise. Can someone point out my error?
+
+I only see calls that ultimately call FileWrite or write(2), which will
+block without an O_NONBLOCK open. I thought one of the main reasons
+for having a WAL is so that you can write out the buffers asynchronously.
+
+What am I missing?
+
+I wrote:
+> > Then during execution if the planner turned out to be VERY wrong about
+> > certain assumptions the execution system could update the stats
+> > that led to those wrong assumptions. That way the system would seek
+> > the correct values automatically.
+
+tom lane replied:
+> That has been suggested before, but I'm unsure how to make it work.
+> There are a lot of parameters involved in any planning decision and it's
+> not obvious which ones to tweak, or in which direction, if the plan
+> turns out to be bad. But if you can come up with some ideas, go to
+> it!
+
+I'll have to look at the current planner before I can suggest
+anything concrete.
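+
+[On the cache-performance-reporting question above: the instrumentation needed
+for such a test is tiny. A hypothetical C sketch --- the counter names and the
+buf_lookup() routine are invented, not existing PostgreSQL code:]
+
+    #include <stdio.h>
+
+    /* Hypothetical stand-in for the routine that looks a block up in the
+     * shared buffer cache, returning NULL on a miss. */
+    extern void *buf_lookup(int file_id, long block_num);
+
+    static unsigned long cache_hits = 0;
+    static unsigned long cache_misses = 0;
+
+    /* Wrap the cache lookup so every hit and miss is counted. */
+    static void *
+    counted_lookup(int file_id, long block_num)
+    {
+        void *buf = buf_lookup(file_id, block_num);
+
+        if (buf != NULL)
+            cache_hits++;
+        else
+            cache_misses++;
+        return buf;
+    }
+
+    /* Dump the hit ratio, e.g. at backend exit or on demand. */
+    static void
+    report_hit_ratio(void)
+    {
+        unsigned long total = cache_hits + cache_misses;
+
+        if (total > 0)
+            fprintf(stderr, "buffer cache hit ratio: %.1f%% (%lu of %lu)\n",
+                    100.0 * cache_hits / total, cache_hits, total);
+    }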
+ +- Curtis + + +---------------------------(end of broadcast)--------------------------- +TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org +