Add raw file discussion to performance TODO.detail.

Bruce Momjian 2002-08-26 01:04:13 +00:00
parent 7e3f2449d8
commit e21e02ab12
1 changed file with 796 additions and 3 deletions

@ -345,7 +345,7 @@ From owner-pgsql-hackers@hub.org Tue Oct 19 10:31:10 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id KAA29087
for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:31:08 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.13 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
Received: from localhost (majordom@localhost)
by hub.org (8.9.3/8.9.3) with SMTP id KAA30328;
Tue, 19 Oct 1999 10:12:10 -0400 (EDT)
@ -454,7 +454,7 @@ From owner-pgsql-hackers@hub.org Tue Oct 19 21:25:30 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA28130
for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:25:26 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.13 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
Received: from localhost (majordom@localhost)
by hub.org (8.9.3/8.9.3) with SMTP id VAA50745;
Tue, 19 Oct 1999 21:07:23 -0400 (EDT)
@ -1006,7 +1006,7 @@ From pgsql-general-owner+M2497@hub.org Fri Jun 16 18:31:03 2000
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA04165
for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:31:01 -0400 (EDT)
Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.13 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT)
Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT)
Received: from hub.org (majordom@localhost [127.0.0.1])
by hub.org (8.10.1/8.10.1) with SMTP id e5GLDaM14477;
Fri, 16 Jun 2000 17:13:36 -0400 (EDT)
@ -2239,3 +2239,796 @@ from 1 to "maybe" for nodes that get too dense.
Hannu
From pgsql-hackers-owner+M21991@postgresql.org Wed Apr 24 23:37:37 2002
Return-path: <pgsql-hackers-owner+M21991@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P3ba416337
for <pgman@candle.pha.pa.us>; Wed, 24 Apr 2002 23:37:36 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
by postgresql.org (Postfix) with SMTP
id CF13447622B; Wed, 24 Apr 2002 23:37:31 -0400 (EDT)
Received: from sraigw.sra.co.jp (sraigw.sra.co.jp [202.32.10.2])
by postgresql.org (Postfix) with ESMTP id 3EE92474E4B
for <pgsql-hackers@postgresql.org>; Wed, 24 Apr 2002 23:37:19 -0400 (EDT)
Received: from srascb.sra.co.jp (srascb [133.137.8.65])
by sraigw.sra.co.jp (8.9.3/3.7W-sraigw) with ESMTP id MAA76393;
Thu, 25 Apr 2002 12:35:44 +0900 (JST)
Received: (from root@localhost)
by srascb.sra.co.jp (8.11.6/8.11.6) id g3P3ZCK64299;
Thu, 25 Apr 2002 12:35:12 +0900 (JST)
(envelope-from t-ishii@sra.co.jp)
Received: from sranhm.sra.co.jp (sranhm [133.137.170.62])
by srascb.sra.co.jp (8.11.6/8.11.6av) with ESMTP id g3P3ZBV64291;
Thu, 25 Apr 2002 12:35:11 +0900 (JST)
(envelope-from t-ishii@sra.co.jp)
Received: from localhost (IDENT:t-ishii@srapc1474.sra.co.jp [133.137.170.59])
by sranhm.sra.co.jp (8.9.3+3.2W/3.7W-srambox) with ESMTP id MAA25562;
Thu, 25 Apr 2002 12:35:43 +0900
To: tgl@sss.pgh.pa.us
cc: cjs@cynic.net, pgman@candle.pha.pa.us, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net>
<12342.1019705420@sss.pgh.pa.us>
X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1
=?iso-2022-jp?B?KBskQjAqGyhCKQ==?=
MIME-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <20020425123429E.t-ishii@sra.co.jp>
Date: Thu, 25 Apr 2002 12:34:29 +0900
From: Tatsuo Ishii <t-ishii@sra.co.jp>
X-Dispatcher: imput version 20000228(IM140)
Lines: 12
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR
> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFAICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?
A long time ago I tested with a 32k block size and got a 1.5-2x speed up
compared with the ordinary 8k block size in the sequential scan case.
FYI, in case it is relevant.
--
Tatsuo Ishii
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/users-lounge/docs/faq.html
From mloftis@wgops.com Thu Apr 25 01:43:14 2002
Return-path: <mloftis@wgops.com>
Received: from free.wgops.com (root@dsl092-002-178.sfo1.dsl.speakeasy.net [66.92.2.178])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P5hC426529
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 01:43:13 -0400 (EDT)
Received: from wgops.com ([10.1.2.207])
by free.wgops.com (8.11.3/8.11.3) with ESMTP id g3P5hBR43020;
Wed, 24 Apr 2002 22:43:11 -0700 (PDT)
(envelope-from mloftis@wgops.com)
Message-ID: <3CC7976F.7070407@wgops.com>
Date: Wed, 24 Apr 2002 22:43:11 -0700
From: Michael Loftis <mloftis@wgops.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4.1) Gecko/20020314 Netscape6/6.2.2
X-Accept-Language: en-us
MIME-Version: 1.0
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net> <12342.1019705420@sss.pgh.pa.us>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Status: OR
Tom Lane wrote:
>Curt Sampson <cjs@cynic.net> writes:
>
>>Grabbing bigger chunks is always optimal, AFAICT, if they're not
>>*too* big and you use the data. A single 64K read takes very little
>>longer than a single 8K read.
>>
>
>Proof?
>
I second this statement.
It's optimal to a point. I know that my system settles into its best
read speeds at 32K or 64K chunks. 8K chunks are far below optimal for my
system. Most systems I work on do far better at 16K than at 8K, and
most don't see any degradation when going to 32K chunks. (This is
across numerous OSes and configs -- the results are interpretations of
Bonnie disk I/O benchmarks.)
Depending on what you're doing, it is more efficient to read bigger
blocks, up to a point. If you're multi-threaded or reading in non-blocking
mode, take as big a chunk as you can handle or are ready to process in
quick order. If you're picking up a bunch of little chunks here and
there and know you're not using them again, then choose a size that will
hopefully cause some of the reads to overlap; failing that, pick the
smallest usable read size.
The OS can never do that stuff for you.
From cjs@cynic.net Thu Apr 25 03:29:05 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7T3404027
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:29:03 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
by angelic.cynic.net (Postfix) with ESMTP
id 1C44E870E; Thu, 25 Apr 2002 16:28:51 +0900 (JST)
Date: Thu, 25 Apr 2002 16:28:51 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
Message-ID: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR
On Wed, 24 Apr 2002, Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFAICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?
Well, there are various sorts of "proof" for this assertion. What
sort do you want?
Here's a few samples; if you're looking for something different to
satisfy you, let's discuss it.
1. Theoretical proof: two components of the delay in retrieving a
block from disk are the disk arm movement and the wait for the
right block to rotate under the head.
When retrieving, say, eight adjacent blocks, these will be spread
across no more than two cylinders (with luck, only one). The worst
case access time for a single block is the disk arm movement plus
the full rotational wait; this is the same as the worst case for
eight blocks if they're all on one cylinder. If they're not on one
cylinder, they're still on adjacent cylinders, requiring a very
short seek.
2. Proof by others using it: SQL server uses 64K reads when doing
table scans, as they say that their research indicates that the
major limitation is usually the number of I/O requests, not the
I/O capacity of the disk. BSD's FFS explicitly separates the optimum
allocation size for storage (1K fragments) and optimum read size
(8K blocks) because they found performance to be much better when
a larger size block was read. Most file system vendors, too, do
read-ahead for this very reason.
3. Proof by testing. I wrote a little ruby program to seek to a
random point in the first 2 GB of my raw disk partition and read
1-8 8K blocks of data. (This was done as one I/O request.) (Using
the raw disk partition I avoid any filesystem buffering.) Here are
typical results:
125 reads of 16x8K blocks: 1.9 sec, 66.04 req/sec. 15.1 ms/req, 0.946 ms/block
250 reads of 8x8K blocks: 1.9 sec, 132.3 req/sec. 7.56 ms/req, 0.945 ms/block
500 reads of 4x8K blocks: 2.5 sec, 199 req/sec. 5.03 ms/req, 1.26 ms/block
1000 reads of 2x8K blocks: 3.8 sec, 261.6 req/sec. 3.82 ms/req, 1.91 ms/block
2000 reads of 1x8K blocks: 6.4 sec, 310.4 req/sec. 3.22 ms/req, 3.22 ms/block
The ratios of data retrieval speed per read for groups of adjacent
8K blocks, assuming a single 8K block reads in 1 time unit, are:
1 block 1.00
2 blocks 1.18
4 blocks 1.56
8 blocks 2.34
16 blocks 4.68
At less than 20% more expensive, certainly two-block read requests
could be considered to cost "very little more" than one-block read
requests. Even four-block read requests are only half-again as
expensive. And if you know you're really going to be using the
data, read in 8 block chunks and your cost per block (in terms of
time) drops to less than a third of the cost of single-block reads.
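(A minimal sketch of the kind of raw-partition read test described in
point 3 above -- here in Python rather than the original ruby program,
with a hypothetical raw device path; device names, permissions, and
alignment rules vary by OS:)
    import os, random, time

    RAW_DEV = "/dev/rwd0d"   # hypothetical BSD-style raw partition (bypasses the FS cache)
    BLOCK = 8192             # 8K, matching the PostgreSQL page size
    SPAN = 2 * 1024**3       # confine the random seeks to the first 2 GB

    def run(blocks_per_read, reads):
        fd = os.open(RAW_DEV, os.O_RDONLY)        # typically needs root
        request = blocks_per_read * BLOCK
        start = time.time()
        for _ in range(reads):
            # pick a random 8K-aligned offset and issue the whole request as one read
            offset = random.randrange((SPAN - request) // BLOCK) * BLOCK
            os.pread(fd, request, offset)
        elapsed = time.time() - start
        os.close(fd)
        print("%5d reads of %2dx8K blocks: %.1f sec, %.1f req/sec, %.2f ms/req"
              % (reads, blocks_per_read, elapsed, reads / elapsed,
                 1000.0 * elapsed / reads))

    for blocks, reads in [(16, 125), (8, 250), (4, 500), (2, 1000), (1, 2000)]:
        run(blocks, reads)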
Let me put paid to comments about multiple simultaneous readers
making this invalid. Here's a typical result I get with four
instances of the program running simultaneously:
125 reads of 16x8K blocks: 4.4 sec, 28.21 req/sec. 35.4 ms/req, 2.22 ms/block
250 reads of 8x8K blocks: 3.9 sec, 64.88 req/sec. 15.4 ms/req, 1.93 ms/block
500 reads of 4x8K blocks: 5.8 sec, 86.52 req/sec. 11.6 ms/req, 2.89 ms/block
1000 reads of 2x8K blocks: 10 sec, 100.2 req/sec. 9.98 ms/req, 4.99 ms/block
2000 reads of 1x8K blocks: 18 sec, 110 req/sec. 9.09 ms/req, 9.09 ms/block
Here's the ratio table again, with another column comparing the
aggregate number of requests per second for one process and four
processes:
1 block 1.00 310 : 440
2 blocks 1.10 262 : 401
4 blocks 1.28 199 : 346
8 blocks 1.69 132 : 260
16 blocks 3.89 66 : 113
Note that here the relative increase in performance for increasing
sizes of reads is even *better* until we get past 64K chunks. The
overall throughput is better, of course, because with more requests
per second coming in, the disk seek ordering code has more to work
with, and the average time spent seeking versus reading will be
reduced.
You know, this is not rocket science; I'm sure there must be papers
all over the place about this. If anybody still disagrees that it's
a good thing to read chunks up to 64K or so when the blocks are
adjacent and you know you'll need the data, I'd like to see some
tangible evidence to support that.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
From cjs@cynic.net Thu Apr 25 03:55:59 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7tv405489
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:55:57 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
by angelic.cynic.net (Postfix) with ESMTP
id 188EC870E; Thu, 25 Apr 2002 16:55:51 +0900 (JST)
Date: Thu, 25 Apr 2002 16:55:50 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <200204250404.g3P44OI19061@candle.pha.pa.us>
Message-ID: <Pine.NEB.4.43.0204251636550.3111-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR
On Thu, 25 Apr 2002, Bruce Momjian wrote:
> Well, we are guilty of trying to push as much as possible on to other
> software. We do this for portability reasons, and because we think our
> time is best spent dealing with db issues, not issues that can be dealt
> with by other existing software, as long as the software is decent.
That's fine. I think that's a perfectly fair thing to do.
It was just the wording (i.e., "it's this other software's fault
that blah de blah") that got to me. To say, "We don't do readahead
because most OSes supply it, and we feel that other things would
help more to improve performance," is fine by me. Or even, "Well,
nobody feels like doing it. You want it, do it yourself," I have
no problem with.
> Sure, that is certainly true. However, it is hard to know what the
> future will hold even if we had perfect knowledge of what was happening
> in the kernel. We don't know who else is going to start doing I/O once
> our I/O starts. We may have a better idea with kernel knowledge, but we
> still don't know 100% what will be cached.
Well, we do if we use raw devices and do our own caching, using
pages that are pinned in RAM. That was sort of what I was aiming
at for the long run.
> We have free-behind on our list.
Uh...can't do it, if you're relying on the OS to do the buffering.
How do you tell the OS that you're no longer going to use a page?
> I think LRU-K will do this quite well
> and be a nice general solution for more than just sequential scans.
LRU-K sounds like a great idea to me, as does putting pages read
for a table scan at the LRU end of the cache, rather than the MRU
(assuming we do something to ensure that they stay in cache until
read once, at any rate).
But again, great for your own cache, but doesn't work with the OS
cache. And I'm a bit scared to crank up too high the amount of
memory I give Postgres, lest the OS try to too aggressively buffer
all that I/O in what memory remains to it, and start blowing programs
(like maybe the backend binary itself) out of RAM. But maybe this
isn't typically a problem; I don't know.
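(A toy sketch of the scan-resistant insertion policy described above,
purely for illustration -- not PostgreSQL code. Pages fetched on behalf
of a sequential scan are parked at the LRU end of the cache so a large
table scan cannot evict the whole working set:)
    from collections import OrderedDict

    class ScanAwareLRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.pages = OrderedDict()   # oldest (LRU) first, newest (MRU) last

        def get(self, page_id, fetch, sequential=False):
            if page_id in self.pages:
                self.pages.move_to_end(page_id)        # hit: promote to the MRU end
                return self.pages[page_id]
            data = fetch(page_id)                      # miss: read the page from disk
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)         # evict the current LRU page
            self.pages[page_id] = data                 # normal reads land at the MRU end
            if sequential:
                self.pages.move_to_end(page_id, last=False)   # scan reads park at the LRU end
            return data

    # e.g. cache.get(blockno, read_block, sequential=True) inside a table scan
    # lets scanned pages be recycled first instead of flushing everything else.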
> There may be validity in this. It is easy to do (I think) and could be
> a win.
It didn't look too difficult to me when I looked at the code, and
you can see what kind of win it is from the response I just made
to Tom.
> > 1. It is *not* true that you have no idea where data is when
> > using a storage array or other similar system. While you
> > certainly ought not worry about things such as head positions
> > and so on, it's been a given for a long, long time that two
> > blocks that have close index numbers are going to be close
> > together in physical storage.
>
> SCSI drivers, for example, are pretty smart. Not sure we can take
> advantage of that from user-land I/O.
Looking at the NetBSD ones, I don't see what they're doing that's
so smart. (Aside from some awfully clever workarounds for stupid
hardware limitations that would otherwise kill performance.) What
sorts of "smart" are you referring to?
> Yes, but we are seeing some db's moving away from raw I/O.
Such as whom? And are you certain that they're moving to using the
OS buffer cache, too? MS SQL server, for example, uses the filesystem,
but turns off all buffering on those files.
> Our performance numbers beat most of the big db's already, so we must
> be doing something right.
Really? Do the performance numbers for simple, bulk operations
(imports, exports, table scans) beat the others handily? My intuition
says not, but I'll happily be convinced otherwise.
> Yes, but do we spend our time doing that. Is the payoff worth it, vs.
> working on other features. Sure it would be great to have all these
> fancy things, but is this where our time should be spent, considering
> other items on the TODO list?
I agree that these things need to be assessed.
> Jumping in and doing the I/O ourselves is a big undertaking, and looking
> at our TODO list, I am not sure if it is worth it right now.
Right. I'm not trying to say this is a critical priority, I'm just
trying to determine what we do right now, what we could do, and
the potential performance increase that would give us.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
From cjs@cynic.net Thu Apr 25 05:19:11 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P9J9412878
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 05:19:10 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
by angelic.cynic.net (Postfix) with ESMTP
id 50386870E; Thu, 25 Apr 2002 18:19:03 +0900 (JST)
Date: Thu, 25 Apr 2002 18:19:02 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
Message-ID: <Pine.NEB.4.43.0204251805000.3111-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR
On Thu, 25 Apr 2002, Curt Sampson wrote:
> Here's the ratio table again, with another column comparing the
> aggregate number of requests per second for one process and four
> processes:
>
Just for interest, I ran this again with 20 processes working
simultaneously. I did six runs at each blockread size and summed
the tps for each process to find the aggregate number of reads per
second during the test. I dropped the highest and the lowest ones,
and averaged the rest. Here's the new table:
1 proc 4 procs 20 procs
1 block 310 440 260
2 blocks 262 401 481
4 blocks 199 346 354
8 blocks 132 260 250
16 blocks 66 113 116
I'm not sure at all why performance gets so much *worse* with a lot of
contention on the 1-block reads. This could have something to do with
NetBSD, or its buffer cache, or my laptop's crappy little disk drive....
Or maybe I'm just running out of CPU.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
From tgl@sss.pgh.pa.us Thu Apr 25 09:54:35 2002
Return-path: <tgl@sss.pgh.pa.us>
Received: from sss.pgh.pa.us (root@[192.204.191.242])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3PDsY407038
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 09:54:34 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3PDsXF25059;
Thu, 25 Apr 2002 09:54:33 -0400 (EDT)
To: Curt Sampson <cjs@cynic.net>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
message dated "Thu, 25 Apr 2002 16:28:51 +0900"
Date: Thu, 25 Apr 2002 09:54:32 -0400
Message-ID: <25056.1019742872@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR
Curt Sampson <cjs@cynic.net> writes:
> 1. Theoretical proof: two components of the delay in retrieving a
> block from disk are the disk arm movement and the wait for the
> right block to rotate under the head.
> When retrieving, say, eight adjacent blocks, these will be spread
> across no more than two cylinders (with luck, only one).
Weren't you contending earlier that with modern disk mechs you really
have no idea where the data is? You're asserting as an article of
faith that the OS has been able to place the file's data blocks
optimally --- or at least well enough to avoid unnecessary seeks.
But just a few days ago I was getting told that random_page_cost
was BS because there could be no such placement.
I'm getting a tad tired of sweeping generalizations offered without
proof, especially when they conflict.
> 3. Proof by testing. I wrote a little ruby program to seek to a
> random point in the first 2 GB of my raw disk partition and read
> 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> the raw disk partition I avoid any filesystem buffering.)
And also ensure that you aren't testing the point at issue.
The point at issue is that *in the presence of kernel read-ahead*
it's quite unclear that there's any benefit to a larger request size.
Ideally the kernel will have the next block ready for you when you
ask, no matter what the request is.
There's been some talk of using the AIO interface (where available)
to "encourage" the kernel to do read-ahead. I don't foresee us
writing our own substitute filesystem to make this happen, however.
Oracle may have the manpower for that sort of boondoggle, but we
don't...
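(Not the AIO interface mentioned above, but as a loose illustration of
the same idea -- a user-land hint that asks the kernel to read ahead --
here is a sketch using the POSIX posix_fadvise(2) call, where available;
the path and range are placeholders:)
    import os

    PATH = "/some/table/file"                    # placeholder path
    fd = os.open(PATH, os.O_RDONLY)

    # declare that the file will be read sequentially (length 0 = to end of file)...
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    # ...or ask the kernel to start fetching a specific range before we read it
    os.posix_fadvise(fd, 0, 1 << 20, os.POSIX_FADV_WILLNEED)

    data = os.read(fd, 8192)   # with luck, already satisfied from the page cache
    os.close(fd)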
regards, tom lane
From pgsql-hackers-owner+M22053@postgresql.org Thu Apr 25 20:45:42 2002
Return-path: <pgsql-hackers-owner+M22053@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q0jg405210
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 20:45:42 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
by postgresql.org (Postfix) with SMTP
id 17CE6476270; Thu, 25 Apr 2002 20:45:38 -0400 (EDT)
Received: from doppelbock.patentinvestor.com (ip146.usw5.rb1.bel.nwlink.com [209.20.249.146])
by postgresql.org (Postfix) with ESMTP id 257DC47591C
for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 20:45:25 -0400 (EDT)
Received: (from kaf@localhost)
by doppelbock.patentinvestor.com (8.11.6/8.11.2) id g3Q0erX14397;
Thu, 25 Apr 2002 17:40:53 -0700
From: Kyle <kaf@nwlink.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
Date: Thu, 25 Apr 2002 17:40:53 -0700
To: PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
<25056.1019742872@sss.pgh.pa.us>
X-Mailer: VM 6.95 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr
Tom Lane wrote:
> ...
> Curt Sampson <cjs@cynic.net> writes:
> > 3. Proof by testing. I wrote a little ruby program to seek to a
> > random point in the first 2 GB of my raw disk partition and read
> > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> > the raw disk partition I avoid any filesystem buffering.)
>
> And also ensure that you aren't testing the point at issue.
> The point at issue is that *in the presence of kernel read-ahead*
> it's quite unclear that there's any benefit to a larger request size.
> Ideally the kernel will have the next block ready for you when you
> ask, no matter what the request is.
> ...
I have to agree with Tom. I think the numbers below show that with
kernel read-ahead, block size isn't an issue.
The big_file1 file used below is 2.0 gig of random data, and the
machine has 512 mb of main memory. This ensures that we're not
just getting cached data.
foreach i (4k 8k 16k 32k 64k 128k)
echo $i
time dd bs=$i if=big_file1 of=/dev/null
end
and the results:
bs user kernel elapsed
4k: 0.260 7.740 1:27.25
8k: 0.210 8.060 1:30.48
16k: 0.090 7.790 1:30.88
32k: 0.060 8.090 1:32.75
64k: 0.030 8.190 1:29.11
128k: 0.070 9.830 1:28.74
so with kernel read-ahead, we have basically the same elapsed (wall
time) regardless of block size. Sure, user time drops to a low at 64k
blocksize, but kernel time is increasing.
You could argue that this is a contrived example: no other I/O is
being done. Well, I created a second 2.0g file (big_file2) and did two
simultaneous reads from the same disk. Sure, performance went to hell,
but it shows block size is still irrelevant in a multi-I/O environment
with sequential read-ahead.
foreach i ( 4k 8k 16k 32k 64k 128k )
echo $i
time dd bs=$i if=big_file1 of=/dev/null &
time dd bs=$i if=big_file2 of=/dev/null &
wait
end
bs user kernel elapsed
4k: 0.480 8.290 6:34.13 bigfile1
0.320 8.730 6:34.33 bigfile2
8k: 0.250 7.580 6:31.75
0.180 8.450 6:31.88
16k: 0.150 8.390 6:32.47
0.100 7.900 6:32.55
32k: 0.190 8.460 6:24.72
0.060 8.410 6:24.73
64k: 0.060 9.350 6:25.05
0.150 9.240 6:25.13
128k: 0.090 10.610 6:33.14
0.110 11.320 6:33.31
The differences in read times are basically in the mud. Block size
just doesn't matter much with the kernel doing read-ahead.
-Kyle
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
From pgsql-hackers-owner+M22055@postgresql.org Thu Apr 25 22:19:07 2002
Return-path: <pgsql-hackers-owner+M22055@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2J7411254
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:19:07 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
by postgresql.org (Postfix) with SMTP
id F3924476208; Thu, 25 Apr 2002 22:19:02 -0400 (EDT)
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
by postgresql.org (Postfix) with ESMTP id 6741D474E71
for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 22:18:50 -0400 (EDT)
Received: (from pgman@localhost)
by candle.pha.pa.us (8.11.6/8.10.1) id g3Q2Ili11246;
Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200204260218.g3Q2Ili11246@candle.pha.pa.us>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
To: Kyle <kaf@nwlink.com>
Date: Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR
Nice test. Would you test simultaneous 'dd' on the same file, perhaps
with a slight delay between the two so they don't read each other's
blocks?
seek() in the file will turn off read-ahead in most OS's. I am not
saying this is a major issue for PostgreSQL but the numbers would be
interesting.
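(For concreteness, one way the suggested test could be scripted -- a
Python sketch that assumes the same big_file1 used above; two processes
scan the same file sequentially, the second starting a few seconds later
so it is not simply served out of blocks the first reader just cached:)
    import multiprocessing, os, time

    BIG_FILE = "big_file1"    # assumed 2 GB test file, larger than RAM
    REQUEST = 32 * 1024       # per-read request size; vary to repeat the matrix above

    def reader(delay_s):
        time.sleep(delay_s)
        fd = os.open(BIG_FILE, os.O_RDONLY)
        start = time.time()
        while os.read(fd, REQUEST):               # sequential scan to end of file
            pass
        os.close(fd)
        print("reader delayed %gs: %.1f sec elapsed" % (delay_s, time.time() - start))

    if __name__ == "__main__":
        procs = [multiprocessing.Process(target=reader, args=(d,)) for d in (0.0, 5.0)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()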
---------------------------------------------------------------------------
Kyle wrote:
> Tom Lane wrote:
> > ...
> > Curt Sampson <cjs@cynic.net> writes:
> > > 3. Proof by testing. I wrote a little ruby program to seek to a
> > > random point in the first 2 GB of my raw disk partition and read
> > > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> > > the raw disk partition I avoid any filesystem buffering.)
> >
> > And also ensure that you aren't testing the point at issue.
> > The point at issue is that *in the presence of kernel read-ahead*
> > it's quite unclear that there's any benefit to a larger request size.
> > Ideally the kernel will have the next block ready for you when you
> > ask, no matter what the request is.
> > ...
>
> I have to agree with Tom. I think the numbers below show that with
> kernel read-ahead, block size isn't an issue.
>
> The big_file1 file used below is 2.0 gig of random data, and the
> machine has 512 mb of main memory. This ensures that we're not
> just getting cached data.
>
> foreach i (4k 8k 16k 32k 64k 128k)
> echo $i
> time dd bs=$i if=big_file1 of=/dev/null
> end
>
> and the results:
>
> bs user kernel elapsed
> 4k: 0.260 7.740 1:27.25
> 8k: 0.210 8.060 1:30.48
> 16k: 0.090 7.790 1:30.88
> 32k: 0.060 8.090 1:32.75
> 64k: 0.030 8.190 1:29.11
> 128k: 0.070 9.830 1:28.74
>
> so with kernel read-ahead, we have basically the same elapsed (wall
> time) regardless of block size. Sure, user time drops to a low at 64k
> blocksize, but kernel time is increasing.
>
>
> You could argue that this is a contrived example: no other I/O is
> being done. Well, I created a second 2.0g file (big_file2) and did two
> simultaneous reads from the same disk. Sure, performance went to hell,
> but it shows block size is still irrelevant in a multi-I/O environment
> with sequential read-ahead.
>
> foreach i ( 4k 8k 16k 32k 64k 128k )
> echo $i
> time dd bs=$i if=big_file1 of=/dev/null &
> time dd bs=$i if=big_file2 of=/dev/null &
> wait
> end
>
> bs user kernel elapsed
> 4k: 0.480 8.290 6:34.13 bigfile1
> 0.320 8.730 6:34.33 bigfile2
> 8k: 0.250 7.580 6:31.75
> 0.180 8.450 6:31.88
> 16k: 0.150 8.390 6:32.47
> 0.100 7.900 6:32.55
> 32k: 0.190 8.460 6:24.72
> 0.060 8.410 6:24.73
> 64k: 0.060 9.350 6:25.05
> 0.150 9.240 6:25.13
> 128k: 0.090 10.610 6:33.14
> 0.110 11.320 6:33.31
>
>
> The differences in read times are basically in the mud. Block size
> just doesn't matter much with the kernel doing read-ahead.
>
> -Kyle
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
From cjs@cynic.net Thu Apr 25 22:27:23 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2RL411868
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:27:22 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
by angelic.cynic.net (Postfix) with ESMTP
id AF60C870E; Fri, 26 Apr 2002 11:27:17 +0900 (JST)
Date: Fri, 26 Apr 2002 11:27:17 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
Message-ID: <Pine.NEB.4.43.0204261028110.449-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR
On Thu, 25 Apr 2002, Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > 1. Theoretical proof: two components of the delay in retrieving a
> > block from disk are the disk arm movement and the wait for the
> > right block to rotate under the head.
>
> > When retrieving, say, eight adjacent blocks, these will be spread
> > across no more than two cylinders (with luck, only one).
>
> Weren't you contending earlier that with modern disk mechs you really
> have no idea where the data is?
No, that was someone else. I contend that with pretty much any
large-scale storage mechanism (i.e., anything beyond ramdisks),
you will find that accessing two adjacent blocks is almost always
1) close to as fast as accessing just the one, and 2) much, much
faster than accessing two blocks that are relatively far apart.
There will be the odd case where the two adjacent blocks are
physically far apart, but this is rare.
If this idea doesn't hold true, the whole idea that sequential
reads are faster than random reads falls apart, and the optimizer
shouldn't even have the option to make random reads cost more, much
less have it set to four rather than one (or whatever it's set to).
> You're asserting as an article of
> faith that the OS has been able to place the file's data blocks
> optimally --- or at least well enough to avoid unnecessary seeks.
So are you, in the optimizer. But that's all right; the OS often
can and does do this placement; the FFS filesystem is explicitly
designed to do this sort of thing. If the filesystem isn't empty
and the files grow a lot they'll be split into large fragments,
but the fragments will be contiguous.
> But just a few days ago I was getting told that random_page_cost
> was BS because there could be no such placement.
I've been arguing against that point as well.
> And also ensure that you aren't testing the point at issue.
> The point at issue is that *in the presence of kernel read-ahead*
> it's quite unclear that there's any benefit to a larger request size.
I will test this.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC