Add raw file discussion to performance TODO.detail.

Bruce Momjian 2002-08-26 01:04:13 +00:00
parent 7e3f2449d8
commit e21e02ab12
1 changed file with 796 additions and 3 deletions

@ -345,7 +345,7 @@ From owner-pgsql-hackers@hub.org Tue Oct 19 10:31:10 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id KAA29087
for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:31:08 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.13 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
Received: from localhost (majordom@localhost)
by hub.org (8.9.3/8.9.3) with SMTP id KAA30328;
Tue, 19 Oct 1999 10:12:10 -0400 (EDT)
@ -454,7 +454,7 @@ From owner-pgsql-hackers@hub.org Tue Oct 19 21:25:30 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA28130
for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:25:26 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.13 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
Received: from localhost (majordom@localhost)
by hub.org (8.9.3/8.9.3) with SMTP id VAA50745;
Tue, 19 Oct 1999 21:07:23 -0400 (EDT)
@ -1006,7 +1006,7 @@ From pgsql-general-owner+M2497@hub.org Fri Jun 16 18:31:03 2000
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA04165
for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:31:01 -0400 (EDT)
Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.13 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT)
Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT)
Received: from hub.org (majordom@localhost [127.0.0.1])
by hub.org (8.10.1/8.10.1) with SMTP id e5GLDaM14477;
Fri, 16 Jun 2000 17:13:36 -0400 (EDT)
@ -2239,3 +2239,796 @@ from 1 to "maybe" for nodes that get too dense.
Hannu
From pgsql-hackers-owner+M21991@postgresql.org Wed Apr 24 23:37:37 2002
Return-path: <pgsql-hackers-owner+M21991@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P3ba416337
for <pgman@candle.pha.pa.us>; Wed, 24 Apr 2002 23:37:36 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
by postgresql.org (Postfix) with SMTP
id CF13447622B; Wed, 24 Apr 2002 23:37:31 -0400 (EDT)
Received: from sraigw.sra.co.jp (sraigw.sra.co.jp [202.32.10.2])
by postgresql.org (Postfix) with ESMTP id 3EE92474E4B
for <pgsql-hackers@postgresql.org>; Wed, 24 Apr 2002 23:37:19 -0400 (EDT)
Received: from srascb.sra.co.jp (srascb [133.137.8.65])
by sraigw.sra.co.jp (8.9.3/3.7W-sraigw) with ESMTP id MAA76393;
Thu, 25 Apr 2002 12:35:44 +0900 (JST)
Received: (from root@localhost)
by srascb.sra.co.jp (8.11.6/8.11.6) id g3P3ZCK64299;
Thu, 25 Apr 2002 12:35:12 +0900 (JST)
(envelope-from t-ishii@sra.co.jp)
Received: from sranhm.sra.co.jp (sranhm [133.137.170.62])
by srascb.sra.co.jp (8.11.6/8.11.6av) with ESMTP id g3P3ZBV64291;
Thu, 25 Apr 2002 12:35:11 +0900 (JST)
(envelope-from t-ishii@sra.co.jp)
Received: from localhost (IDENT:t-ishii@srapc1474.sra.co.jp [133.137.170.59])
by sranhm.sra.co.jp (8.9.3+3.2W/3.7W-srambox) with ESMTP id MAA25562;
Thu, 25 Apr 2002 12:35:43 +0900
To: tgl@sss.pgh.pa.us
cc: cjs@cynic.net, pgman@candle.pha.pa.us, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net>
<12342.1019705420@sss.pgh.pa.us>
X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1
=?iso-2022-jp?B?KBskQjAqGyhCKQ==?=
MIME-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <20020425123429E.t-ishii@sra.co.jp>
Date: Thu, 25 Apr 2002 12:34:29 +0900
From: Tatsuo Ishii <t-ishii@sra.co.jp>
X-Dispatcher: imput version 20000228(IM140)
Lines: 12
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR
> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFAICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?
A long time ago I tested with a 32k block size and got a 1.5-2x speed up
compared with the ordinary 8k block size in the sequential scan case.
FYI, in case it is relevant.
--
Tatsuo Ishii
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/users-lounge/docs/faq.html
From mloftis@wgops.com Thu Apr 25 01:43:14 2002
Return-path: <mloftis@wgops.com>
Received: from free.wgops.com (root@dsl092-002-178.sfo1.dsl.speakeasy.net [66.92.2.178])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P5hC426529
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 01:43:13 -0400 (EDT)
Received: from wgops.com ([10.1.2.207])
by free.wgops.com (8.11.3/8.11.3) with ESMTP id g3P5hBR43020;
Wed, 24 Apr 2002 22:43:11 -0700 (PDT)
(envelope-from mloftis@wgops.com)
Message-ID: <3CC7976F.7070407@wgops.com>
Date: Wed, 24 Apr 2002 22:43:11 -0700
From: Michael Loftis <mloftis@wgops.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4.1) Gecko/20020314 Netscape6/6.2.2
X-Accept-Language: en-us
MIME-Version: 1.0
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net> <12342.1019705420@sss.pgh.pa.us>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Status: OR
Tom Lane wrote:
>Curt Sampson <cjs@cynic.net> writes:
>
>>Grabbing bigger chunks is always optimal, AFAICT, if they're not
>>*too* big and you use the data. A single 64K read takes very little
>>longer than a single 8K read.
>>
>
>Proof?
>
I second this statement.
It's optimal to a point. I know that my system settles into its best
read speeds at 32K or 64K chunks. 8K chunks are far below optimal for my
system. Most systems I work on do far better at 16K than at 8K, and
most don't see any degradation when going to 32K chunks. (This is
across numerous OSes and configs -- the results are interpretations of
Bonnie disk I/O benchmarks.)
Depending on what you're doing, it is more efficient to read bigger
blocks, up to a point. If you're multi-threaded or reading in non-blocking
mode, take as big a chunk as you can handle or are ready to process in
quick order. If you're picking up a bunch of little chunks here and
there and know you're not using them again, then choose a size that will
hopefully cause some of the reads to overlap; failing that, pick the
smallest usable read size.
The OS can never do that stuff for you.
From cjs@cynic.net Thu Apr 25 03:29:05 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7T3404027
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:29:03 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
by angelic.cynic.net (Postfix) with ESMTP
id 1C44E870E; Thu, 25 Apr 2002 16:28:51 +0900 (JST)
Date: Thu, 25 Apr 2002 16:28:51 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
Message-ID: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR
On Wed, 24 Apr 2002, Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFAICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?
Well, there are various sorts of "proof" for this assertion. What
sort do you want?
Here's a few samples; if you're looking for something different to
satisfy you, let's discuss it.
1. Theoretical proof: two components of the delay in retrieving a
block from disk are the disk arm movement and the wait for the
right block to rotate under the head.
When retrieving, say, eight adjacent blocks, these will be spread
across no more than two cylinders (with luck, only one). The worst
case access time for a single block is the disk arm movement plus
the full rotational wait; this is the same as the worst case for
eight blocks if they're all on one cylinder. If they're not on one
cylinder, they're still on adjacent cylinders, requiring a very
short seek.
2. Proof by others using it: SQL server uses 64K reads when doing
table scans, as they say that their research indicates that the
major limitation is usually the number of I/O requests, not the
I/O capacity of the disk. BSD's FFS explicitly separates the optimum
allocation size for storage (1K fragments) and optimum read size
(8K blocks) because they found performance to be much better when
a larger size block was read. Most file system vendors, too, do
read-ahead for this very reason.
3. Proof by testing. I wrote a little ruby program to seek to a
random point in the first 2 GB of my raw disk partition and read
1-8 8K blocks of data. (This was done as one I/O request.) (Using
the raw disk partition I avoid any filesystem buffering.) Here are
typical results:
125 reads of 16x8K blocks: 1.9 sec, 66.04 req/sec. 15.1 ms/req, 0.946 ms/block
250 reads of 8x8K blocks: 1.9 sec, 132.3 req/sec. 7.56 ms/req, 0.945 ms/block
500 reads of 4x8K blocks: 2.5 sec, 199 req/sec. 5.03 ms/req, 1.26 ms/block
1000 reads of 2x8K blocks: 3.8 sec, 261.6 req/sec. 3.82 ms/req, 1.91 ms/block
2000 reads of 1x8K blocks: 6.4 sec, 310.4 req/sec. 3.22 ms/req, 3.22 ms/block
The ratios of data retrieval speed per read for groups of adjacent
8K blocks, assuming a single 8K block reads in 1 time unit, are:
1 block 1.00
2 blocks 1.18
4 blocks 1.56
8 blocks 2.34
16 blocks 4.68
At less than 20% more expensive, certainly two-block read requests
could be considered to cost "very little more" than one-block read
requests. Even four-block read requests are only half-again as
expensive. And if you know you're really going to be using the
data, read in 8 block chunks and your cost per block (in terms of
time) drops to less than a third of the cost of single-block reads.
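(A minimal sketch of the kind of raw-partition read test described in
point 3 above -- here in Python rather than the original ruby program,
with a hypothetical raw device path; device names, permissions, and
alignment rules vary by OS:)
    import os, random, time

    RAW_DEV = "/dev/rwd0d"   # hypothetical BSD-style raw partition (bypasses the FS cache)
    BLOCK = 8192             # 8K, matching the PostgreSQL page size
    SPAN = 2 * 1024**3       # confine the random seeks to the first 2 GB

    def run(blocks_per_read, reads):
        fd = os.open(RAW_DEV, os.O_RDONLY)        # typically needs root
        request = blocks_per_read * BLOCK
        start = time.time()
        for _ in range(reads):
            # pick a random 8K-aligned offset and issue the whole request as one read
            offset = random.randrange((SPAN - request) // BLOCK) * BLOCK
            os.pread(fd, request, offset)
        elapsed = time.time() - start
        os.close(fd)
        print("%5d reads of %2dx8K blocks: %.1f sec, %.1f req/sec, %.2f ms/req"
              % (reads, blocks_per_read, elapsed, reads / elapsed,
                 1000.0 * elapsed / reads))

    for blocks, reads in [(16, 125), (8, 250), (4, 500), (2, 1000), (1, 2000)]:
        run(blocks, reads)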
Let me put paid to comments about multiple simultaneous readers
making this invalid. Here's a typical result I get with four
instances of the program running simultaneously:
125 reads of 16x8K blocks: 4.4 sec, 28.21 req/sec. 35.4 ms/req, 2.22 ms/block
250 reads of 8x8K blocks: 3.9 sec, 64.88 req/sec. 15.4 ms/req, 1.93 ms/block
500 reads of 4x8K blocks: 5.8 sec, 86.52 req/sec. 11.6 ms/req, 2.89 ms/block
1000 reads of 2x8K blocks: 10 sec, 100.2 req/sec. 9.98 ms/req, 4.99 ms/block
2000 reads of 1x8K blocks: 18 sec, 110 req/sec. 9.09 ms/req, 9.09 ms/block
Here's the ratio table again, with another column comparing the
aggregate number of requests per second for one process and four
processes:
1 block 1.00 310 : 440
2 blocks 1.10 262 : 401
4 blocks 1.28 199 : 346
8 blocks 1.69 132 : 260
16 blocks 3.89 66 : 113
Note that here the relative increase in performance for increasing
sizes of reads is even *better* until we get past 64K chunks. The
overall throughput is better, of course, because with more requests
per second coming in, the disk seek ordering code has more to work
with, and the average time spent seeking versus reading will be
reduced.
You know, this is not rocket science; I'm sure there must be papers
all over the place about this. If anybody still disagrees that it's
a good thing to read chunks up to 64K or so when the blocks are
adjacent and you know you'll need the data, I'd like to see some
tangible evidence to support that.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
From cjs@cynic.net Thu Apr 25 03:55:59 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7tv405489
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:55:57 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
by angelic.cynic.net (Postfix) with ESMTP
id 188EC870E; Thu, 25 Apr 2002 16:55:51 +0900 (JST)
Date: Thu, 25 Apr 2002 16:55:50 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <200204250404.g3P44OI19061@candle.pha.pa.us>
Message-ID: <Pine.NEB.4.43.0204251636550.3111-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR
On Thu, 25 Apr 2002, Bruce Momjian wrote:
> Well, we are guilty of trying to push as much as possible on to other
> software. We do this for portability reasons, and because we think our
> time is best spent dealing with db issues, not issues that can be dealt
> with by other existing software, as long as the software is decent.
That's fine. I think that's a perfectly fair thing to do.
It was just the wording (i.e., "it's this other software's fault
that blah de blah") that got to me. To say, "We don't do readahead
because most OSes supply it, and we feel that other things would
help more to improve performance," is fine by me. Or even, "Well,
nobody feels like doing it. You want it, do it yourself," I have
no problem with.
> Sure, that is certainly true. However, it is hard to know what the
> future will hold even if we had perfect knowledge of what was happening
> in the kernel. We don't know who else is going to start doing I/O once
> our I/O starts. We may have a better idea with kernel knowledge, but we
> still don't know 100% what will be cached.
Well, we do if we use raw devices and do our own caching, using
pages that are pinned in RAM. That was sort of what I was aiming
at for the long run.
> We have free-behind on our list.
Uh...can't do it, if you're relying on the OS to do the buffering.
How do you tell the OS that you're no longer going to use a page?
> I think LRU-K will do this quite well
> and be a nice general solution for more than just sequential scans.
LRU-K sounds like a great idea to me, as does putting pages read
for a table scan at the LRU end of the cache, rather than the MRU
(assuming we do something to ensure that they stay in cache until
read once, at any rate).
But again, great for your own cache, but doesn't work with the OS
cache. And I'm a bit scared to crank up too high the amount of
memory I give Postgres, lest the OS try to too aggressively buffer
all that I/O in what memory remains to it, and start blowing programs
(like maybe the backend binary itself) out of RAM. But maybe this
isn't typically a problem; I don't know.
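(A toy sketch of the scan-resistant insertion policy described above,
purely for illustration -- not PostgreSQL code. Pages fetched on behalf
of a sequential scan are parked at the LRU end of the cache so a large
table scan cannot evict the whole working set:)
    from collections import OrderedDict

    class ScanAwareLRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.pages = OrderedDict()   # oldest (LRU) first, newest (MRU) last

        def get(self, page_id, fetch, sequential=False):
            if page_id in self.pages:
                self.pages.move_to_end(page_id)        # hit: promote to the MRU end
                return self.pages[page_id]
            data = fetch(page_id)                      # miss: read the page from disk
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)         # evict the current LRU page
            self.pages[page_id] = data                 # normal reads land at the MRU end
            if sequential:
                self.pages.move_to_end(page_id, last=False)   # scan reads park at the LRU end
            return data

    # e.g. cache.get(blockno, read_block, sequential=True) inside a table scan
    # lets scanned pages be recycled first instead of flushing everything else.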
> There may be validity in this. It is easy to do (I think) and could be
> a win.
It didn't look too difficult to me when I looked at the code, and
you can see what kind of win it is from the response I just made
to Tom.
> > 1. It is *not* true that you have no idea where data is when
> > using a storage array or other similar system. While you
> > certainly ought not worry about things such as head positions
> > and so on, it's been a given for a long, long time that two
> > blocks that have close index numbers are going to be close
> > together in physical storage.
>
> SCSI drivers, for example, are pretty smart. Not sure we can take
> advantage of that from user-land I/O.
Looking at the NetBSD ones, I don't see what they're doing that's
so smart. (Aside from some awfully clever workarounds for stupid
hardware limitations that would otherwise kill performance.) What
sorts of "smart" are you referring to?
> Yes, but we are seeing some db's moving away from raw I/O.
Such as whom? And are you certain that they're moving to using the
OS buffer cache, too? MS SQL server, for example, uses the filesystem,
but turns off all buffering on those files.
> Our performance numbers beat most of the big db's already, so we must
> be doing something right.
Really? Do the performance numbers for simple, bulk operations
(imports, exports, table scans) beat the others handily? My intuition
says not, but I'll happily be convinced otherwise.
> Yes, but do we spend our time doing that. Is the payoff worth it, vs.
> working on other features. Sure it would be great to have all these
> fancy things, but is this where our time should be spent, considering
> other items on the TODO list?
I agree that these things need to be assessed.
> Jumping in and doing the I/O ourselves is a big undertaking, and looking
> at our TODO list, I am not sure if it is worth it right now.
Right. I'm not trying to say this is a critical priority, I'm just
trying to determine what we do right now, what we could do, and
the potential performance increase that would give us.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
From cjs@cynic.net Thu Apr 25 05:19:11 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P9J9412878
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 05:19:10 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
by angelic.cynic.net (Postfix) with ESMTP
id 50386870E; Thu, 25 Apr 2002 18:19:03 +0900 (JST)
Date: Thu, 25 Apr 2002 18:19:02 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
Message-ID: <Pine.NEB.4.43.0204251805000.3111-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR
On Thu, 25 Apr 2002, Curt Sampson wrote:
> Here's the ratio table again, with another column comparing the
> aggregate number of requests per second for one process and four
> processes:
>
Just for interest, I ran this again with 20 processes working
simultaneously. I did six runs at each blockread size and summed
the tps for each process to find the aggregate number of reads per
second during the test. I dropped the highest and the lowest ones,
and averaged the rest. Here's the new table:
1 proc 4 procs 20 procs
1 block 310 440 260
2 blocks 262 401 481
4 blocks 199 346 354
8 blocks 132 260 250
16 blocks 66 113 116
I'm not sure at all why performance gets so much *worse* with a lot of
contention on the 1-block reads. This could have something to do with
NetBSD, or its buffer cache, or my laptop's crappy little disk drive....
Or maybe I'm just running out of CPU.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
From tgl@sss.pgh.pa.us Thu Apr 25 09:54:35 2002
Return-path: <tgl@sss.pgh.pa.us>
Received: from sss.pgh.pa.us (root@[192.204.191.242])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3PDsY407038
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 09:54:34 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3PDsXF25059;
Thu, 25 Apr 2002 09:54:33 -0400 (EDT)
To: Curt Sampson <cjs@cynic.net>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
message dated "Thu, 25 Apr 2002 16:28:51 +0900"
Date: Thu, 25 Apr 2002 09:54:32 -0400
Message-ID: <25056.1019742872@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR
Curt Sampson <cjs@cynic.net> writes:
> 1. Theoretical proof: two components of the delay in retrieving a
> block from disk are the disk arm movement and the wait for the
> right block to rotate under the head.
> When retrieving, say, eight adjacent blocks, these will be spread
> across no more than two cylinders (with luck, only one).
Weren't you contending earlier that with modern disk mechs you really
have no idea where the data is? You're asserting as an article of
faith that the OS has been able to place the file's data blocks
optimally --- or at least well enough to avoid unnecessary seeks.
But just a few days ago I was getting told that random_page_cost
was BS because there could be no such placement.
I'm getting a tad tired of sweeping generalizations offered without
proof, especially when they conflict.
> 3. Proof by testing. I wrote a little ruby program to seek to a
> random point in the first 2 GB of my raw disk partition and read
> 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> the raw disk partition I avoid any filesystem buffering.)
And also ensure that you aren't testing the point at issue.
The point at issue is that *in the presence of kernel read-ahead*
it's quite unclear that there's any benefit to a larger request size.
Ideally the kernel will have the next block ready for you when you
ask, no matter what the request is.
There's been some talk of using the AIO interface (where available)
to "encourage" the kernel to do read-ahead. I don't foresee us
writing our own substitute filesystem to make this happen, however.
Oracle may have the manpower for that sort of boondoggle, but we
don't...
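(Not the AIO interface mentioned above, but as a loose illustration of
the same idea -- a user-land hint that asks the kernel to read ahead --
here is a sketch using the POSIX posix_fadvise(2) call, where available;
the path and range are placeholders:)
    import os

    PATH = "/some/table/file"                    # placeholder path
    fd = os.open(PATH, os.O_RDONLY)

    # declare that the file will be read sequentially (length 0 = to end of file)...
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    # ...or ask the kernel to start fetching a specific range before we read it
    os.posix_fadvise(fd, 0, 1 << 20, os.POSIX_FADV_WILLNEED)

    data = os.read(fd, 8192)   # with luck, already satisfied from the page cache
    os.close(fd)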
regards, tom lane
From pgsql-hackers-owner+M22053@postgresql.org Thu Apr 25 20:45:42 2002
Return-path: <pgsql-hackers-owner+M22053@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q0jg405210
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 20:45:42 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
by postgresql.org (Postfix) with SMTP
id 17CE6476270; Thu, 25 Apr 2002 20:45:38 -0400 (EDT)
Received: from doppelbock.patentinvestor.com (ip146.usw5.rb1.bel.nwlink.com [209.20.249.146])
by postgresql.org (Postfix) with ESMTP id 257DC47591C
for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 20:45:25 -0400 (EDT)
Received: (from kaf@localhost)
by doppelbock.patentinvestor.com (8.11.6/8.11.2) id g3Q0erX14397;
Thu, 25 Apr 2002 17:40:53 -0700
From: Kyle <kaf@nwlink.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
Date: Thu, 25 Apr 2002 17:40:53 -0700
To: PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
<25056.1019742872@sss.pgh.pa.us>
X-Mailer: VM 6.95 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr
Tom Lane wrote:
> ...
> Curt Sampson <cjs@cynic.net> writes:
> > 3. Proof by testing. I wrote a little ruby program to seek to a
> > random point in the first 2 GB of my raw disk partition and read
> > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> > the raw disk partition I avoid any filesystem buffering.)
>
> And also ensure that you aren't testing the point at issue.
> The point at issue is that *in the presence of kernel read-ahead*
> it's quite unclear that there's any benefit to a larger request size.
> Ideally the kernel will have the next block ready for you when you
> ask, no matter what the request is.
> ...
I have to agree with Tom. I think the numbers below show that with
kernel read-ahead, block size isn't an issue.
The big_file1 file used below is 2.0 gig of random data, and the
machine has 512 mb of main memory. This ensures that we're not
just getting cached data.
foreach i (4k 8k 16k 32k 64k 128k)
echo $i
time dd bs=$i if=big_file1 of=/dev/null
end
and the results:
bs user kernel elapsed
4k: 0.260 7.740 1:27.25
8k: 0.210 8.060 1:30.48
16k: 0.090 7.790 1:30.88
32k: 0.060 8.090 1:32.75
64k: 0.030 8.190 1:29.11
128k: 0.070 9.830 1:28.74
so with kernel read-ahead, we have basically the same elapsed (wall
time) regardless of block size. Sure, user time drops to a low at 64k
blocksize, but kernel time is increasing.
You could argue that this is a contrived example: no other I/O is
being done. Well, I created a second 2.0g file (big_file2) and did two
simultaneous reads from the same disk. Sure, performance went to hell,
but it shows block size is still irrelevant in a multi-I/O environment
with sequential read-ahead.
foreach i ( 4k 8k 16k 32k 64k 128k )
echo $i
time dd bs=$i if=big_file1 of=/dev/null &
time dd bs=$i if=big_file2 of=/dev/null &
wait
end
bs user kernel elapsed
4k: 0.480 8.290 6:34.13 bigfile1
0.320 8.730 6:34.33 bigfile2
8k: 0.250 7.580 6:31.75
0.180 8.450 6:31.88
16k: 0.150 8.390 6:32.47
0.100 7.900 6:32.55
32k: 0.190 8.460 6:24.72
0.060 8.410 6:24.73
64k: 0.060 9.350 6:25.05
0.150 9.240 6:25.13
128k: 0.090 10.610 6:33.14
0.110 11.320 6:33.31
The differences in read times are basically in the mud. Block size
just doesn't matter much with the kernel doing read-ahead.
-Kyle
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
From pgsql-hackers-owner+M22055@postgresql.org Thu Apr 25 22:19:07 2002
Return-path: <pgsql-hackers-owner+M22055@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2J7411254
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:19:07 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
by postgresql.org (Postfix) with SMTP
id F3924476208; Thu, 25 Apr 2002 22:19:02 -0400 (EDT)
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
by postgresql.org (Postfix) with ESMTP id 6741D474E71
for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 22:18:50 -0400 (EDT)
Received: (from pgman@localhost)
by candle.pha.pa.us (8.11.6/8.10.1) id g3Q2Ili11246;
Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200204260218.g3Q2Ili11246@candle.pha.pa.us>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
To: Kyle <kaf@nwlink.com>
Date: Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR
Nice test. Would you test simultaneous 'dd' on the same file, perhaps
with a slight delay between the two so they don't read each other's
blocks?
seek() in the file will turn off read-ahead in most OS's. I am not
saying this is a major issue for PostgreSQL but the numbers would be
interesting.
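(For concreteness, one way the suggested test could be scripted -- a
Python sketch that assumes the same big_file1 used above; two processes
scan the same file sequentially, the second starting a few seconds later
so it is not simply served out of blocks the first reader just cached:)
    import multiprocessing, os, time

    BIG_FILE = "big_file1"    # assumed 2 GB test file, larger than RAM
    REQUEST = 32 * 1024       # per-read request size; vary to repeat the matrix above

    def reader(delay_s):
        time.sleep(delay_s)
        fd = os.open(BIG_FILE, os.O_RDONLY)
        start = time.time()
        while os.read(fd, REQUEST):               # sequential scan to end of file
            pass
        os.close(fd)
        print("reader delayed %gs: %.1f sec elapsed" % (delay_s, time.time() - start))

    if __name__ == "__main__":
        procs = [multiprocessing.Process(target=reader, args=(d,)) for d in (0.0, 5.0)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()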
---------------------------------------------------------------------------
Kyle wrote:
> Tom Lane wrote:
> > ...
> > Curt Sampson <cjs@cynic.net> writes:
> > > 3. Proof by testing. I wrote a little ruby program to seek to a
> > > random point in the first 2 GB of my raw disk partition and read
> > > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> > > the raw disk partition I avoid any filesystem buffering.)
> >
> > And also ensure that you aren't testing the point at issue.
> > The point at issue is that *in the presence of kernel read-ahead*
> > it's quite unclear that there's any benefit to a larger request size.
> > Ideally the kernel will have the next block ready for you when you
> > ask, no matter what the request is.
> > ...
>
> I have to agree with Tom. I think the numbers below show that with
> kernel read-ahead, block size isn't an issue.
>
> The big_file1 file used below is 2.0 gig of random data, and the
> machine has 512 mb of main memory. This ensures that we're not
> just getting cached data.
>
> foreach i (4k 8k 16k 32k 64k 128k)
> echo $i
> time dd bs=$i if=big_file1 of=/dev/null
> end
>
> and the results:
>
> bs user kernel elapsed
> 4k: 0.260 7.740 1:27.25
> 8k: 0.210 8.060 1:30.48
> 16k: 0.090 7.790 1:30.88
> 32k: 0.060 8.090 1:32.75
> 64k: 0.030 8.190 1:29.11
> 128k: 0.070 9.830 1:28.74
>
> so with kernel read-ahead, we have basically the same elapsed (wall
> time) regardless of block size. Sure, user time drops to a low at 64k
> blocksize, but kernel time is increasing.
>
>
> You could argue that this is a contrived example: no other I/O is
> being done. Well, I created a second 2.0g file (big_file2) and did two
> simultaneous reads from the same disk. Sure, performance went to hell,
> but it shows block size is still irrelevant in a multi-I/O environment
> with sequential read-ahead.
>
> foreach i ( 4k 8k 16k 32k 64k 128k )
> echo $i
> time dd bs=$i if=big_file1 of=/dev/null &
> time dd bs=$i if=big_file2 of=/dev/null &
> wait
> end
>
> bs user kernel elapsed
> 4k: 0.480 8.290 6:34.13 bigfile1
> 0.320 8.730 6:34.33 bigfile2
> 8k: 0.250 7.580 6:31.75
> 0.180 8.450 6:31.88
> 16k: 0.150 8.390 6:32.47
> 0.100 7.900 6:32.55
> 32k: 0.190 8.460 6:24.72
> 0.060 8.410 6:24.73
> 64k: 0.060 9.350 6:25.05
> 0.150 9.240 6:25.13
> 128k: 0.090 10.610 6:33.14
> 0.110 11.320 6:33.31
>
>
> The differences in read times are basically in the mud. Block size
> just doesn't matter much with the kernel doing read-ahead.
>
> -Kyle
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
From cjs@cynic.net Thu Apr 25 22:27:23 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2RL411868
for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:27:22 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
by angelic.cynic.net (Postfix) with ESMTP
id AF60C870E; Fri, 26 Apr 2002 11:27:17 +0900 (JST)
Date: Fri, 26 Apr 2002 11:27:17 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
Message-ID: <Pine.NEB.4.43.0204261028110.449-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR
On Thu, 25 Apr 2002, Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > 1. Theoretical proof: two components of the delay in retrieving a
> > block from disk are the disk arm movement and the wait for the
> > right block to rotate under the head.
>
> > When retrieving, say, eight adjacent blocks, these will be spread
> > across no more than two cylinders (with luck, only one).
>
> Weren't you contending earlier that with modern disk mechs you really
> have no idea where the data is?
No, that was someone else. I contend that with pretty much any
large-scale storage mechanism (i.e., anything beyond ramdisks),
you will find that accessing two adjacent blocks is almost always
1) close to as fast as accessing just the one, and 2) much, much
faster than accessing two blocks that are relatively far apart.
There will be the odd case where the two adjacent blocks are
physically far apart, but this is rare.
If this idea doesn't hold true, the whole idea that sequential
reads are faster than random reads falls apart, and the optimizer
shouldn't even have the option to make random reads cost more, much
less have it set to four rather than one (or whatever it's set to).
> You're asserting as an article of
> faith that the OS has been able to place the file's data blocks
> optimally --- or at least well enough to avoid unnecessary seeks.
So are you, in the optimizer. But that's all right; the OS often
can and does do this placement; the FFS filesystem is explicitly
designed to do this sort of thing. If the filesystem isn't empty
and the files grow a lot they'll be split into large fragments,
but the fragments will be contiguous.
> But just a few days ago I was getting told that random_page_cost
> was BS because there could be no such placement.
I've been arguing against that point as well.
> And also ensure that you aren't testing the point at issue.
> The point at issue is that *in the presence of kernel read-ahead*
> it's quite unclear that there's any benefit to a larger request size.
I will test this.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC