From 543539bb35c45e2b8783daf296c83c3696ad2834 Mon Sep 17 00:00:00 2001
From: Bruce Momjian <bruce@momjian.us>
Date: Mon, 26 Aug 2002 23:14:15 +0000
Subject: [PATCH] Add discussion of pre-write pages to WAL.

---
 doc/TODO.detail/wal | 2700 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 2700 insertions(+)
 create mode 100644 doc/TODO.detail/wal

diff --git a/doc/TODO.detail/wal b/doc/TODO.detail/wal
new file mode 100644
index 0000000000..6842d8daab
--- /dev/null
+++ b/doc/TODO.detail/wal
@@ -0,0 +1,2700 @@
+From cjs@cynic.net Thu Jun 20 22:18:27 2002
+Return-path: <cjs@cynic.net>
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5L2IPo22195
+	for <pgman@candle.pha.pa.us>; Thu, 20 Jun 2002 22:18:26 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+	by academic.cynic.net (Postfix) with ESMTP
+	id 88216F821; Fri, 21 Jun 2002 02:18:17 +0000 (UTC)
+Date: Fri, 21 Jun 2002 11:18:14 +0900 (JST)
+From: Curt Sampson <cjs@cynic.net>
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL-development <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us>
+Message-ID: <Pine.NEB.4.43.0206211106390.437-100000@angelic.cynic.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On Thu, 20 Jun 2002, Bruce Momjian wrote:
+
+> > MS SQL Server has an interesting way of dealing with this. They have a
+> > "torn" bit in each 512-byte chunk of a page, and this bit is set the
+> > same for each chunk. When they are about to write out a page, they first
+> > flip all of the torn bits and then do the write. If the write does not
+> > complete due to a system crash or whatever, this can be detected later
+> > because the torn bits won't match across the entire page.
+>
+> I was wondering, how does knowing the block is corrupt help MS SQL?
+
+I'm trying to recall, but I can't offhand. I'll have to look it
+up in my Inside SQL Server book, which is at home right now,
+unfortunately. I'll bring the book into work and let you know the
+details later.
+
+> Right now, we write changed pages to WAL, then later write them to disk.
+
+Ah. You write the entire page? MS writes only the changed tuple.
+And DB2, in fact, goes one better and writes only the part of the
+tuple up to the change, IIRC. Thus, if you put smaller and/or more
+frequently changed columns first, you'll have smaller logs.
+
+> I have always been looking for a way to prevent these WAL writes.  The
+> 512-byte bit seems interesting, but how does it help?
+
+Well, this would at least let you reduce the write to the 512-byte
+chunk that changed, rather than writing the entire 8K page.
+
+> And how does the bit help them with partial block writes?  Is the bit at
+> the end of the block?  Is that reliable?
+
+The bit is somewhere within every 512 byte "disk page" within the
+8192 byte "filesystem/database page." So an 8KB page is divided up
+like this:
+
+    | <----------------------- 8 Kb ----------------------> |
+
+    | 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b |
+
+Thus, the tear bits start out like this:
+
+    |  0   |  0   |  0   |  0   |  0   |  0   |  0   |  0   |
+
+After a successful write of the entire page, you have this:
+
+    |  1   |  1   |  1   |  1   |  1   |  1   |  1   |  1   |
+
+If the write is unsuccessful, you end up with something like this:
+
+    |  1   |  1   |  1   |  1   |  1   |  0   |  0   |  0   |
+
+And now you know which parts of your page got written, and which
+parts didn't.
+
+cjs
+-- 
+Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+    Don't you know, in this new Dark Age, we're all light.  --XTC
+
+
+From cjs@cynic.net Sat Jun 22 04:41:54 2002
+Return-path: <cjs@cynic.net>
+Received: from academic.cynic.net ([63.144.177.3])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5M8fpF04711
+	for <pgman@candle.pha.pa.us>; Sat, 22 Jun 2002 04:41:53 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+	by academic.cynic.net (Postfix) with ESMTP
+	id 415C8F820; Sat, 22 Jun 2002 08:41:33 +0000 (UTC)
+Date: Sat, 22 Jun 2002 17:41:30 +0900 (JST)
+From: Curt Sampson <cjs@cynic.net>
+To: Tom Lane <tgl@sss.pgh.pa.us>
+cc: Bruce Momjian <pgman@candle.pha.pa.us>, Michael Loftis <mloftis@wgops.com>,
+   mlw <markw@mohawksoft.com>,
+   PostgreSQL-development <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
+In-Reply-To: <19332.1024668861@sss.pgh.pa.us>
+Message-ID: <Pine.NEB.4.43.0206221731130.1091-100000@angelic.cynic.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On Fri, 21 Jun 2002, Tom Lane wrote:
+
+> Curt Sampson <cjs@cynic.net> writes:
+> > And now you know which parts of your page got written, and which
+> > parts didn't.
+>
+> Yes ... and what do you *do* about it?
+
+Ok. Here's the extract from _Inside Microsoft SQL Server 7.0_, page 207:
+
+    torn page detection   When TRUE, this option causes a bit to be
+	flipped for each 512-byte sector in a database page (8 KB)
+	whenever the page is written to disk.  This option allows
+	SQL Server to detect incomplete I/O operations caused by
+	power failures or other system outages. If a bit is in the
+	wrong state when the page is later read by SQL Server, this
+	means the page was written incorrectly; a torn page has
+	been detected. Although SQL Server database pages are 8
+	KB, disks perform I/O operations using 512-byte sectors.
+	Therefore, 16 sectors are written per database page.  A
+	torn page can occur if the system crashes (for example,
+	because of power failure) between the time the operating
+	system writes the first 512-byte sector to disk and the
+	completion of the 8-KB I/O operation.  If the first sector
+	of a database page is successfully written before the crash,
+	it will appear that the database page on disk was updated,
+	although it might not have succeeded. Using battery-backed
+	disk caches can ensure that data is [sic] successfully
+	written to disk or not written at all. In this case, don't
+	set torn page detection to TRUE, as it isn't needed. If a
+	torn page is detected, the database will need to be restored
+	from backup because it will be physically inconsistent.
+
+As I understand it, this is not a problem for postgres because the
+entire page is written to the log. So postgres is safe, but quite
+inefficient. (It would be much more efficient to write just the
+changed tuple, or even just the changed values within the tuple,
+to the log.)
+
+Adding these torn bits would allow postgres at least to write to
+the log just the 512-byte sectors that have changed, rather than
+the entire 8 KB page.
+
+cjs
+-- 
+Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+    Don't you know, in this new Dark Age, we're all light.  --XTC
+
+
+From pgsql-hackers-owner+M24060@postgresql.org Sat Jun 22 18:31:21 2002
+Return-path: <pgsql-hackers-owner+M24060@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5MMVKF20014
+	for <pgman@candle.pha.pa.us>; Sat, 22 Jun 2002 18:31:20 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id 0ADFE476090; Sat, 22 Jun 2002 18:31:10 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 6B372475A96; Sat, 22 Jun 2002 18:28:42 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 47AD2475935
+	for <pgsql-hackers@postgresql.org>; Sat, 22 Jun 2002 18:28:40 -0400 (EDT)
+Received: from hades.usol.com (hades.usol.com [208.232.58.41])
+	by postgresql.org (Postfix) with ESMTP id 1D5DA476166
+	for <pgsql-hackers@postgresql.org>; Sat, 22 Jun 2002 18:23:16 -0400 (EDT)
+Received: from 01-081.024.popsite.net (01-081.024.popsite.net [216.126.160.81])
+	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5MMMOj11344;
+	Sat, 22 Jun 2002 18:22:25 -0400
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+From: "J. R. Nield" <jrnield@usol.com>
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: Curt Sampson <cjs@cynic.net>, Michael Loftis <mloftis@wgops.com>,
+   mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
+   Tom Lane <tgl@sss.pgh.pa.us>
+In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us>
+References: <200206210158.g5L1wFk20118@candle.pha.pa.us>
+Content-Type: text/plain
+Content-Transfer-Encoding: 7bit
+Message-ID: <1024784514.1793.242.camel@localhost.localdomain>
+MIME-Version: 1.0
+X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
+Date: 22 Jun 2002 18:22:58 -0400
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: ORr
+
+On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote:
+> I was wondering, how does knowing the block is corrupt help MS SQL? 
+> Right now, we write changed pages to WAL, then later write them to disk.
+> I have always been looking for a way to prevent these WAL writes.  The
+> 512-byte bit seems interesting, but how does it help?
+> 
+> And how does the bit help them with partial block writes?  Is the bit at
+> the end of the block?  Is that reliable?
+> 
+
+My understanding of this is as follows:
+
+1) On most commercial systems, if you get a corrupted block (from
+partial write or whatever) you need to restore the file(s) from the most
+recent backup, and replay the log from the log archive (usually only the
+damaged files will be written to during replay). 
+
+2) If you can't deal with the downtime to recover the file, then EMC,
+Sun, or IBM will sell you an expensive disk array with an NVRAM cache
+that will do atomic writes. Some plain-vanilla SCSI disks are also
+capable of atomic writes, though usually they don't use NVRAM to do it. 
+
+The database must then make sure that each page-write gets translated
+into exactly one SCSI-level write. This is one reason why ORACLE and
+Sybase recommend that you use raw disk partitions for high availability.
+Some operating systems support this through the filesystem, but it is OS
+dependent. I think Solaris 7 & 8 have support for this, but I'm not sure.
+
+PostgreSQL has trouble because it can neither archive logs for replay,
+nor use raw disk partitions.
+
+
+One other point:
+
+Page pre-image logging is fundamentally the same as what Jim Gray's
+book[1] would call "careful writes". I don't believe they should be in
+the XLOG, because we never need to keep the pre-images after we're sure
+the buffer has made it to the disk. Instead, we should have the buffer
+IO routines implement ping-pong writes of some kind if we want
+protection from partial writes.
+
+
+Does any of this make sense?
+
+
+
+;jrnield
+
+
+[1] Gray, J. and Reuter, A. (1993). "Transaction Processing: Concepts
+	and Techniques". Morgan Kaufmann.
+
+-- 
+J. R. Nield
+jrnield@usol.com
+
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 4: Don't 'kill -9' the postmaster
+
+From pgsql-hackers-owner+M24068@postgresql.org Sun Jun 23 08:40:27 2002
+Return-path: <pgsql-hackers-owner+M24068@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCeQF01601
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 08:40:27 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id 8AC4B475CBC; Sun, 23 Jun 2002 08:40:22 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 4683647599D; Sun, 23 Jun 2002 08:37:40 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 0D57847592A
+	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 08:37:38 -0400 (EDT)
+Received: from hades.usol.com (hades.usol.com [208.232.58.41])
+	by postgresql.org (Postfix) with ESMTP id 75326475876
+	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 08:37:36 -0400 (EDT)
+Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32])
+	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111;
+	Sun, 23 Jun 2002 08:37:23 -0400
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+From: "J. R. Nield" <jrnield@usol.com>
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: Curt Sampson <cjs@cynic.net>, Michael Loftis <mloftis@wgops.com>,
+   mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
+   Tom Lane <tgl@sss.pgh.pa.us>
+In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us>
+References: <200206222317.g5MNHBn23427@candle.pha.pa.us>
+Content-Type: text/plain
+Content-Transfer-Encoding: 7bit
+X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
+Date: 23 Jun 2002 08:37:53 -0400
+Message-ID: <1024835880.1793.264.camel@localhost.localdomain>
+MIME-Version: 1.0
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: OR
+
+On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
+> J. R. Nield wrote:
+> > One other point:
+> > 
+> > Page pre-image logging is fundamentally the same as what Jim Gray's
+> > book[1] would call "careful writes". I don't believe they should be in
+> > the XLOG, because we never need to keep the pre-images after we're sure
+> > the buffer has made it to the disk. Instead, we should have the buffer
+> > IO routines implement ping-pong writes of some kind if we want
+> > protection from partial writes.
+> 
+> Ping-pong writes to where?  We have to fsync, and rather than fsync that
+> area and WAL, we just do WAL.  Not sure about a win there.
+> 
+
+The key question is: do we have some method to ensure that the OS
+doesn't do the writes in parallel?
+
+If the OS will ensure that one of the two block writes of a ping-pong
+completes before the other starts, then we don't need to fsync() at 
+all. 
+
+The only thing we are protecting against is the possibility of both
+writes being partial. If neither is done, that's fine because WAL will
+protect us. If the first write is partial, we will detect that and use
+the old data from the other, then recover from WAL. If the first is
+complete but the second is partial, then we detect that and use the
+newer block from the first write. If the second is complete but the
+first is partial, we detect that and use the newer block from the second
+write.
+
+So does anyone know a way to prevent parallel writes in one of the
+common unix standards? Do they say anything about this?
+
+It would seem to me that if the same process does both ping-pong writes,
+then there should be a cheap way to enforce a serial order. I could be
+wrong though.
+
+As to where the first block of the ping-pong should go, maybe we could
+reserve a file with nBlocks space for them, and write the information
+about which block was being written to the XLOG for use in recovery.
+There are many other ways to do it.
+
+;jrnield
+
+-- 
+J. R. Nield
+jrnield@usol.com
+
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 6: Have you searched our list archives?
+
+http://archives.postgresql.org
+
+From jrnield@usol.com Sun Jun 23 08:37:30 2002
+Return-path: <jrnield@usol.com>
+Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCbRF28741
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 08:37:28 -0400 (EDT)
+Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32])
+	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111;
+	Sun, 23 Jun 2002 08:37:23 -0400
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+From: "J. R. Nield" <jrnield@usol.com>
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: Curt Sampson <cjs@cynic.net>, Michael Loftis <mloftis@wgops.com>,
+   mlw
+  <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
+   Tom Lane <tgl@sss.pgh.pa.us>
+In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us>
+References: <200206222317.g5MNHBn23427@candle.pha.pa.us>
+Content-Type: text/plain
+Content-Transfer-Encoding: 7bit
+X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
+Date: 23 Jun 2002 08:37:53 -0400
+Message-ID: <1024835880.1793.264.camel@localhost.localdomain>
+MIME-Version: 1.0
+Status: OR
+
+On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
+> J. R. Nield wrote:
+> > One other point:
+> > 
+> > Page pre-image logging is fundamentally the same as what Jim Gray's
+> > book[1] would call "careful writes". I don't believe they should be in
+> > the XLOG, because we never need to keep the pre-images after we're sure
+> > the buffer has made it to the disk. Instead, we should have the buffer
+> > IO routines implement ping-pong writes of some kind if we want
+> > protection from partial writes.
+> 
+> Ping-pong writes to where?  We have to fsync, and rather than fsync that
+> area and WAL, we just do WAL.  Not sure about a win there.
+> 
+
+The key question is: do we have some method to ensure that the OS
+doesn't do the writes in parallel?
+
+If the OS will ensure that one of the two block writes of a ping-pong
+completes before the other starts, then we don't need to fsync() at 
+all. 
+
+The only thing we are protecting against is the possibility of both
+writes being partial. If neither is done, that's fine because WAL will
+protect us. If the first write is partial, we will detect that and use
+the old data from the other, then recover from WAL. If the first is
+complete but the second is partial, then we detect that and use the
+newer block from the first write. If the second is complete but the
+first is partial, we detect that and use the newer block from the second
+write.
+
+So does anyone know a way to prevent parallel writes in one of the
+common unix standards? Do they say anything about this?
+
+It would seem to me that if the same process does both ping-pong writes,
+then there should be a cheap way to enforce a serial order. I could be
+wrong though.
+
+As to where the first block of the ping-pong should go, maybe we could
+reserve a file with nBlocks space for them, and write the information
+about which block was being written to the XLOG for use in recovery.
+There are many other ways to do it.
+
+;jrnield
+
+-- 
+J. R. Nield
+jrnield@usol.com
+
+
+
+
+From cjs@cynic.net Sun Jun 23 09:33:29 2002
+Return-path: <cjs@cynic.net>
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NDXSF11543
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 09:33:28 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+	by academic.cynic.net (Postfix) with ESMTP
+	id A83ABF820; Sun, 23 Jun 2002 13:33:15 +0000 (UTC)
+Date: Sun, 23 Jun 2002 22:33:07 +0900 (JST)
+From: Curt Sampson <cjs@cynic.net>
+To: "J. R. Nield" <jrnield@usol.com>
+cc: Bruce Momjian <pgman@candle.pha.pa.us>, Michael Loftis <mloftis@wgops.com>,
+   mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
+   Tom Lane <tgl@sss.pgh.pa.us>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <1024835880.1793.264.camel@localhost.localdomain>
+Message-ID: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On 23 Jun 2002, J. R. Nield wrote:
+
+> On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
+> > J. R. Nield wrote:
+> > > One other point:
+> > >
+> > > Page pre-image logging is fundamentally the same as what Jim Gray's
+> > > book[1] would call "careful writes". I don't believe they should be in
+> > > the XLOG, because we never need to keep the pre-images after we're sure
+> > > the buffer has made it to the disk. Instead, we should have the buffer
+> > > IO routines implement ping-pong writes of some kind if we want
+> > > protection from partial writes.
+> >
+> > Ping-pong writes to where?  We have to fsync, and rather than fsync that
+> > area and WAL, we just do WAL.  Not sure about a win there.
+
+Presumably the win is that, "we never need to keep the pre-images
+after we're sure the buffer has made it to the disk." So the
+pre-image log can be completely ditched when we shut down the
+server, do a full system sync, or whatever. This keeps the log file
+size down, which means faster recovery, less to back up (when we
+start getting transaction logs that can be backed up), etc.
+
+This should also allow us to disable completely the ping-pong writes
+if we have a disk subsystem that we trust. (E.g., a disk array with
+battery backed memory.) That would, in theory, produce a nice little
+performance increase when lots of inserts and/or updates are being
+committed, as we have much, much less to write to the log file.
+
+Are there stats that track, e.g., the bandwidth of writes to the
+log file? I'd be interested in knowing just what kind of savings
+one might see by doing this.
+
+> The key question is: do we have some method to ensure that the OS
+> doesn't do the writes in parallel?...
+> It would seem to me that if the same process does both ping-pong writes,
+> then there should be a cheap way to enforce a serial order. I could be
+> wrong though.
+
+Well, whether or not there's a cheap way depends on whether you consider
+fsync to be cheap. :-)
+
+cjs
+-- 
+Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+    Don't you know, in this new Dark Age, we're all light.  --XTC
+
+
+From pgsql-hackers-owner+M24073@postgresql.org Sun Jun 23 11:19:59 2002
+Return-path: <pgsql-hackers-owner+M24073@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NFJxF19785
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 11:19:59 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id 0BD5B475E79; Sun, 23 Jun 2002 11:19:55 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 5C0CB475D6A; Sun, 23 Jun 2002 11:19:50 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id E2353475C4B
+	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 11:19:47 -0400 (EDT)
+Received: from sss.pgh.pa.us (unknown [192.204.191.242])
+	by postgresql.org (Postfix) with ESMTP id 746F8475AEA
+	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 11:19:46 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5NFJF108464;
+	Sun, 23 Jun 2002 11:19:15 -0400 (EDT)
+To: Curt Sampson <cjs@cynic.net>
+cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
+In-Reply-To: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net> 
+References: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net>
+Comments: In-reply-to Curt Sampson <cjs@cynic.net>
+	message dated "Sun, 23 Jun 2002 22:33:07 +0900"
+Date: Sun, 23 Jun 2002 11:19:15 -0400
+Message-ID: <8461.1024845555@sss.pgh.pa.us>
+From: Tom Lane <tgl@sss.pgh.pa.us>
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: OR
+
+Curt Sampson <cjs@cynic.net> writes:
+> This should also allow us to disable completely the ping-pong writes
+> if we have a disk subsystem that we trust.
+
+If we have a disk subsystem we trust, we just disable fsync on the
+WAL and the performance issue largely goes away.
+
+I concur with Bruce: the reason we keep page images in WAL is to
+minimize the number of places we have to fsync, and thus the amount of
+head movement required for a commit.  Putting the page images elsewhere
+cannot be a win AFAICS.
+
+> Well, whether or not there's a cheap way depends on whether you consider
+> fsync to be cheap. :-)
+
+It's never cheap :-(
+
+			regards, tom lane
+
+---------------------------(end of broadcast)---------------------------
+TIP 5: Have you checked our extensive FAQ?
+
+http://www.postgresql.org/users-lounge/docs/faq.html
+
+From cjs@cynic.net Sun Jun 23 12:10:44 2002
+Return-path: <cjs@cynic.net>
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NGAgF22907
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 12:10:43 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+	by academic.cynic.net (Postfix) with ESMTP
+	id 57BFDF820; Sun, 23 Jun 2002 16:10:35 +0000 (UTC)
+Date: Mon, 24 Jun 2002 01:10:26 +0900 (JST)
+From: Curt Sampson <cjs@cynic.net>
+To: Tom Lane <tgl@sss.pgh.pa.us>
+cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
+In-Reply-To: <8461.1024845555@sss.pgh.pa.us>
+Message-ID: <Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On Sun, 23 Jun 2002, Tom Lane wrote:
+
+> Curt Sampson <cjs@cynic.net> writes:
+> > This should also allow us to disable completely the ping-pong writes
+> > if we have a disk subsystem that we trust.
+>
+> If we have a disk subsystem we trust, we just disable fsync on the
+> WAL and the performance issue largely goes away.
+
+No, you can't do this. If you don't fsync(), there's no guarantee
+that the write ever got out of the computer's buffer cache and to
+the disk subsystem in the first place.
+
+> I concur with Bruce: the reason we keep page images in WAL is to
+> minimize the number of places we have to fsync, and thus the amount of
+> head movement required for a commit.
+
+An fsync() does not necessarily cause head movement, or any real
+disk writes at all. If you're writing to many external disk arrays,
+for example, the fsync() ensures that the data are in the disk array's
+non-volatile or UPS-backed RAM, no more. The array might hold the data
+for quite some time before it actually writes it to disk.
+
+But you're right that it's faster, if you're going to write out changed
+pages and have the ping-pong file and the transaction log on the
+same disk, just to write out the entire page to the transaction log.
+
+So what we would really need to implement, if we wanted to be more
+efficient with trusted disk subsystems, would be the option of writing
+to the log only the changed row or changed part of the row, or writing
+the entire changed page. I don't know how hard this would be....
+
+> > Well, whether or not there's a cheap way depends on whether you consider
+> > fsync to be cheap. :-)
+>
+> It's never cheap :-(
+
+Actually, with a good external RAID system with non-volatile RAM,
+it's a good two to four orders of magnitude cheaper than writing to a
+directly connected disk that doesn't claim the write is complete until
+it's physically on disk. I'd say that it qualifies as at least "not
+expensive." Not that you want to do it more often than you have to
+anyway....
+
+cjs
+-- 
+Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+    Don't you know, in this new Dark Age, we're all light.  --XTC
+
+
+From jrnield@usol.com Sun Jun 23 13:56:59 2002
+Return-path: <jrnield@usol.com>
+Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NHusF00335
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 13:56:58 -0400 (EDT)
+Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77])
+	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NHunj18549;
+	Sun, 23 Jun 2002 13:56:49 -0400
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+From: "J. R. Nield" <jrnield@usol.com>
+To: Tom Lane <tgl@sss.pgh.pa.us>
+cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker
+  <pgsql-hackers@postgresql.org>
+In-Reply-To: <8461.1024845555@sss.pgh.pa.us>
+References: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net> 
+	<8461.1024845555@sss.pgh.pa.us>
+Content-Type: text/plain
+Content-Transfer-Encoding: 7bit
+X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
+Date: 23 Jun 2002 13:57:19 -0400
+Message-ID: <1024855044.1793.414.camel@localhost.localdomain>
+MIME-Version: 1.0
+Status: ORr
+
+On Sun, 2002-06-23 at 11:19, Tom Lane wrote: 
+> Curt Sampson <cjs@cynic.net> writes:
+> > This should also allow us to disable completely the ping-pong writes
+> > if we have a disk subsystem that we trust.
+> 
+> If we have a disk subsystem we trust, we just disable fsync on the
+> WAL and the performance issue largely goes away.
+
+It wouldn't work because the OS buffering interferes, and we need those
+WAL records on disk up to the greatest LSN of the Buffer we will be writing.
+
+
+We already buffer WAL ourselves. We also already buffer regular pages.
+Whenever we write a Buffer out of the buffer cache, it is because we
+really want that page on disk and wanted to start an IO. If that's not
+the case, then we should have more block buffers! 
+
+So since we have all this buffering designed especially to meet our
+needs, and since the OS buffering is in the way, can someone explain to
+me why postgresql would ever open a file without the O_DSYNC flag if the
+platform supports it? 
+
+
+
+> 
+> I concur with Bruce: the reason we keep page images in WAL is to
+> minimize the number of places we have to fsync, and thus the amount of
+> head movement required for a commit.  Putting the page images elsewhere
+> cannot be a win AFAICS.
+
+
+Why not put all the page images in a single pre-allocated file and treat
+it as a ring? How could this be any worse than flushing them in the WAL
+log? 
+
+Maybe fsync would be slower with two files, but I don't see how
+fdatasync would be, and most platforms support that. 
+
+What would improve performance would be to have a dbflush process that
+would work in the background flushing buffers in groups and trying to
+stay ahead of ReadBuffer requests. That would let you do the temporary
+side of the ping-pong as a huge O_DSYNC writev(2) request (or
+fdatasync() once) and then write out the other buffers. It would also
+tend to prevent the other backends from blocking on write requests. 
+
+A dbflush could also support aio_read/aio_write on platforms like
+Solaris and WindowsNT that support it. 
+
+Am I correct that right now, buffers only get written when they get
+removed from the free list for reuse? So a released dirty buffer will
+sit in the buffer free list until it becomes the Least Recently Used
+buffer, and will then cause a backend to block for IO in a call to
+BufferAlloc? 
+
+This would explain why we like using the OS buffer cache, and why our
+performance is troublesome when we have to do synchronous IO writes, and
+why fsync() takes so long to complete. All of the backends block for
+each call to BufferAlloc() after a large table update by a single
+backend, and then the OS buffers are always full of our "written" data. 
+
+Am I reading the bufmgr code correctly? I already found an imaginary
+race condition there once :-) 
+
+;jnield 
+
+
+> 
+> > Well, whether or not there's a cheap way depends on whether you consider
+> > fsync to be cheap. :-)
+> 
+> It's never cheap :-(
+> 
+-- 
+J. R. Nield
+jrnield@usol.com
+
+
+From cjs@cynic.net Sun Jun 23 14:15:15 2002
+Return-path: <cjs@cynic.net>
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIFEF01698
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 14:15:15 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+	by academic.cynic.net (Postfix) with ESMTP
+	id 796E6F820; Sun, 23 Jun 2002 18:15:08 +0000 (UTC)
+Date: Mon, 24 Jun 2002 03:15:01 +0900 (JST)
+From: Curt Sampson <cjs@cynic.net>
+To: "J. R. Nield" <jrnield@usol.com>
+cc: Tom Lane <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain>
+Message-ID: <Pine.NEB.4.43.0206240307550.511-100000@angelic.cynic.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: ORr
+
+On 23 Jun 2002, J. R. Nield wrote:
+
+> So since we have all this buffering designed especially to meet our
+> needs, and since the OS buffering is in the way, can someone explain to
+> me why postgresql would ever open a file without the O_DSYNC flag if the
+> platform supports it?
+
+It's more code, if there are platforms out there that don't support
+O_DSYNC. (We still have to keep the old fsync code.) On the other hand,
+O_DSYNC could save us a disk arm movement over fsync() because it
+appears to me that fsync is also going to force a metadata update, which
+means that the inode blocks have to be written as well.
+
+> Maybe fsync would be slower with two files, but I don't see how
+> fdatasync would be, and most platforms support that.
+
+Because, if both files are on the same disk, you still have to move
+the disk arm from the cylinder at the current log file write point
+to the cylinder at the current ping-pong file write point. And then back
+again to the log file write point cylinder.
+
+In the end, having a ping-pong file as well seems to me unnecessary
+complexity, especially when anyone interested in really good
+performance is going to buy a disk subsystem that guarantees no
+torn pages and thus will want to turn off the ping-pong file writes
+entirely, anyway.
+
+cjs
+-- 
+Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+    Don't you know, in this new Dark Age, we're all light.  --XTC
+
+
+From jrnield@usol.com Sun Jun 23 14:14:51 2002
+Return-path: <jrnield@usol.com>
+Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIEnF01649
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 14:14:50 -0400 (EDT)
+Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77])
+	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NIEkj19287;
+	Sun, 23 Jun 2002 14:14:46 -0400
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+From: "J. R. Nield" <jrnield@usol.com>
+To: Curt Sampson <cjs@cynic.net>
+cc: Tom Lane <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker
+  <pgsql-hackers@postgresql.org>
+In-Reply-To: <Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net>
+References: <Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net>
+Content-Type: text/plain
+Content-Transfer-Encoding: 7bit
+X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
+Date: 23 Jun 2002 14:15:17 -0400
+Message-ID: <1024856120.3054.418.camel@localhost.localdomain>
+MIME-Version: 1.0
+Status: OR
+
+On Sun, 2002-06-23 at 12:10, Curt Sampson wrote:
+> 
+> So what we would really need to implement, if we wanted to be more
+> efficient with trusted disk subsystems, would be the option of writing
+> to the log only the changed row or changed part of the row, or writing
+> the entire changed page. I don't know how hard this would be....
+> 
+We already log that stuff. The page images are in addition to the
+"Logical Changes", so we could just stop logging the page images.
+
+-- 
+J. R. Nield
+jrnield@usol.com
+
+
+
+
+From pgsql-hackers-owner+M24100@postgresql.org Mon Jun 24 13:13:41 2002
+Return-path: <pgsql-hackers-owner+M24100@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OHDeF08564
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 13:13:40 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id 05602475CBE; Mon, 24 Jun 2002 13:11:10 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 13:11:10 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 929A247633B; Mon, 24 Jun 2002 09:26:54 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 962C147631A
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:31:43 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 08:31:43 2002
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+	by postgresql.org (Postfix) with ESMTP id C112D475C3C
+	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 15:35:20 -0400 (EDT)
+Received: (from pgman@localhost)
+	by candle.pha.pa.us (8.11.6/8.10.1) id g5NJYtL07449;
+	Sun, 23 Jun 2002 15:34:55 -0400 (EDT)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200206231934.g5NJYtL07449@candle.pha.pa.us>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain>
+To: "J. R. Nield" <jrnield@usol.com>
+Date: Sun, 23 Jun 2002 15:34:55 -0400 (EDT)
+cc: Tom Lane <tgl@sss.pgh.pa.us>, Curt Sampson <cjs@cynic.net>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-3.4 required=5.0
+	tests=IN_REP_TO
+	version=2.30
+Status: OR
+
+J. R. Nield wrote:
+> So since we have all this buffering designed especially to meet our
+> needs, and since the OS buffering is in the way, can someone explain to
+> me why postgresql would ever open a file without the O_DSYNC flag if the
+> platform supports it? 
+
+We sync only WAL, not the other pages, except for the sync() call we do
+during checkpoint when we discard old WAL files.
+
+> > I concur with Bruce: the reason we keep page images in WAL is to
+> > minimize the number of places we have to fsync, and thus the amount of
+> > head movement required for a commit.  Putting the page images elsewhere
+> > cannot be a win AFAICS.
+> 
+> 
+> Why not put all the page images in a single pre-allocated file and treat
+> it as a ring? How could this be any worse than flushing them in the WAL
+> log? 
+> 
+> Maybe fsync would be slower with two files, but I don't see how
+> fdatasync would be, and most platforms support that. 
+
+We have fdatasync option for WAL in postgresql.conf.
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 4: Don't 'kill -9' the postmaster
+
+
+
+From pgsql-hackers-owner+M24091@postgresql.org Mon Jun 24 12:54:22 2002
+Return-path: <pgsql-hackers-owner+M24091@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGsMF07208
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 12:54:22 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id 7DB7947679D; Mon, 24 Jun 2002 09:48:51 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 09:48:51 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 3FD37476491; Mon, 24 Jun 2002 08:55:34 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 2769E4762E3
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:27:39 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 08:27:39 2002
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+	by postgresql.org (Postfix) with ESMTP id ED459475C61
+	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 15:37:08 -0400 (EDT)
+Received: (from pgman@localhost)
+	by candle.pha.pa.us (8.11.6/8.10.1) id g5NJasa07642;
+	Sun, 23 Jun 2002 15:36:54 -0400 (EDT)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200206231936.g5NJasa07642@candle.pha.pa.us>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <Pine.NEB.4.43.0206240307550.511-100000@angelic.cynic.net>
+To: Curt Sampson <cjs@cynic.net>
+Date: Sun, 23 Jun 2002 15:36:54 -0400 (EDT)
+cc: "J. R. Nield" <jrnield@usol.com>, Tom Lane <tgl@sss.pgh.pa.us>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-3.4 required=5.0
+	tests=IN_REP_TO
+	version=2.30
+Status: OR
+
+Curt Sampson wrote:
+> On 23 Jun 2002, J. R. Nield wrote:
+> 
+> > So since we have all this buffering designed especially to meet our
+> > needs, and since the OS buffering is in the way, can someone explain to
+> > me why postgresql would ever open a file without the O_DSYNC flag if the
+> > platform supports it?
+> 
+> It's more code, if there are platforms out there that don't support
+> O_DSYNC. (We still have to keep the old fsync code.) On the other hand,
+> O_DSYNC could save us a disk arm movement over fsync() because it
+> appears to me that fsync is also going to force a metadata update, which
+> means that the inode blocks have to be written as well.
+
+Again, see postgresql.conf:
+
+#wal_sync_method = fsync        # the default varies across platforms:
+#                               # fsync, fdatasync, open_sync, or open_datasync
+
+> 
+> > Maybe fsync would be slower with two files, but I don't see how
+> > fdatasync would be, and most platforms support that.
+> 
+> Because, if both files are on the same disk, you still have to move
+> the disk arm from the cylinder at the current log file write point
+> to the cylinder at the current ping-pong file write point. And then back
+> again to the log file write point cylinder.
+> 
+> In the end, having a ping-pong file as well seems to me unnecessary
+> complexity, especially when anyone interested in really good
+> performance is going to buy a disk subsystem that guarantees no
+> torn pages and thus will want to turn off the ping-pong file writes
+> entirely, anyway.
+
+Yes, I don't see writing to two files vs. one to be any win, especially
+when we need to fsync both of them.  What I would really like is to
+avoid the double I/O of writing to WAL and to the data file;  improving
+that would be a huge win.
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 4: Don't 'kill -9' the postmaster
+
+
+
+From cjs@cynic.net Sun Jun 23 20:09:44 2002
+Return-path: <cjs@cynic.net>
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O09hF00630
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 20:09:43 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+	by academic.cynic.net (Postfix) with ESMTP
+	id 6F45AF820; Mon, 24 Jun 2002 00:09:38 +0000 (UTC)
+Date: Mon, 24 Jun 2002 09:09:30 +0900 (JST)
+From: Curt Sampson <cjs@cynic.net>
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: "J. R. Nield" <jrnield@usol.com>, Tom Lane <tgl@sss.pgh.pa.us>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us>
+Message-ID: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On Sun, 23 Jun 2002, Bruce Momjian wrote:
+
+> Yes, I don't see writing to two files vs. one to be any win, especially
+> when we need to fsync both of them.  What I would really like is to
+> avoid the double I/O of writing to WAL and to the data file;  improving
+> that would be a huge win.
+
+You mean, the double I/O of writing the block to the WAL and data file?
+(We'd still have to write the changed columns or whatever to the WAL,
+right?)
+
+I'd just add an option to turn it off. If you need it, you need it;
+there's no way around that except to buy hardware that is really going
+to guarantee your writes (which then means you don't need it).
+
+cjs
+-- 
+Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+    Don't you know, in this new Dark Age, we're all light.  --XTC
+
+
+From jrnield@usol.com Sun Jun 23 21:28:58 2002
+Return-path: <jrnield@usol.com>
+Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O1SuF06381
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 21:28:57 -0400 (EDT)
+Received: from 01-072.024.popsite.net (01-072.024.popsite.net [216.126.160.72])
+	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5O1Ssj09303;
+	Sun, 23 Jun 2002 21:28:55 -0400
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+From: "J. R. Nield" <jrnield@usol.com>
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
+   Michael Loftis
+  <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker
+  <pgsql-hackers@postgresql.org>
+In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us>
+References: <200206231936.g5NJasa07642@candle.pha.pa.us>
+Content-Type: text/plain
+Content-Transfer-Encoding: 7bit
+X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
+Date: 23 Jun 2002 21:29:23 -0400
+Message-ID: <1024882167.1793.733.camel@localhost.localdomain>
+MIME-Version: 1.0
+Status: ORr
+
+On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
+> Yes, I don't see writing to two files vs. one to be any win, especially
+> when we need to fsync both of them.  What I would really like is to
+> avoid the double I/O of writing to WAL and to the data file;  improving
+> that would be a huge win.
+> 
+
+It is impossible to do what you want. You cannot protect against
+partial writes without writing pages twice and calling fdatasync between
+them while going through a generic filesystem. The best disk array will
+not protect you if the operating system does not align block writes to
+the structure of the underlying device. Even with raw devices, you need
+special support or knowledge of the operating system and/or the disk
+device to ensure that each write request will be atomic to the
+underlying hardware. 
+
+All other systems rely on the fact that you can recover a damaged file
+using the log archive. This means downtime in the rare case, but no data
+loss. Until PostgreSQL can do this, then it will not be acceptable for
+real critical production use. This is not to knock PostgreSQL, because
+it is a very good database system, and clearly the best open-source one.
+It even has feature advantages over the commercial systems. But at the
+end of the day, unless you have complete understanding of the I/O system
+from write(2) through to the disk system, the only sure ways to protect
+against partial writes are by "careful writes" (in the WAL log or
+elsewhere, writing pages twice), or by requiring (and allowing) users to
+do log-replay recovery when a file is corrupted by a partial write. As
+long as there is a UPS, and the operating system doesn't crash, then
+there still should be no partial writes.
+
+If we log pages to WAL, they are useless when archived (after a
+checkpoint). So either we have a separate "log" for them (the ping-pong
+file), or we should at least remove them when archived, which makes log
+archiving more complex but is perfectly doable.
+
+Finally, I would love to hear why we are using the operating system
+buffer manager at all. The OS is acting as a secondary buffer manager
+for us. Why is that? What flaw in our I/O system does this reveal? I
+know that:
+
+>We sync only WAL, not the other pages, except for the sync() call we do
+> during checkpoint when we discard old WAL files.
+
+But this is probably not a good thing. We should only be writing blocks
+when they need to be on disk. We should not be expecting the OS to write
+them "sometime later" and avoid blocking (as long) for the write. If we
+need that, then our buffer management is wrong and we need to fix it.
+The reason we are doing this is because we expect the OS buffer manager
+to do asynchronous I/O for us, but then we don't control the order. That
+is the reason why we have to call fdatasync(), to create "sequence
+points".
+
+The reason we have performance problems with either O_DSYNC or fdatasync
+on the normal relations is because we have no dbflush process. This
+causes an unacceptable amount of I/O blocking by other transactions.
+
+The ORACLE people were not kidding when they said that they could not
+certify Linux for production use until it supported O_DSYNC. Can you
+explain why that was the case?
+
+Finally, let me apologize if the above comes across as somewhat
+belligerent. I know very well that I can't compete with you guys for
+knowledge of the PostgreSQL system. I am still at a loss when I look at
+the optimizer and executor modules, and it will take some time before I
+can follow discussion of that area. Even then, I doubt my ability to
+compare with people like Mr. Lane and Mr. Momjian in experience and
+general intelligence, or in the field of database programming and
+software development in particular. However, this discussion and a
+search of the pgsql-hackers archives reveals this problem to be the KEY
+area of PostgreSQL's failing, and general misunderstanding, when
+compared to its commercial competitors.
+
+Sincerely, 
+
+	J. R. Nield
+
+-- 
+J. R. Nield
+jrnield@usol.com
+
+
+
+
+From pgsql-hackers-owner+M24090@postgresql.org Mon Jun 24 12:38:04 2002
+Return-path: <pgsql-hackers-owner+M24090@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGc3F05962
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 12:38:03 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id 81B9F4768DF; Mon, 24 Jun 2002 10:18:05 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 10:18:05 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 81F08476473; Mon, 24 Jun 2002 08:55:28 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id CDDFA475CC3
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:37:44 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 08:37:44 2002
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+	by postgresql.org (Postfix) with ESMTP id 5C971475858
+	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 22:47:12 -0400 (EDT)
+Received: (from pgman@localhost)
+	by candle.pha.pa.us (8.11.6/8.10.1) id g5O2ki712992;
+	Sun, 23 Jun 2002 22:46:44 -0400 (EDT)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200206240246.g5O2ki712992@candle.pha.pa.us>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain>
+To: "J. R. Nield" <jrnield@usol.com>
+Date: Sun, 23 Jun 2002 22:46:44 -0400 (EDT)
+cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-3.4 required=5.0
+	tests=IN_REP_TO
+	version=2.30
+Status: OR
+
+J. R. Nield wrote:
+> On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
+> > Yes, I don't see writing to two files vs. one to be any win, especially
+> > when we need to fsync both of them.  What I would really like is to
+> > avoid the double I/O of writing to WAL and to the data file;  improving
+> > that would be a huge win.
+> > 
+> 
+> It is impossible to do what you want. You cannot protect against
+> partial writes without writing pages twice and calling fdatasync between
+> them while going through a generic filesystem. The best disk array will
+> not protect you if the operating system does not align block writes to
+> the structure of the underlying device. Even with raw devices, you need
+> special support or knowledge of the operating system and/or the disk
+> device to ensure that each write request will be atomic to the
+> underlying hardware. 
+
+Yes, I suspected it was impossible, but that doesn't mean I want it any
+less.  ;-)
+
+> All other systems rely on the fact that you can recover a damaged file
+> using the log archive. This means downtime in the rare case, but no data
+> loss. Until PostgreSQL can do this, then it will not be acceptable for
+> real critical production use. This is not to knock PostgreSQL, because
+> it is a very good database system, and clearly the best open-source one.
+> It even has feature advantages over the commercial systems. But at the
+> end of the day, unless you have complete understanding of the I/O system
+> from write(2) through to the disk system, the only sure ways to protect
+> against partial writes are by "careful writes" (in the WAL log or
+> elsewhere, writing pages twice), or by requiring (and allowing) users to
+> do log-replay recovery when a file is corrupted by a partial write. As
+> long as there is a UPS, and the operating system doesn't crash, then
+> there still should be no partial writes.
+
+You are talking point-in-time recovery, a major missing feature right
+next to replication, and I agree it makes PostgreSQL unacceptable for
+some applications.  Point taken.
+
+And the interesting thing you are saying is that with point-in-time
+recovery, we don't need to write pre-write images of pages because if we
+detect a partial page write, we then abort the database and tell the
+user to do a point-in-time recovery, basically meaning we are using the
+previous full backup as our pre-write page image and roll forward using
+the logical logs.  This is clearly a nice thing to be able to do because
+it lets you take a pre-write image of the page once during full backup,
+keep it offline, and bring it back in the rare case of a full page write
+failure.  I now can see how the MSSQL "torn" bits would be used, not
+for recovery, but to detect a partial write and force a point-in-time
+recovery from the administrator.
+
+
+> If we log pages to WAL, they are useless when archived (after a
+> checkpoint). So either we have a separate "log" for them (the ping-pong
+> file), or we should at least remove them when archived, which makes log
+> archiving more complex but is perfectly doable.
+
+Yes, that is how we will do point-in-time recovery;  remove the
+pre-write page images and archive the rest.  It is more complex, but
+having the fsync all in one file is too big a win.
+
+> Finally, I would love to hear why we are using the operating system
+> buffer manager at all. The OS is acting as a secondary buffer manager
+> for us. Why is that? What flaw in our I/O system does this reveal? I
+> know that:
+> 
+> >We sync only WAL, not the other pages, except for the sync() call we do
+> > during checkpoint when we discard old WAL files.
+> 
+> But this is probably not a good thing. We should only be writing blocks
+> when they need to be on disk. We should not be expecting the OS to write
+> them "sometime later" and avoid blocking (as long) for the write. If we
+> need that, then our buffer management is wrong and we need to fix it.
+> The reason we are doing this is because we expect the OS buffer manager
+> to do asynchronous I/O for us, but then we don't control the order. That
+> is the reason why we have to call fdatasync(), to create "sequence
+> points".
+
+Yes.  I think I understand.  It is true we have to fsync WAL because we
+can't control the individual writes by the OS.
+
+> The reason we have performance problems with either O_DSYNC or fdatasync
+> on the normal relations is because we have no dbflush process. This
+> causes an unacceptable amount of I/O blocking by other transactions.
+
+Uh, that would force writes all over the disk. Why do we really care how
+the OS writes them?  If we are going to fsync, let's just do the one
+file and be done with it.  What would a separate flusher process really
+buy us if it has to use fsync too?  The main backend doesn't have to
+wait for the fsync, but then again, we can't say the transaction is
+committed until it hits the disk, so how does a flusher help?
+
+> The ORACLE people were not kidding when they said that they could not
+> certify Linux for production use until it supported O_DSYNC. Can you
+> explain why that was the case?
+
+I don't see O_DSYNC as very different from write/fsync (or fdatasync).
+
+> Finally, let me apologize if the above comes across as somewhat
+> belligerent. I know very well that I can't compete with you guys for
+> knowledge of the PostgreSQL system. I am still at a loss when I look at
+> the optimizer and executor modules, and it will take some time before I
+> can follow discussion of that area. Even then, I doubt my ability to
+> compare with people like Mr. Lane and Mr. Momjian in experience and
+> general intelligence, or in the field of database programming and
+> software development in particular. However, this discussion and a
+> search of the pgsql-hackers archives reveals this problem to be the KEY
+> area of PostgreSQL's failing, and general misunderstanding, when
+> compared to its commercial competitors.
+
+We appreciate your ideas.  Few of us are professional db folks so we are
+always looking for good ideas.
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 6: Have you searched our list archives?
+
+http://archives.postgresql.org
+
+
+
+From cjs@cynic.net Sun Jun 23 23:40:59 2002
+Return-path: <cjs@cynic.net>
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O3evF17903
+	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 23:40:58 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+	by academic.cynic.net (Postfix) with ESMTP
+	id 37F36F820; Mon, 24 Jun 2002 03:40:54 +0000 (UTC)
+Date: Mon, 24 Jun 2002 12:40:51 +0900 (JST)
+From: Curt Sampson <cjs@cynic.net>
+To: "J. R. Nield" <jrnield@usol.com>
+cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain>
+Message-ID: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On 23 Jun 2002, J. R. Nield wrote:
+
+> It is impossible to do what you want. You cannot protect against
+> partial writes without writing pages twice and calling fdatasync
+> between them while going through a generic filesystem.
+
+I agree with this.
+
+> The best disk array will not protect you if the operating system does
+> not align block writes to the structure of the underlying device.
+
+This I don't quite understand. Assuming you're using a SCSI drive
+(and this mostly applies to ATAPI/IDE, too), you can do naught but
+align block writes to the structure of the underlying device. When you
+initiate a SCSI WRITE command, you start by telling the device at which
+block to start writing and how many blocks you intend to write. Then you
+start passing the data.
+
+(See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter
+details for the SCSI WRITE(10) command. You may find the SCSI 2
+specification, at http://www.danbbs.dk/~dino/SCSI/ to be a useful
+reference here.)
+
+> Even with raw devices, you need special support or knowledge of the
+> operating system and/or the disk device to ensure that each write
+> request will be atomic to the underlying hardware.
+
+Well, so here I guess you're talking about two things:
+
+    1. When you request, say, an 8K block write, will the OS really
+    write it to disk in a single 8K or multiple of 8K SCSI write
+    command?
+
+    2. Does the SCSI device you're writing to consider these writes to
+    be transactional. That is, if the write is interrupted before being
+    completed, does the SCSI device guarantee that the partially-sent
+    data is not written, and the old data is maintained? And of course,
+    does it guarantee that, when it acknowledges a write, that write is
+    now in stable storage and will never go away?
+
+Both of these are not hard to guarantee, actually. For a BSD-based OS,
+for example, just make sure that your filesystem block size is the
+same as or a multiple of the database block size. BSD will never write
+anything other than a block or a sequence of blocks to a disk in a
+single SCSI transaction (unless you've got a really odd SCSI driver).
+And for your disk, buy a Baydel or Clarion disk array, or something
+similar.
+
+Given that it's not hard to set up a system that meets these criteria,
+and this is in fact commonly done for database servers, it would seem a
+good idea for postgres to have the option to take advantage of the time
+and money spent and adjust its performance upward appropriately.
+
+> All other systems rely on the fact that you can recover a damaged file
+> using the log archive.
+
+Not exactly. For MS SQL Server, at any rate, if it detects a page tear
+you cannot restore based on the log file alone. You need a full or
+partial backup that includes that entire torn block.
+
+> This means downtime in the rare case, but no data loss. Until
+> PostgreSQL can do this, then it will not be acceptable for real
+> critical production use.
+
+It seems to me that it is doing this right now. In fact, it's more
+reliable than some commercial systems (such as SQL Server) because it can
+recover from a torn block with just the logfile.
+
+> But at the end of the day, unless you have complete understanding of
+> the I/O system from write(2) through to the disk system, the only sure
+> ways to protect against partial writes are by "careful writes" (in
+> the WAL log or elsewhere, writing pages twice), or by requiring (and
+> allowing) users to do log-replay recovery when a file is corrupted by
+> a partial write.
+
+I don't understand how, without a copy of the old data that was in the
+torn block, you can restore that block from just log file entries. Can
+you explain this to me? Take, as an example, a block with ten tuples,
+only one of which has been changed "recently." (I.e., only that change
+is in the log files.)
+
+> If we log pages to WAL, they are useless when archived (after a
+> checkpoint). So either we have a separate "log" for them (the
+> ping-pong file), or we should at least remove them when archived,
+> which makes log archiving more complex but is perfectly doable.
+
+Right. That seems to me a better option, since we've now got only one
+write point on the disk rather than two.
+
+> Finally, I would love to hear why we are using the operating system
+> buffer manager at all. The OS is acting as a secondary buffer manager
+> for us. Why is that? What flaw in our I/O system does this reveal?
+
+It's acting as a "second-level" buffer manager, yes, but to say it's
+"secondary" may be a bit misleading. On most of the systems I've set
+up, the OS buffer cache is doing the vast majority of the work, and the
+postgres buffering is fairly minimal.
+
+There are some good (and some perhaps not-so-good) reasons to do it this
+way. I'll list them more or less in the order of best to worst:
+
+    1. The OS knows where the blocks physically reside on disk, and
+    postgres does not. Therefore it's in the interest of postgresql to
+    dispatch write responsibility back to the OS as quickly as possible
+    so that the OS can prioritize requests appropriately. Most operating
+    systems use an "elevator" algorithm to minimize disk head movement;
+    but if the OS does not have a block that it could write while the
+    head is "on the way" to another request, it can't write it in that
+    head pass.
+
+    2. Postgres does not know about any "bank-switching" tricks for
+    mapping more physical memory than it has address space. Thus, on
+    32-bit machines, postgres might be limited to mapping 2 or 3 GB of
+    memory, even though the machine has, say, 6 GB of physical RAM. The
+    OS can use all of the available memory for caching; postgres cannot.
+
+    3. A lot of work has been put into the seek algorithms, read-ahead
+    algorithms, block allocation algorithms, etc. in the OS. Why
+    duplicate all that work again in postgres?
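+
+[Illustration of the "elevator" ordering in point 1. This is a toy
+Python sketch, not any real kernel's I/O scheduler; the function names
+and block numbers are made up for the example.]
+
```python
def elevator_order(head, pending):
    """Return pending block numbers in one-way elevator (SCAN) order.

    Blocks at or beyond the head are served in ascending order on the
    way up; the rest are served in descending order on the way back.
    """
    upward = sorted(b for b in pending if b >= head)
    downward = sorted((b for b in pending if b < head), reverse=True)
    return upward + downward


def head_travel(head, order):
    """Total distance (in blocks) the head moves to serve `order`."""
    travel = 0
    for block in order:
        travel += abs(block - head)
        head = block
    return travel


requests = [90, 10, 55, 60, 20]  # arrival order, hypothetical blocks
fifo_cost = head_travel(50, requests)
elevator_cost = head_travel(50, elevator_order(50, requests))
print(fifo_cost, elevator_cost)
```
+
+[With these made-up numbers the head travels 210 blocks in arrival
+order and 120 in elevator order. A request that the OS has not yet been
+handed cannot be in `pending`, so it misses the pass.]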
+
+When you say things like the following:
+
+> We should only be writing blocks when they need to be on disk. We
+> should not be expecting the OS to write them "sometime later" and
+> avoid blocking (as long) for the write. If we need that, then our
+> buffer management is wrong and we need to fix it.
+
+you appear to be making the argument that we should take the route of
+other database systems, and use raw devices and our own management of
+disk block allocation. If so, you might want first to look back through
+the archives at the discussion I and several others had about this a
+month or two ago. After looking in detail at what NetBSD, at least, does
+in terms of its disk I/O algorithms and buffering, I've pretty much come
+around, at least for the moment, to the attitude that we should stick
+with using the OS. I wouldn't mind seeing postgres be able to manage all
+of this stuff, but it's a *lot* of work for not all that much benefit
+that I can see.
+
+> The ORACLE people were not kidding when they said that they could not
+> certify Linux for production use until it supported O_DSYNC. Can you
+> explain why that was the case?
+
+I'm suspecting it's because Linux at the time had no raw devices, so
+O_DSYNC was the only other possible method of making sure that disk
+writes actually got to disk.
+
+You certainly don't want to use O_DSYNC if you can use another method,
+because O_DSYNC still goes through the operating system's buffer
+cache, wasting memory and double-caching things. If you're doing your
+own management, you need either to use a raw device or open files with
+the flag that indicates that the buffer cache should not be used at all
+for reads from and writes to that file.
+
+> However, this discussion and a search of the pgsql-hackers archives
+> reveals this problem to be the KEY area of PostgreSQL's failing, and
+> general misunderstanding, when compared to its commercial competitors.
+
+No, I think it's just that you're under a few minor misapprehensions
+here about what postgres and the OS are actually doing. As I said, I
+went through this whole exact argument a month or two ago, on this very
+list, and I came around to the idea that what postgres is doing now
+works quite well, at least on NetBSD. (Most other OSes have disk I/O
+algorithms that are pretty much as good or better.) There might be a
+very slight advantage to doing all one's own I/O management, but it's
+a huge amount of work, and I think that much effort could be much more
+usefully applied to other areas.
+
+Just as a side note, I've been a NetBSD developer since about '96,
+and have been delving into the details of OS design since well before
+that time, so I'm coming to this with what I hope is reasonably good
+knowledge of how disks work and how operating systems use them. (Not
+that this should stop you from pointing out holes in my arguments. :-))
+
+cjs
+-- 
+Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+    Don't you know, in this new Dark Age, we're all light.  --XTC
+
+
+From pgsql-hackers-owner+M24112@postgresql.org Mon Jun 24 18:16:36 2002
+Return-path: <pgsql-hackers-owner+M24112@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OMGaF00910
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 18:16:36 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id A2EF1476475; Mon, 24 Jun 2002 16:43:38 -0400 (EDT)
+Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 16:43:38 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id BA57D476148; Mon, 24 Jun 2002 14:14:00 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 93D6A477214
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 13:59:17 -0400 (EDT)
+Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 13:59:17 2002
+Received: from sss.pgh.pa.us (unknown [192.204.191.242])
+	by postgresql.org (Postfix) with ESMTP id D70AA476401
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 10:06:26 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OE6J117666;
+	Mon, 24 Jun 2002 10:06:19 -0400 (EDT)
+To: Curt Sampson <cjs@cynic.net>
+cc: Bruce Momjian <pgman@candle.pha.pa.us>, "J. R. Nield" <jrnield@usol.com>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
+In-Reply-To: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net> 
+References: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net>
+Comments: In-reply-to Curt Sampson <cjs@cynic.net>
+	message dated "Mon, 24 Jun 2002 09:09:30 +0900"
+Date: Mon, 24 Jun 2002 10:06:19 -0400
+Message-ID: <17663.1024927579@sss.pgh.pa.us>
+From: Tom Lane <tgl@sss.pgh.pa.us>
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-5.3 required=5.0
+	tests=IN_REP_TO,X_NOT_PRESENT
+	version=2.30
+Status: OR
+
+> On Sun, 23 Jun 2002, Bruce Momjian wrote:
+>> Yes, I don't see writing to two files vs. one to be any win, especially
+>> when we need to fsync both of them.  What I would really like is to
+>> avoid the double I/O of writing to WAL and to the data file;  improving
+>> that would be a huge win.
+
+I don't believe it's possible to eliminate the double I/O.  Keep in mind
+though that in the ideal case (plenty of shared buffers) you are only
+paying two writes per modified block per checkpoint interval --- one to
+the WAL during the first write of the interval, and then a write to the
+real datafile issued by the checkpoint process.  Anything that requires
+transaction commits to write data blocks will likely result in more I/O
+not less, at least for blocks that are modified by several successive
+transactions.
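+
+[Illustration of the write counts described above. This is a toy Python
+model, not PostgreSQL code: it counts only block-sized writes, ignores
+the small per-change WAL records, and assumes the block stays in shared
+buffers for the whole checkpoint interval. The function name is
+hypothetical.]
+
```python
def block_writes_per_interval(modifications, write_block_at_commit):
    """Count block-sized writes for one data block in one checkpoint interval."""
    if modifications == 0:
        return 0
    if write_block_at_commit:
        # Every committing transaction forces the data block to disk.
        return modifications
    # One page image in WAL on the first change, plus one write of the
    # data block by the checkpoint.
    return 2


for n in (1, 2, 10):
    print(n,
          block_writes_per_interval(n, write_block_at_commit=False),
          block_writes_per_interval(n, write_block_at_commit=True))
```
+
+[Under these assumptions the WAL scheme stays at two block writes no
+matter how many transactions touch the block, while writing the block at
+each commit grows with the number of transactions.]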
+
+The only thing I've been able to think of that seems like it might
+improve matters is to make the WAL writing logic aware of the layout
+of buffer pages --- specifically, to know that our pages generally
+contain an uninteresting "hole" in the middle, and not write the hole.
+Optimistically this might reduce the WAL data volume by something
+approaching 50%; though pessimistically (if most pages are near full)
+it wouldn't help much.
+
+This was not very feasible when the WAL code was designed because the
+buffer manager needed to cope with both normal pages and pg_log pages,
+but as of 7.2 I think it'd be safe to assume that all pages have the
+standard layout.
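+
+[Illustration of skipping the "hole". This is a toy Python sketch, not
+the WAL code: it assumes an 8K page whose unused middle is bounded by
+two offsets, called `lower` and `upper` here, and that both offsets are
+stored alongside the logged image so the page can be rebuilt.]
+
```python
PAGE_SIZE = 8192  # assumed page size for the example


def pack_page(page, lower, upper):
    """Drop the unused bytes page[lower:upper] before logging the page."""
    assert len(page) == PAGE_SIZE and 0 <= lower <= upper <= PAGE_SIZE
    return page[:lower] + page[upper:]


def unpack_page(packed, lower, upper):
    """Rebuild a full page from a packed image, zero-filling the hole."""
    hole = upper - lower
    assert len(packed) == PAGE_SIZE - hole
    return packed[:lower] + bytes(hole) + packed[lower:]


# Half-empty page: 1 KB in use at the front, 3 KB in use at the end.
lower, upper = 1024, PAGE_SIZE - 3072
page = b"\x01" * lower + bytes(upper - lower) + b"\x02" * 3072
packed = pack_page(page, lower, upper)
print(len(page), len(packed))
```
+
+[For this half-empty page the logged image is half the page size; for a
+nearly full page `upper - lower` is small and little is saved. The
+round trip is exact only if the hole really contains nothing of
+interest, since it is restored as zeros.]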
+
+			regards, tom lane
+
+
+From pgsql-hackers-owner+M24116@postgresql.org Mon Jun 24 20:32:07 2002
+Return-path: <pgsql-hackers-owner+M24116@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P0W7F10985
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 20:32:07 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id EBCE547632E; Mon, 24 Jun 2002 18:54:34 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 18:54:34 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 3EB93476D85; Mon, 24 Jun 2002 17:12:18 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id EBC20476E2E
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 14:54:40 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 14:54:40 2002
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+	by postgresql.org (Postfix) with ESMTP id 1C8874760C2
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 12:40:53 -0400 (EDT)
+Received: (from pgman@localhost)
+	by candle.pha.pa.us (8.11.6/8.10.1) id g5OGeVY06116;
+	Mon, 24 Jun 2002 12:40:31 -0400 (EDT)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200206241640.g5OGeVY06116@candle.pha.pa.us>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <17663.1024927579@sss.pgh.pa.us>
+To: Tom Lane <tgl@sss.pgh.pa.us>
+Date: Mon, 24 Jun 2002 12:40:31 -0400 (EDT)
+cc: Curt Sampson <cjs@cynic.net>, "J. R. Nield" <jrnield@usol.com>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-3.4 required=5.0
+	tests=IN_REP_TO
+	version=2.30
+Status: OR
+
+Tom Lane wrote:
+> > On Sun, 23 Jun 2002, Bruce Momjian wrote:
+> >> Yes, I don't see writing to two files vs. one to be any win, especially
+> >> when we need to fsync both of them.  What I would really like is to
+> >> avoid the double I/O of writing to WAL and to the data file;  improving
+> >> that would be a huge win.
+> 
+> I don't believe it's possible to eliminate the double I/O.  Keep in mind
+> though that in the ideal case (plenty of shared buffers) you are only
+> paying two writes per modified block per checkpoint interval --- one to
+> the WAL during the first write of the interval, and then a write to the
+> real datafile issued by the checkpoint process.  Anything that requires
+> transaction commits to write data blocks will likely result in more I/O
+> not less, at least for blocks that are modified by several successive
+> transactions.
+> 
+> The only thing I've been able to think of that seems like it might
+> improve matters is to make the WAL writing logic aware of the layout
+> of buffer pages --- specifically, to know that our pages generally
+> contain an uninteresting "hole" in the middle, and not write the hole.
+> Optimistically this might reduce the WAL data volume by something
+> approaching 50%; though pessimistically (if most pages are near full)
+> it wouldn't help much.
+
+Good idea.  How about putting the page through our TOAST compression
+routine before writing it to WAL?  Should be pretty easy and fast and
+doesn't require any knowledge of the page format.
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+
+From pgsql-hackers-owner+M24114@postgresql.org Mon Jun 24 17:54:35 2002
+Return-path: <pgsql-hackers-owner+M24114@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLsZF28642
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:54:35 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id BD68F47683C; Mon, 24 Jun 2002 16:46:24 -0400 (EDT)
+Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 16:46:24 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id B2719476B31; Mon, 24 Jun 2002 16:01:51 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 950004770BC
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 14:59:46 -0400 (EDT)
+Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 14:59:46 2002
+Received: from sss.pgh.pa.us (unknown [192.204.191.242])
+	by postgresql.org (Postfix) with ESMTP id A0756475BB7
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 13:11:41 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OHB1119826;
+	Mon, 24 Jun 2002 13:11:02 -0400 (EDT)
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: Curt Sampson <cjs@cynic.net>, "J. R. Nield" <jrnield@usol.com>,
+   Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
+In-Reply-To: <200206241640.g5OGeVY06116@candle.pha.pa.us> 
+References: <200206241640.g5OGeVY06116@candle.pha.pa.us>
+Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
+	message dated "Mon, 24 Jun 2002 12:40:31 -0400"
+Date: Mon, 24 Jun 2002 13:11:01 -0400
+Message-ID: <19823.1024938661@sss.pgh.pa.us>
+From: Tom Lane <tgl@sss.pgh.pa.us>
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-5.3 required=5.0
+	tests=IN_REP_TO,X_NOT_PRESENT
+	version=2.30
+Status: OR
+
+Bruce Momjian <pgman@candle.pha.pa.us> writes:
+>> The only thing I've been able to think of that seems like it might
+>> improve matters is to make the WAL writing logic aware of the layout
+>> of buffer pages --- specifically, to know that our pages generally
+>> contain an uninteresting "hole" in the middle, and not write the hole.
+>> Optimistically this might reduce the WAL data volume by something
+>> approaching 50%; though pessimistically (if most pages are near full)
+>> it wouldn't help much.
+
+> Good idea.  How about putting the page through our TOAST compression
+> routine before writing it to WAL?  Should be pretty easy and fast and
+> doesn't require any knowledge of the page format.
+
+Easy, maybe, but fast definitely NOT.  The compressor is not speedy.
+Given that we have to be holding various locks while we build WAL
+records, I do not think it's a good idea to add CPU time there.
+
+Also, compressing already-compressed data is not a win ...
+
+			regards, tom lane
+
+
+From jrnield@usol.com Mon Jun 24 16:49:25 2002
+Return-path: <jrnield@usol.com>
+Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OKnNF23393
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 16:49:24 -0400 (EDT)
+Received: from 08-113.024.popsite.net (08-113.024.popsite.net [66.19.4.113])
+	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5OKnHV19100;
+	Mon, 24 Jun 2002 16:49:18 -0400
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+From: "J. R. Nield" <jrnield@usol.com>
+To: Curt Sampson <cjs@cynic.net>
+cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+In-Reply-To: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
+References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
+Content-Type: text/plain
+Content-Transfer-Encoding: 7bit
+X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
+Date: 24 Jun 2002 16:49:42 -0400
+Message-ID: <1024951786.1793.865.camel@localhost.localdomain>
+MIME-Version: 1.0
+Status: ORr
+
+On Sun, 2002-06-23 at 23:40, Curt Sampson wrote:
+> On 23 Jun 2002, J. R. Nield wrote:
+> 
+> > It is impossible to do what you want. You can not protect against
+> > partial writes without writing pages twice and calling fdatasync
+> > between them while going through a generic filesystem.
+> 
+> I agree with this.
+> 
+> > The best disk array will not protect you if the operating system does
+> > not align block writes to the structure of the underlying device.
+> 
+> This I don't quite understand. Assuming you're using a SCSI drive
+> (and this mostly applies to ATAPI/IDE, too), you can do naught but
+> align block writes to the structure of the underlying device. When you
+> initiate a SCSI WRITE command, you start by telling the device at which
+> block to start writing and how many blocks you intend to write. Then you
+> start passing the data.
+> 
+
+All I'm saying is that the entire postgresql block write must be
+converted into exactly one SCSI write command in all cases, and I don't
+know a portable way to ensure this. 
+
+> > Even with raw devices, you need special support or knowledge of the
+> > operating system and/or the disk device to ensure that each write
+> > request will be atomic to the underlying hardware.
+> 
+> Well, so here I guess you're talking about two things:
+> 
+>     1. When you request, say, an 8K block write, will the OS really
+>     write it to disk in a single 8K or multiple of 8K SCSI write
+>     command?
+> 
+>     2. Does the SCSI device you're writing to consider these writes to
+>     be transactional? That is, if the write is interrupted before being
+>     completed, does the SCSI device guarantee that the partially-sent
+>     data is not written, and the old data is maintained? And of course,
+>     does it guarantee that, when it acknowledges a write, that write is
+>     now in stable storage and will never go away?
+> 
+> Both of these are not hard to guarantee, actually. For a BSD-based OS,
+> for example, just make sure that your filesystem block size is the
+> same as or a multiple of the database block size. BSD will never write
+> anything other than a block or a sequence of blocks to a disk in a
+> single SCSI transaction (unless you've got a really odd SCSI driver).
+> And for your disk, buy a Baydel or Clarion disk array, or something
+> similar.
+> 
+> Given that it's not hard to set up a system that meets these criteria,
+> and this is in fact commonly done for database servers, it would seem a
+> good idea for postgres to have the option to take advantage of the time
+> and money spent and adjust its performance upward appropriately.
+
+I agree with this. My point was only that you need to know what
+guarantees your operating system/hardware combination provides on a
+case-by-case basis, and there is no standard way for a program to
+discover this. Most system administrators are not going to know this
+either, unless databases are their main responsibility.
+
+> 
+> > All other systems rely on the fact that you can recover a damaged file
+> > using the log archive.
+> 
+> Not exactly. For MS SQL Server, at any rate, if it detects a page tear
+> you cannot restore based on the log file alone. You need a full or
+> partial backup that includes that entire torn block.
+> 
+
+I should have been more specific: you need a backup of the file from
+some time ago, plus all the archived logs from then until the current
+log sequence number.
+
+> > This means downtime in the rare case, but no data loss. Until
+> > PostgreSQL can do this, then it will not be acceptable for real
+> > critical production use.
+> 
+> It seems to me that it is doing this right now. In fact, it's more
+> reliable than some commercial systems (such as SQL Server) because it can
+> recover from a torn block with just the logfile.
+
+Again, what I meant to say is that the commercial systems can recover
+with an old file backup + logs. How old the backup can be depends only
+on how much time you are willing to spend playing the logs forward. So
+if you do a full backup once a week, and multiplex and backup the logs,
+then even if a backup tape gets destroyed you can still survive. It just
+takes longer.
+
+Also, PostgreSQL can't recover from any other type of block corruption,
+while the commercial systems can. That's what I meant by the "critical
+production use" comment, which was sort-of unfair.
+
+So I would say they are equally reliable for torn pages (but not bad
+blocks), and the commercial systems let you trade potential recovery
+time for not having to write the blocks twice. You do need to back up
+the log archives though.
+
+> 
+> > But at the end of the day, unless you have complete understanding of
+> > the I/O system from write(2) through to the disk system, the only sure
+> > ways to protect against partial writes are by "careful writes" (in
+> > the WAL log or elsewhere, writing pages twice), or by requiring (and
+> > allowing) users to do log-replay recovery when a file is corrupted by
+> > a partial write.
+> 
+> I don't understand how, without a copy of the old data that was in the
+> torn block, you can restore that block from just log file entries. Can
+> you explain this to me? Take, as an example, a block with ten tuples,
+> only one of which has been changed "recently." (I.e., only that change
+> is in the log files.)
+>
+> 
+> > If we log pages to WAL, they are useless when archived (after a
+> > checkpoint). So either we have a separate "log" for them (the
+> > ping-pong file), or we should at least remove them when archived,
+> > which makes log archiving more complex but is perfectly doable.
+> 
+> Right. That seems to me a better option, since we've now got only one
+> write point on the disk rather than two.
+
+OK. I agree with this now.
+
+> 
+> > Finally, I would love to hear why we are using the operating system
+> > buffer manager at all. The OS is acting as a secondary buffer manager
+> > for us. Why is that? What flaw in our I/O system does this reveal?
+> 
+> It's acting as a "second-level" buffer manager, yes, but to say it's
+> "secondary" may be a bit misleading. On most of the systems I've set
+> up, the OS buffer cache is doing the vast majority of the work, and the
+> postgres buffering is fairly minimal.
+> 
+> There are some good (and some perhaps not-so-good) reasons to do it this
+> way. I'll list them more or less in the order of best to worst:
+> 
+>     1. The OS knows where the blocks physically reside on disk, and
+>     postgres does not. Therefore it's in the interest of postgresql to
+>     dispatch write responsibility back to the OS as quickly as possible
+>     so that the OS can prioritize requests appropriately. Most operating
+>     systems use an "elevator" algorithm to minimize disk head movement;
+>     but if the OS does not have a block that it could write while the
+>     head is "on the way" to another request, it can't write it in that
+>     head pass.
+> 
+>     2. Postgres does not know about any "bank-switching" tricks for
+>     mapping more physical memory than it has address space. Thus, on
+>     32-bit machines, postgres might be limited to mapping 2 or 3 GB of
+>     memory, even though the machine has, say, 6 GB of physical RAM. The
+>     OS can use all of the available memory for caching; postgres cannot.
+> 
+>     3. A lot of work has been put into the seek algorithms, read-ahead
+>     algorithms, block allocation algorithms, etc. in the OS. Why
+>     duplicate all that work again in postgres?
+> 
+> When you say things like the following:
+> 
+> > We should only be writing blocks when they need to be on disk. We
+> > should not be expecting the OS to write them "sometime later" and
+> > avoid blocking (as long) for the write. If we need that, then our
+> > buffer management is wrong and we need to fix it.
+> 
+> you appear to be making the argument that we should take the route of
+> other database systems, and use raw devices and our own management of
+> disk block allocation. If so, you might want first to look back through
+> the archives at the discussion I and several others had about this a
+> month or two ago. After looking in detail at what NetBSD, at least, does
+> in terms of its disk I/O algorithms and buffering, I've pretty much come
+> around, at least for the moment, to the attitude that we should stick
+> with using the OS. I wouldn't mind seeing postgres be able to manage all
+> of this stuff, but it's a *lot* of work for not all that much benefit
+> that I can see.
+
+I'll back off on that. I don't know if we want to use the OS buffer
+manager, but shouldn't we try to have our buffer manager group writes
+together by files, and proactively get them out to disk? Right now, it
+looks like all our write requests are delayed as long as possible and
+the order in which they are written is pretty much random, as is the
+backend that writes the block, so there is no locality of reference even
+when the blocks are adjacent on disk, and the write calls are spread out
+over all the backends.
+
+Would it not be the case that things like read-ahead, grouping writes,
+and caching written data are probably best done by PostgreSQL, because
+only our buffer manager can understand when they will be useful or when
+they will thrash the cache?
+
+I may well be wrong on this, and I haven't done any performance
+testing. I shouldn't have brought this up alongside the logging issues,
+but there seemed to be some question about whether the OS was actually
+doing all these things behind the scenes.
+
+
+> 
+> > The ORACLE people were not kidding when they said that they could not
+> > certify Linux for production use until it supported O_DSYNC. Can you
+> > explain why that was the case?
+> 
+> I'm suspecting it's because Linux at the time had no raw devices, so
+> O_DSYNC was the only other possible method of making sure that disk
+> writes actually got to disk.
+> 
+> You certainly don't want to use O_DSYNC if you can use another method,
+> because O_DSYNC still goes through the operating system's buffer
+> cache, wasting memory and double-caching things. If you're doing your
+> own management, you need either to use a raw device or open files with
+> the flag that indicates that the buffer cache should not be used at all
+> for reads from and writes to that file.
+
+Would O_DSYNC|O_RSYNC turn off the cache? 
+
+> 
+> > However, this discussion and a search of the pgsql-hackers archives
+> > reveals this problem to be the KEY area of PostgreSQL's failing, and
+> > general misunderstanding, when compared to its commercial competitors.
+> 
+> No, I think it's just that you're under a few minor misapprehensions
+> here about what postgres and the OS are actually doing. As I said, I
+> went through this whole exact argument a month or two ago, on this very
+> list, and I came around to the idea that what postgres is doing now
+> works quite well, at least on NetBSD. (Most other OSes have disk I/O
+> algorithms that are pretty much as good or better.) There might be a
+> very slight advantage to doing all one's own I/O management, but it's
+> a huge amount of work, and I think that much effort could be much more
+> usefully applied to other areas.
+
+I will look for that discussion in the archives.
+
+The logging issue is a key one I think. At least I would be very nervous
+as a DBA if I were running a system where any damaged file would cause
+data loss.
+
+Does anyone know what the major barriers to infinite log replay are in
+PostgreSQL? I'm trying to look for everything that might need to be
+changed outside xlog.c, but surely this has come up before. Searching
+the archives hasn't revealed much.
+
+
+
+As to the I/O issue:
+
+Since you know a lot about NetBSD internals, I'd be interested in
+hearing about what postgresql looks like to the NetBSD buffer manager.
+Am I right that strings of successive writes get randomized? What do our
+cache-hit percentages look like? I'm going to do some experimenting with
+this.
+
+> 
+> Just as a side note, I've been a NetBSD developer since about '96,
+> and have been delving into the details of OS design since well before
+> that time, so I'm coming to this with what I hope is reasonably good
+> knowledge of how disks work and how operating systems use them. (Not
+> that this should stop you from pointing out holes in my arguments. :-))
+> 
+
+This stuff is very difficult to get right. Glad to know you follow this
+list.
+
+
+> cjs
+> -- 
+> Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+>     Don't you know, in this new Dark Age, we're all light.  --XTC
+> 
+-- 
+J. R. Nield
+jrnield@usol.com
+
+
+
+
+From tgl@sss.pgh.pa.us Mon Jun 24 17:16:06 2002
+Return-path: <tgl@sss.pgh.pa.us>
+Received: from sss.pgh.pa.us (root@[192.204.191.242])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLG5F25284
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:16:05 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLG2121379;
+	Mon, 24 Jun 2002 17:16:02 -0400 (EDT)
+To: "J. R. Nield" <jrnield@usol.com>
+cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
+In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> 
+References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net> <1024951786.1793.865.camel@localhost.localdomain>
+Comments: In-reply-to "J. R. Nield" <jrnield@usol.com>
+	message dated "24 Jun 2002 16:49:42 -0400"
+Date: Mon, 24 Jun 2002 17:16:01 -0400
+Message-ID: <21376.1024953361@sss.pgh.pa.us>
+From: Tom Lane <tgl@sss.pgh.pa.us>
+Status: OR
+
+"J. R. Nield" <jrnield@usol.com> writes:
+> Also, postgreSQL can't recover from any other type of block corruption,
+> while the commercial systems can.
+
+Say again?
+
+> Would it not be the case that things like read-ahead, grouping writes,
+> and caching written data are probably best done by PostgreSQL, because
+> only our buffer manager can understand when they will be useful or when
+> they will thrash the cache?
+
+I think you have been missing the point.  No one denies that there will
+be some incremental gain if we do all that.  However, the conclusion of
+everyone who has thought much about it (and I see Curt has joined that
+group) is that the effort would be far out of proportion to the probable
+gain.  There are a lot of other things we desperately need to spend time
+on that would not amount to re-engineering large quantities of OS-level
+code.  Given that most Unixen have perfectly respectable disk management
+subsystems, we prefer to tune our code to make use of that stuff, rather
+than follow the "conventional wisdom" that databases need to bypass it.
+
+Oracle can afford to do that sort of thing because they have umpteen
+thousand developers available.  Postgres does not.
+
+			regards, tom lane
+
+From pgsql-hackers-owner+M24128@postgresql.org Mon Jun 24 22:01:58 2002
+Return-path: <pgsql-hackers-owner+M24128@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P21vF19918
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 22:01:57 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id 540B8475B33; Mon, 24 Jun 2002 21:34:40 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 21:34:40 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 0A13F476965; Mon, 24 Jun 2002 19:30:14 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id B4F62476E4A
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 18:53:59 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 18:53:59 2002
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+	by postgresql.org (Postfix) with ESMTP id 36043475BF6
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 17:25:28 -0400 (EDT)
+Received: (from pgman@localhost)
+	by candle.pha.pa.us (8.11.6/8.10.1) id g5OLPFG26140;
+	Mon, 24 Jun 2002 17:25:15 -0400 (EDT)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200206242125.g5OLPFG26140@candle.pha.pa.us>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain>
+To: "J. R. Nield" <jrnield@usol.com>
+Date: Mon, 24 Jun 2002 17:25:14 -0400 (EDT)
+cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-3.4 required=5.0
+	tests=IN_REP_TO
+	version=2.30
+Status: OR
+
+J. R. Nield wrote:
+> > This I don't quite understand. Assuming you're using a SCSI drive
+> > (and this mostly applies to ATAPI/IDE, too), you can do naught but
+> > align block writes to the structure of the underlying device. When you
+> > initiate a SCSI WRITE command, you start by telling the device at which
+> > block to start writing and how many blocks you intend to write. Then you
+> > start passing the data.
+> > 
+> 
+> All I'm saying is that the entire postgresql block write must be
+> converted into exactly one SCSI write command in all cases, and I don't
+> know a portable way to ensure this. 
+
+...
+
+> I agree with this. My point was only that you need to know what
+> guarantees your operating system/hardware combination provides on a
+> case-by-case basis, and there is no standard way for a program to
+> discover this. Most system administrators are not going to know this
+> either, unless databases are their main responsibility.
+
+Yes, agreed.  <1% are going to know the answer to this question so we
+have to assume worst case.
+
+> > It seems to me that it is doing this right now. In fact, it's more
+> > reliable than some commercial systems (such as SQL Server) because it can
+> > recover from a torn block with just the logfile.
+> 
+> Again, what I meant to say is that the commercial systems can recover
+> with an old file backup + logs. How old the backup can be depends only
+> on how much time you are willing to spend playing the logs forward. So
+> if you do a full backup once a week, and multiplex and backup the logs,
+> then even if a backup tape gets destroyed you can still survive. It just
+> takes longer.
+> 
+> Also, postgreSQL can't recover from any other type of block corruption,
+> while the commercial systems can. That's what I meant by the "critical
+> production use" comment, which was sort-of unfair.
+> 
+> So I would say they are equally reliable for torn pages (but not bad
+> blocks), and the commercial systems let you trade potential recovery
+> time for not having to write the blocks twice. You do need to back-up
+> the log archives though.
+
+Yes, good tradeoff analysis.  We recover from partial writes quicker,
+and don't require saving of log files, _but_ we don't recover from bad
+disk blocks.  Good summary.
+
+> I'll back off on that. I don't know if we want to use the OS buffer
+> manager, but shouldn't we try to have our buffer manager group writes
+> together by files, and pro-actively get them out to disk? Right now, it
+> looks like all our write requests are delayed as long as possible and
+> the order in which they are written is pretty-much random, as is the
+> backend that writes the block, so there is no locality of reference even
+> when the blocks are adjacent on disk, and the write calls are spread-out
+> over all the backends.
+> 
+> Would it not be the case that things like read-ahead, grouping writes,
+> and caching written data are probably best done by PostgreSQL, because
+> only our buffer manager can understand when they will be useful or when
+> they will thrash the cache?
+
+The OS should handle all of this.  We are doing main table writes but no
+sync until checkpoint, so the OS can keep those blocks around and write
+them at its convenience.  It knows the size of the buffer cache and when
+stuff is forced to disk.  We can't second-guess that.
+
+> I may likely be wrong on this, and I haven't done any performance
+> testing. I shouldn't have brought this up alongside the logging issues,
+> but there seemed to be some question about whether the OS was actually
+> doing all these things behind the scene.
+
+It had better.  Looking at the kernel source is the way to know.
+
+> Does anyone know what the major barriers to infinite log replay are in
+> PostgreSQL? I'm trying to look for everything that might need to be
+> changed outside xlog.c, but surely this has come up before. Searching
+> the archives hasn't revealed much.
+
+This has been brought up.  Could we just save WAL files and get replay? 
+I believe some things have to be added to WAL to allow this, but it
+seems possible.  However, the pg_dump is just a data dump and does not
+have the file offsets and things.  Somehow you would need a tar-type
+backup of the database, and with a running db, it is hard to get a valid
+snapshot of that.
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 3: if posting/reading through Usenet, please send an appropriate
+subscribe-nomail command to majordomo@postgresql.org so that your
+message can get through to the mailing list cleanly
+
+
+
+From tgl@sss.pgh.pa.us Mon Jun 24 17:31:57 2002
+Return-path: <tgl@sss.pgh.pa.us>
+Received: from sss.pgh.pa.us (root@[192.204.191.242])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLVuF26684
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:31:56 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLVu121485;
+	Mon, 24 Jun 2002 17:31:56 -0400 (EDT)
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: "J. R. Nield" <jrnield@usol.com>, Curt Sampson <cjs@cynic.net>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
+In-Reply-To: <200206242125.g5OLPFG26140@candle.pha.pa.us> 
+References: <200206242125.g5OLPFG26140@candle.pha.pa.us>
+Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
+	message dated "Mon, 24 Jun 2002 17:25:14 -0400"
+Date: Mon, 24 Jun 2002 17:31:56 -0400
+Message-ID: <21482.1024954316@sss.pgh.pa.us>
+From: Tom Lane <tgl@sss.pgh.pa.us>
+Status: ORr
+
+Bruce Momjian <pgman@candle.pha.pa.us> writes:
+>> Does anyone know what the major barriers to infinite log replay are in
+>> PostgreSQL? I'm trying to look for everything that might need to be
+>> changed outside xlog.c, but surely this has come up before. Searching
+>> the archives hasn't revealed much.
+
+> This has been brought up.  Could we just save WAL files and get replay? 
+> I believe some things have to be added to WAL to allow this, but it
+> seems possible.
+
+The Red Hat group has been looking at this somewhat; so far there seem
+to be some minor tweaks that would be needed, but no showstoppers.
+
+> Somehow you would need a tar-type
+> backup of the database, and with a running db, it is hard to get a valid
+> snapshot of that.
+
+But you don't *need* a "valid snapshot", only a correct copy of
+every block older than the first checkpoint in your WAL log series.
+Any inconsistencies in your tar dump will look like repairable damage;
+replaying the WAL log will fix 'em.
+
+			regards, tom lane
+
+From pgsql-hackers-owner+M24131@postgresql.org Mon Jun 24 21:15:06 2002
+Return-path: <pgsql-hackers-owner+M24131@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P1F5F15390
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 21:15:05 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id B76174768CC; Mon, 24 Jun 2002 20:59:56 -0400 (EDT)
+Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 20:59:56 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 8724C47742E; Mon, 24 Jun 2002 20:17:44 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 4E472476875
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 18:37:46 -0400 (EDT)
+Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 18:37:46 2002
+Received: from sss.pgh.pa.us (unknown [192.204.191.242])
+	by postgresql.org (Postfix) with ESMTP id CFCC9476A7A
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 17:32:02 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLVu121485;
+	Mon, 24 Jun 2002 17:31:56 -0400 (EDT)
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: "J. R. Nield" <jrnield@usol.com>, Curt Sampson <cjs@cynic.net>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
+In-Reply-To: <200206242125.g5OLPFG26140@candle.pha.pa.us> 
+References: <200206242125.g5OLPFG26140@candle.pha.pa.us>
+Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
+	message dated "Mon, 24 Jun 2002 17:25:14 -0400"
+Date: Mon, 24 Jun 2002 17:31:56 -0400
+Message-ID: <21482.1024954316@sss.pgh.pa.us>
+From: Tom Lane <tgl@sss.pgh.pa.us>
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-5.3 required=5.0
+	tests=IN_REP_TO,X_NOT_PRESENT
+	version=2.30
+Status: OR
+
+Bruce Momjian <pgman@candle.pha.pa.us> writes:
+>> Does anyone know what the major barriers to infinite log replay are in
+>> PostgreSQL? I'm trying to look for everything that might need to be
+>> changed outside xlog.c, but surely this has come up before. Searching
+>> the archives hasn't revealed much.
+
+> This has been brought up.  Could we just save WAL files and get replay? 
+> I believe some things have to be added to WAL to allow this, but it
+> seems possible.
+
+The Red Hat group has been looking at this somewhat; so far there seem
+to be some minor tweaks that would be needed, but no showstoppers.
+
+> Somehow you would need a tar-type
+> backup of the database, and with a running db, it is hard to get a valid
+> snapshot of that.
+
+But you don't *need* a "valid snapshot", only a correct copy of
+every block older than the first checkpoint in your WAL log series.
+Any inconsistencies in your tar dump will look like repairable damage;
+replaying the WAL log will fix 'em.
+
+			regards, tom lane
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
+
+
+
+From pgsql-hackers-owner+M24133@postgresql.org Mon Jun 24 22:19:55 2002
+Return-path: <pgsql-hackers-owner+M24133@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P2JsF21543
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 22:19:54 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id 42391476E53; Mon, 24 Jun 2002 22:09:49 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 22:09:49 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 191654774EB; Mon, 24 Jun 2002 20:26:08 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 8EB90476101
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 19:43:19 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 19:43:19 2002
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+	by postgresql.org (Postfix) with ESMTP id 08018476931
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 17:33:53 -0400 (EDT)
+Received: (from pgman@localhost)
+	by candle.pha.pa.us (8.11.6/8.10.1) id g5OLXhl26908;
+	Mon, 24 Jun 2002 17:33:43 -0400 (EDT)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200206242133.g5OLXhl26908@candle.pha.pa.us>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <21482.1024954316@sss.pgh.pa.us>
+To: Tom Lane <tgl@sss.pgh.pa.us>
+Date: Mon, 24 Jun 2002 17:33:43 -0400 (EDT)
+cc: "J. R. Nield" <jrnield@usol.com>, Curt Sampson <cjs@cynic.net>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-3.4 required=5.0
+	tests=IN_REP_TO
+	version=2.30
+Status: OR
+
+Tom Lane wrote:
+> Bruce Momjian <pgman@candle.pha.pa.us> writes:
+> >> Does anyone know what the major barriers to infinite log replay are in
+> >> PostgreSQL? I'm trying to look for everything that might need to be
+> >> changed outside xlog.c, but surely this has come up before. Searching
+> >> the archives hasn't revealed much.
+> 
+> > This has been brought up.  Could we just save WAL files and get replay? 
+> > I believe some things have to be added to WAL to allow this, but it
+> > seems possible.
+> 
+> The Red Hat group has been looking at this somewhat; so far there seem
+> to be some minor tweaks that would be needed, but no showstoppers.
+
+
+Good.
+
+> > Somehow you would need a tar-type
+> > backup of the database, and with a running db, it is hard to get a valid
+> > snapshot of that.
+> 
+> But you don't *need* a "valid snapshot", only a correct copy of
+> every block older than the first checkpoint in your WAL log series.
+> Any inconsistencies in your tar dump will look like repairable damage;
+> replaying the WAL log will fix 'em.
+
+Yes, my point was that you need physical file backups, not pg_dump, and
+you have to be tricky about the files changing during the backup.  You
+_can_ work around changes to the files during backup.
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
+
+
+
+From pgsql-hackers-owner+M24139@postgresql.org Tue Jun 25 00:00:22 2002
+Return-path: <pgsql-hackers-owner+M24139@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P40LF00838
+	for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 00:00:21 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id CBAE8476E94; Mon, 24 Jun 2002 23:44:51 -0400 (EDT)
+Mailbox-Line: From jrnield@usol.com  Mon Jun 24 23:44:51 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id C5076476871; Mon, 24 Jun 2002 22:25:46 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 8DF57476979
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 22:08:31 -0400 (EDT)
+Mailbox-Line: From jrnield@usol.com  Mon Jun 24 22:08:31 2002
+Received: from hades.usol.com (hades.usol.com [208.232.58.41])
+	by postgresql.org (Postfix) with ESMTP id 298D2476101
+	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 20:27:46 -0400 (EDT)
+Received: from 08-159.024.popsite.net (08-159.024.popsite.net [66.19.4.159])
+	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5P0RbV01261;
+	Mon, 24 Jun 2002 20:27:37 -0400
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+From: "J. R. Nield" <jrnield@usol.com>
+To: Tom Lane <tgl@sss.pgh.pa.us>
+cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+In-Reply-To: <21376.1024953361@sss.pgh.pa.us>
+References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
+	<1024951786.1793.865.camel@localhost.localdomain> 
+	<21376.1024953361@sss.pgh.pa.us>
+Content-Type: text/plain
+Content-Transfer-Encoding: 7bit
+X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
+Date: 24 Jun 2002 20:28:00 -0400
+Message-ID: <1024964884.3031.876.camel@localhost.localdomain>
+MIME-Version: 1.0
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-3.4 required=5.0
+	tests=IN_REP_TO
+	version=2.30
+Status: OR
+
+On Mon, 2002-06-24 at 17:16, Tom Lane wrote:
+ 
+> I think you have been missing the point...  
+Yes, this appears to be the case. Thanks especially to Curt for clearing
+things up for me.
+
+-- 
+J. R. Nield
+jrnield@usol.com
+
+
+
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 5: Have you checked our extensive FAQ?
+
+http://www.postgresql.org/users-lounge/docs/faq.html
+
+
+
+From jrnield@usol.com Mon Jun 24 20:27:45 2002
+Return-path: <jrnield@usol.com>
+Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P0RhF10711
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 20:27:44 -0400 (EDT)
+Received: from 08-159.024.popsite.net (08-159.024.popsite.net [66.19.4.159])
+	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5P0RbV01261;
+	Mon, 24 Jun 2002 20:27:37 -0400
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+From: "J. R. Nield" <jrnield@usol.com>
+To: Tom Lane <tgl@sss.pgh.pa.us>
+cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+In-Reply-To: <21376.1024953361@sss.pgh.pa.us>
+References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
+	<1024951786.1793.865.camel@localhost.localdomain> 
+	<21376.1024953361@sss.pgh.pa.us>
+Content-Type: text/plain
+Content-Transfer-Encoding: 7bit
+X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
+Date: 24 Jun 2002 20:28:00 -0400
+Message-ID: <1024964884.3031.876.camel@localhost.localdomain>
+MIME-Version: 1.0
+Status: OR
+
+On Mon, 2002-06-24 at 17:16, Tom Lane wrote:
+ 
+> I think you have been missing the point...  
+Yes, this appears to be the case. Thanks especially to Curt for clearing
+things up for me.
+
+-- 
+J. R. Nield
+jrnield@usol.com
+
+
+
+
+From cjs@cynic.net Mon Jun 24 23:32:23 2002
+Return-path: <cjs@cynic.net>
+Received: from academic.cynic.net ([63.144.177.3])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P3WMF28287
+	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 23:32:23 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+	by academic.cynic.net (Postfix) with ESMTP
+	id 28AB5F820; Tue, 25 Jun 2002 03:32:08 +0000 (UTC)
+Date: Tue, 25 Jun 2002 12:32:05 +0900 (JST)
+From: Curt Sampson <cjs@cynic.net>
+To: "J. R. Nield" <jrnield@usol.com>
+cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain>
+Message-ID: <Pine.NEB.4.43.0206251229010.17448-100000@angelic.cynic.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On 24 Jun 2002, J. R. Nield wrote:
+
+> All I'm saying is that the entire postgresql block write must be
+> converted into exactly one SCSI write command in all cases, and I don't
+> know a portable way to ensure this.
+
+No, there's no portable way. All you can do is give the admin who
+is able to set things up safely the ability to turn off the now-unneeded
+(and expensive) safety-related stuff that postgres does.
+
+> I agree with this. My point was only that you need to know what
+> guarantees your operating system/hardware combination provides on a
+> case-by-case basis, and there is no standard way for a program to
+> discover this. Most system administrators are not going to know this
+> either, unless databases are their main responsibility.
+
+Certainly this is true of pretty much every database system out there.
+
+cjs
+-- 
+Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+    Don't you know, in this new Dark Age, we're all light.  --XTC
+
+
+From cjs@cynic.net Tue Jun 25 01:09:02 2002
+Return-path: <cjs@cynic.net>
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P591F07292
+	for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 01:09:01 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+	by academic.cynic.net (Postfix) with ESMTP
+	id 517BEF820; Tue, 25 Jun 2002 05:09:02 +0000 (UTC)
+Date: Tue, 25 Jun 2002 14:08:59 +0900 (JST)
+From: Curt Sampson <cjs@cynic.net>
+To: Tom Lane <tgl@sss.pgh.pa.us>
+cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
+In-Reply-To: <21376.1024953361@sss.pgh.pa.us>
+Message-ID: <Pine.NEB.4.43.0206251406390.17448-100000@angelic.cynic.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: ORr
+
+On Mon, 24 Jun 2002, Tom Lane wrote:
+
+> There are a lot of other things we desperately need to spend time
+> on that would not amount to re-engineering large quantities of OS-level
+> code.  Given that most Unixen have perfectly respectable disk management
+> subsystems, we prefer to tune our code to make use of that stuff, rather
+> than follow the "conventional wisdom" that databases need to bypass it.
+> ...
+> Oracle can afford to do that sort of thing because they have umpteen
+> thousand developers available.  Postgres does not.
+
+Well, Oracle also started out, a long long time ago, on systems without
+unified buffer cache and so on, and so they *had* to write this stuff
+because otherwise data would not be cached. So Oracle can also afford to
+maintain it now because the code already exists.
+
+cjs
+-- 
+Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
+    Don't you know, in this new Dark Age, we're all light.  --XTC
+
+
+From pgsql-hackers-owner+M24154@postgresql.org Tue Jun 25 09:22:38 2002
+Return-path: <pgsql-hackers-owner+M24154@postgresql.org>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PDMbF03932
+	for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 09:22:37 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP
+	id C12C3475E4A; Tue, 25 Jun 2002 09:22:32 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Tue Jun 25 09:22:32 2002
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by postgresql.org (Postfix) with SMTP
+	id 65471475C7A; Tue, 25 Jun 2002 09:22:23 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+	by localhost (Postfix) with ESMTP id 97C8C475A7C
+	for <pgsql-hackers@postgresql.org>; Tue, 25 Jun 2002 09:22:20 -0400 (EDT)
+Mailbox-Line: From pgman@candle.pha.pa.us  Tue Jun 25 09:22:20 2002
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+	by postgresql.org (Postfix) with ESMTP id 42C0B475A64
+	for <pgsql-hackers@postgresql.org>; Tue, 25 Jun 2002 09:22:19 -0400 (EDT)
+Received: (from pgman@localhost)
+	by candle.pha.pa.us (8.11.6/8.10.1) id g5PDM5B03772;
+	Tue, 25 Jun 2002 09:22:05 -0400 (EDT)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200206251322.g5PDM5B03772@candle.pha.pa.us>
+Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
+In-Reply-To: <Pine.NEB.4.43.0206251406390.17448-100000@angelic.cynic.net>
+To: Curt Sampson <cjs@cynic.net>
+Date: Tue, 25 Jun 2002 09:22:05 -0400 (EDT)
+cc: Tom Lane <tgl@sss.pgh.pa.us>, "J. R. Nield" <jrnield@usol.com>,
+   PostgreSQL Hacker <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Spam-Status: No, hits=-3.4 required=5.0
+	tests=IN_REP_TO
+	version=2.30
+Status: OR
+
+Curt Sampson wrote:
+> On Mon, 24 Jun 2002, Tom Lane wrote:
+> 
+> > There are a lot of other things we desperately need to spend time
+> > on that would not amount to re-engineering large quantities of OS-level
+> > code.  Given that most Unixen have perfectly respectable disk management
+> > subsystems, we prefer to tune our code to make use of that stuff, rather
+> > than follow the "conventional wisdom" that databases need to bypass it.
+> > ...
+> > Oracle can afford to do that sort of thing because they have umpteen
+> > thousand developers available.  Postgres does not.
+> 
+> Well, Oracle also started out, a long long time ago, on systems without
+> unified buffer cache and so on, and so they *had* to write this stuff
+> because otherwise data would not be cached. So Oracle can also afford to
+> maintain it now because the code already exists.
+
+Well, actually, it isn't unified buffer cache that is the issue, but
+rather the older SysV file system had pretty poor performance so
+bypassing it was a bigger win than it is today.
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 4: Don't 'kill -9' the postmaster
+
+
+