From 543539bb35c45e2b8783daf296c83c3696ad2834 Mon Sep 17 00:00:00 2001 From: Bruce Momjian Date: Mon, 26 Aug 2002 23:14:15 +0000 Subject: [PATCH] Add discussion of pre-write pages to WAL. --- doc/TODO.detail/wal | 2700 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 2700 insertions(+) create mode 100644 doc/TODO.detail/wal diff --git a/doc/TODO.detail/wal b/doc/TODO.detail/wal new file mode 100644 index 0000000000..6842d8daab --- /dev/null +++ b/doc/TODO.detail/wal @@ -0,0 +1,2700 @@ +From cjs@cynic.net Thu Jun 20 22:18:27 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5L2IPo22195 + for ; Thu, 20 Jun 2002 22:18:26 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 88216F821; Fri, 21 Jun 2002 02:18:17 +0000 (UTC) +Date: Fri, 21 Jun 2002 11:18:14 +0900 (JST) +From: Curt Sampson +To: Bruce Momjian +cc: Michael Loftis , mlw , + PostgreSQL-development +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On Thu, 20 Jun 2002, Bruce Momjian wrote: + +> > MS SQL Server has an interesting way of dealing with this. They have a +> > "torn" bit in each 512-byte chunk of a page, and this bit is set the +> > same for each chunk. When they are about to write out a page, they first +> > flip all of the torn bits and then do the write. If the write does not +> > complete due to a system crash or whatever, this can be detected later +> > because the torn bits won't match across the entire page. +> +> I was wondering, how does knowing the block is corrupt help MS SQL? + +I'm trying to recall, but I can't off hand. I'll have to look it +up in my Inside SQL Server book, which is at home right now, +unfortunately. I'll bring the book into work and let you know the +details later. + +> Right now, we write changed pages to WAL, then later write them to disk. + +Ah. You write the entire page? MS writes only the changed tuple. +And DB2, in fact, goes one better and writes only the part of the +tuple up to the change, IIRC. Thus, if you put smaller and/or more +frequently changed columns first, you'll have smaller logs. + +> I have always been looking for a way to prevent these WAL writes. The +> 512-byte bit seems interesting, but how does it help? + +Well, this would at least let you reduce the write to the 512-byte +chunk that changed, rather than writing the entire 8K page. + +> And how does the bit help them with partial block writes? Is the bit at +> the end of the block? Is that reliable? + +The bit is somewhere within every 512 byte "disk page" within the +8192 byte "filesystem/database page." So an 8KB page is divided up +like this: + + | <----------------------- 8 Kb ----------------------> | + + | 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b | + +Thus, the tear bits start out like this: + + | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | + +After a successful write of the entire page, you have this: + + | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | + +If the write is unsuccessful, you end up with something like this: + + | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | + +And now you know which parts of your page got written, and which +parts didn't. + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From cjs@cynic.net Sat Jun 22 04:41:54 2002 +Return-path: +Received: from academic.cynic.net ([63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5M8fpF04711 + for ; Sat, 22 Jun 2002 04:41:53 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 415C8F820; Sat, 22 Jun 2002 08:41:33 +0000 (UTC) +Date: Sat, 22 Jun 2002 17:41:30 +0900 (JST) +From: Curt Sampson +To: Tom Lane +cc: Bruce Momjian , Michael Loftis , + mlw , + PostgreSQL-development +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <19332.1024668861@sss.pgh.pa.us> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On Fri, 21 Jun 2002, Tom Lane wrote: + +> Curt Sampson writes: +> > And now you know which parts of your page got written, and which +> > parts didn't. +> +> Yes ... and what do you *do* about it? + +Ok. Here's the extract from _Inside Microsoft SQL Server 7.0_, page 207: + + torn page detection When TRUE, this option causes a bit to be + flipped for each 512-byte sector in a database page (8 KB) + whenever the page is written to disk. This option allows + SQL Server to detect incomplete I/O operations caused by + power failures or other system outages. If a bit is in the + wrong state when the page is later read by SQL Server, this + means the page was written incorrectly; a torn page has + been detected. Although SQL Server database pages are 8 + KB, disks perform I/O operations using 512-byte sectors. + Therefore, 16 sectors are written per database page. A + torn page can occur if the system crashes (for example, + because of power failure) between the time the operating + system writes the first 512-byte sector to disk and the + completion of the 8-KB I/O operation. If the first sector + of a database page is successfully written before the crash, + it will appear that the database page on disk was updated, + although it might not have succeeded. Using battery-backed + disk caches can ensure that data is [sic] successfully + written to disk or not written at all. In this case, don't + set torn page detection to TRUE, as it isn't needed. If a + torn page is detected, the database will need to be restored + from backup because it will be physically inconsistent. + +As I understand it, this is not a problem for postgres becuase the +entire page is written to the log. So postgres is safe, but quite +inefficient. (It would be much more efficient to write just the +changed tuple, or even just the changed values within the tuple, +to the log.) + +Adding these torn bits would allow posgres at least to write to +the log just the 512-byte sectors that have changed, rather than +the entire 8 KB page. + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From pgsql-hackers-owner+M24060@postgresql.org Sat Jun 22 18:31:21 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5MMVKF20014 + for ; Sat, 22 Jun 2002 18:31:20 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 0ADFE476090; Sat, 22 Jun 2002 18:31:10 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 6B372475A96; Sat, 22 Jun 2002 18:28:42 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 47AD2475935 + for ; Sat, 22 Jun 2002 18:28:40 -0400 (EDT) +Received: from hades.usol.com (hades.usol.com [208.232.58.41]) + by postgresql.org (Postfix) with ESMTP id 1D5DA476166 + for ; Sat, 22 Jun 2002 18:23:16 -0400 (EDT) +Received: from 01-081.024.popsite.net (01-081.024.popsite.net [216.126.160.81]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5MMMOj11344; + Sat, 22 Jun 2002 18:22:25 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Bruce Momjian +cc: Curt Sampson , Michael Loftis , + mlw , + PostgreSQL Hacker , + Tom Lane +In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us> +References: <200206210158.g5L1wFk20118@candle.pha.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +Message-ID: <1024784514.1793.242.camel@localhost.localdomain> +MIME-Version: 1.0 +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 22 Jun 2002 18:22:58 -0400 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +Status: ORr + +On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote: +> I was wondering, how does knowing the block is corrupt help MS SQL? +> Right now, we write changed pages to WAL, then later write them to disk. +> I have always been looking for a way to prevent these WAL writes. The +> 512-byte bit seems interesting, but how does it help? +> +> And how does the bit help them with partial block writes? Is the bit at +> the end of the block? Is that reliable? +> + +My understanding of this is as follows: + +1) On most commercial systems, if you get a corrupted block (from +partial write or whatever) you need to restore the file(s) from the most +recent backup, and replay the log from the log archive (usually only the +damaged files will be written to during replay). + +2) If you can't deal with the downtime to recover the file, then EMC, +Sun, or IBM will sell you an expensive disk array with an NVRAM cache +that will do atomic writes. Some plain-vanilla SCSI disks are also +capable of atomic writes, though usually they don't use NVRAM to do it. + +The database must then make sure that each page-write gets translated +into exactly one SCSI-level write. This is one reason why ORACLE and +Sybase recommend that you use raw disk partitions for high availability. +Some operating systems support this through the filesystem, but it is OS +dependent. I think Solaris 7 & 8 has support for this, but I'm not sure. + +PostgreSQL has trouble because it can neither archive logs for replay, +nor use raw disk partitions. + + +One other point: + +Page pre-image logging is fundamentally the same as what Jim Grey's +book[1] would call "careful writes". I don't believe they should be in +the XLOG, because we never need to keep the pre-images after we're sure +the buffer has made it to the disk. Instead, we should have the buffer +IO routines implement ping-pong writes of some kind if we want +protection from partial writes. + + +Does any of this make sense? + + + +;jrnield + + +[1] Grey, J. and Reuter, A. (1993). "Transaction Processing: Concepts + and Techniques". Morgan Kaufmann. + +-- +J. R. Nield +jrnield@usol.com + + + + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster + +From pgsql-hackers-owner+M24068@postgresql.org Sun Jun 23 08:40:27 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCeQF01601 + for ; Sun, 23 Jun 2002 08:40:27 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 8AC4B475CBC; Sun, 23 Jun 2002 08:40:22 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 4683647599D; Sun, 23 Jun 2002 08:37:40 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 0D57847592A + for ; Sun, 23 Jun 2002 08:37:38 -0400 (EDT) +Received: from hades.usol.com (hades.usol.com [208.232.58.41]) + by postgresql.org (Postfix) with ESMTP id 75326475876 + for ; Sun, 23 Jun 2002 08:37:36 -0400 (EDT) +Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111; + Sun, 23 Jun 2002 08:37:23 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Bruce Momjian +cc: Curt Sampson , Michael Loftis , + mlw , + PostgreSQL Hacker , + Tom Lane +In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us> +References: <200206222317.g5MNHBn23427@candle.pha.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 08:37:53 -0400 +Message-ID: <1024835880.1793.264.camel@localhost.localdomain> +MIME-Version: 1.0 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +Status: OR + +On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: +> J. R. Nield wrote: +> > One other point: +> > +> > Page pre-image logging is fundamentally the same as what Jim Grey's +> > book[1] would call "careful writes". I don't believe they should be in +> > the XLOG, because we never need to keep the pre-images after we're sure +> > the buffer has made it to the disk. Instead, we should have the buffer +> > IO routines implement ping-pong writes of some kind if we want +> > protection from partial writes. +> +> Ping-pong writes to where? We have to fsync, and rather than fsync that +> area and WAL, we just do WAL. Not sure about a win there. +> + +The key question is: do we have some method to ensure that the OS +doesn't do the writes in parallel? + +If the OS will ensure that one of the two block writes of a ping-pong +completes before the other starts, then we don't need to fsync() at +all. + +The only thing we are protecting against is the possibility of both +writes being partial. If neither is done, that's fine because WAL will +protect us. If the first write is partial, we will detect that and use +the old data from the other, then recover from WAL. If the first is +complete but the second is partial, then we detect that and use the +newer block from the first write. If the second is complete but the +first is partial, we detect that and use the newer block from the second +write. + +So does anyone know a way to prevent parallel writes in one of the +common unix standards? Do they say anything about this? + +It would seem to me that if the same process does both ping-pong writes, +then there should be a cheap way to enforce a serial order. I could be +wrong though. + +As to where the first block of the ping-pong should go, maybe we could +reserve a file with nBlocks space for them, and write the information +about which block was being written to the XLOG for use in recovery. +There are many other ways to do it. + +;jrnield + +-- +J. R. Nield +jrnield@usol.com + + + + +---------------------------(end of broadcast)--------------------------- +TIP 6: Have you searched our list archives? + +http://archives.postgresql.org + +From jrnield@usol.com Sun Jun 23 08:37:30 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCbRF28741 + for ; Sun, 23 Jun 2002 08:37:28 -0400 (EDT) +Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111; + Sun, 23 Jun 2002 08:37:23 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Bruce Momjian +cc: Curt Sampson , Michael Loftis , + mlw + , + PostgreSQL Hacker , + Tom Lane +In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us> +References: <200206222317.g5MNHBn23427@candle.pha.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 08:37:53 -0400 +Message-ID: <1024835880.1793.264.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: OR + +On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: +> J. R. Nield wrote: +> > One other point: +> > +> > Page pre-image logging is fundamentally the same as what Jim Grey's +> > book[1] would call "careful writes". I don't believe they should be in +> > the XLOG, because we never need to keep the pre-images after we're sure +> > the buffer has made it to the disk. Instead, we should have the buffer +> > IO routines implement ping-pong writes of some kind if we want +> > protection from partial writes. +> +> Ping-pong writes to where? We have to fsync, and rather than fsync that +> area and WAL, we just do WAL. Not sure about a win there. +> + +The key question is: do we have some method to ensure that the OS +doesn't do the writes in parallel? + +If the OS will ensure that one of the two block writes of a ping-pong +completes before the other starts, then we don't need to fsync() at +all. + +The only thing we are protecting against is the possibility of both +writes being partial. If neither is done, that's fine because WAL will +protect us. If the first write is partial, we will detect that and use +the old data from the other, then recover from WAL. If the first is +complete but the second is partial, then we detect that and use the +newer block from the first write. If the second is complete but the +first is partial, we detect that and use the newer block from the second +write. + +So does anyone know a way to prevent parallel writes in one of the +common unix standards? Do they say anything about this? + +It would seem to me that if the same process does both ping-pong writes, +then there should be a cheap way to enforce a serial order. I could be +wrong though. + +As to where the first block of the ping-pong should go, maybe we could +reserve a file with nBlocks space for them, and write the information +about which block was being written to the XLOG for use in recovery. +There are many other ways to do it. + +;jrnield + +-- +J. R. Nield +jrnield@usol.com + + + + +From cjs@cynic.net Sun Jun 23 09:33:29 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NDXSF11543 + for ; Sun, 23 Jun 2002 09:33:28 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id A83ABF820; Sun, 23 Jun 2002 13:33:15 +0000 (UTC) +Date: Sun, 23 Jun 2002 22:33:07 +0900 (JST) +From: Curt Sampson +To: "J. R. Nield" +cc: Bruce Momjian , Michael Loftis , + mlw , + PostgreSQL Hacker , + Tom Lane +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024835880.1793.264.camel@localhost.localdomain> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On 23 Jun 2002, J. R. Nield wrote: + +> On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: +> > J. R. Nield wrote: +> > > One other point: +> > > +> > > Page pre-image logging is fundamentally the same as what Jim Grey's +> > > book[1] would call "careful writes". I don't believe they should be in +> > > the XLOG, because we never need to keep the pre-images after we're sure +> > > the buffer has made it to the disk. Instead, we should have the buffer +> > > IO routines implement ping-pong writes of some kind if we want +> > > protection from partial writes. +> > +> > Ping-pong writes to where? We have to fsync, and rather than fsync that +> > area and WAL, we just do WAL. Not sure about a win there. + +Presumably the win is that, "we never need to keep the pre-images +after we're sure the buffer has made it to the disk." So the +pre-image log can be completely ditched when we shut down the +server, so a full system sync, or whatever. This keeps the log file +size down, which means faster recovery, less to back up (when we +start getting transaction logs that can be backed up), etc. + +This should also allow us to disable completely the ping-pong writes +if we have a disk subsystem that we trust. (E.g., a disk array with +battery backed memory.) That would, in theory, produce a nice little +performance increase when lots of inserts and/or updates are being +committed, as we have much, much less to write to the log file. + +Are there stats that track, e.g., the bandwidth of writes to the +log file? I'd be interested in knowing just what kind of savings +one might see by doing this. + +> The key question is: do we have some method to ensure that the OS +> doesn't do the writes in parallel?... +> It would seem to me that if the same process does both ping-pong writes, +> then there should be a cheap way to enforce a serial order. I could be +> wrong though. + +Well, whether or not there's a cheap way depends on whether you consider +fsync to be cheap. :-) + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From pgsql-hackers-owner+M24073@postgresql.org Sun Jun 23 11:19:59 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NFJxF19785 + for ; Sun, 23 Jun 2002 11:19:59 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 0BD5B475E79; Sun, 23 Jun 2002 11:19:55 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 5C0CB475D6A; Sun, 23 Jun 2002 11:19:50 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id E2353475C4B + for ; Sun, 23 Jun 2002 11:19:47 -0400 (EDT) +Received: from sss.pgh.pa.us (unknown [192.204.191.242]) + by postgresql.org (Postfix) with ESMTP id 746F8475AEA + for ; Sun, 23 Jun 2002 11:19:46 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5NFJF108464; + Sun, 23 Jun 2002 11:19:15 -0400 (EDT) +To: Curt Sampson +cc: "J. R. Nield" , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: +References: +Comments: In-reply-to Curt Sampson + message dated "Sun, 23 Jun 2002 22:33:07 +0900" +Date: Sun, 23 Jun 2002 11:19:15 -0400 +Message-ID: <8461.1024845555@sss.pgh.pa.us> +From: Tom Lane +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +Status: OR + +Curt Sampson writes: +> This should also allow us to disable completely the ping-pong writes +> if we have a disk subsystem that we trust. + +If we have a disk subsystem we trust, we just disable fsync on the +WAL and the performance issue largely goes away. + +I concur with Bruce: the reason we keep page images in WAL is to +minimize the number of places we have to fsync, and thus the amount of +head movement required for a commit. Putting the page images elsewhere +cannot be a win AFAICS. + +> Well, whether or not there's a cheap way depends on whether you consider +> fsync to be cheap. :-) + +It's never cheap :-( + + regards, tom lane + +---------------------------(end of broadcast)--------------------------- +TIP 5: Have you checked our extensive FAQ? + +http://www.postgresql.org/users-lounge/docs/faq.html + +From cjs@cynic.net Sun Jun 23 12:10:44 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NGAgF22907 + for ; Sun, 23 Jun 2002 12:10:43 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 57BFDF820; Sun, 23 Jun 2002 16:10:35 +0000 (UTC) +Date: Mon, 24 Jun 2002 01:10:26 +0900 (JST) +From: Curt Sampson +To: Tom Lane +cc: "J. R. Nield" , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <8461.1024845555@sss.pgh.pa.us> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On Sun, 23 Jun 2002, Tom Lane wrote: + +> Curt Sampson writes: +> > This should also allow us to disable completely the ping-pong writes +> > if we have a disk subsystem that we trust. +> +> If we have a disk subsystem we trust, we just disable fsync on the +> WAL and the performance issue largely goes away. + +No, you can't do this. If you don't fsync(), there's no guarantee +that the write ever got out of the computer's buffer cache and to +the disk subsystem in the first place. + +> I concur with Bruce: the reason we keep page images in WAL is to +> minimize the number of places we have to fsync, and thus the amount of +> head movement required for a commit. + +An fsync() does not necessarially cause head movement, or any real +disk writes at all. If you're writing to many external disk arrays, +for example, the fsync() ensures that the data are in the disk array's +non-volatile or UPS-backed RAM, no more. The array might hold the data +for quite some time before it actually writes it to disk. + +But you're right that it's faster, if you're going to write out changed +pages and have have the ping-pong file and the transaction log on the +same disk, just to write out the entire page to the transaction log. + +So what we would really need to implement, if we wanted to be more +efficient with trusted disk subsystems, would be the option of writing +to the log only the changed row or changed part of the row, or writing +the entire changed page. I don't know how hard this would be.... + +> > Well, whether or not there's a cheap way depends on whether you consider +> > fsync to be cheap. :-) +> +> It's never cheap :-( + +Actually, with a good external RAID system with non-volatile RAM, +it's a good two to four orders of magnitude cheaper than writing to a +directly connected disk that doesn't claim the write is complete until +it's physically on disk. I'd say that it qualifies as at least "not +expensive." Not that you want to do it more often than you have to +anyway.... + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From jrnield@usol.com Sun Jun 23 13:56:59 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NHusF00335 + for ; Sun, 23 Jun 2002 13:56:58 -0400 (EDT) +Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NHunj18549; + Sun, 23 Jun 2002 13:56:49 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Tom Lane +cc: Curt Sampson , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker + +In-Reply-To: <8461.1024845555@sss.pgh.pa.us> +References: + <8461.1024845555@sss.pgh.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 13:57:19 -0400 +Message-ID: <1024855044.1793.414.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: ORr + +On Sun, 2002-06-23 at 11:19, Tom Lane wrote: +> Curt Sampson writes: +> > This should also allow us to disable completely the ping-pong writes +> > if we have a disk subsystem that we trust. +> +> If we have a disk subsystem we trust, we just disable fsync on the +> WAL and the performance issue largely goes away. + +It wouldn't work because the OS buffering interferes, and we need those +WAL records on disk up to the greatest LSN of the Buffer we will be writing. + + +We already buffer WAL ourselves. We also already buffer regular pages. +Whenever we write a Buffer out of the buffer cache, it is because we +really want that page on disk and wanted to start an IO. If thats not +the case, then we should have more block buffers! + +So since we have all this buffering designed especially to meet our +needs, and since the OS buffering is in the way, can someone explain to +me why postgresql would ever open a file without the O_DSYNC flag if the +platform supports it? + + + +> +> I concur with Bruce: the reason we keep page images in WAL is to +> minimize the number of places we have to fsync, and thus the amount of +> head movement required for a commit. Putting the page images elsewhere +> cannot be a win AFAICS. + + +Why not put all the page images in a single pre-allocated file and treat +it as a ring? How could this be any worse than flushing them in the WAL +log? + +Maybe fsync would be slower with two files, but I don't see how +fdatasync would be, and most platforms support that. + +What would improve performance would be to have a dbflush process that +would work in the background flushing buffers in groups and trying to +stay ahead of ReadBuffer requests. That would let you do the temporary +side of the ping-pong as a huge O_DSYNC writev(2) request (or +fdatasync() once) and then write out the other buffers. It would also +tend to prevent the other backends from blocking on write requests. + +A dbflush could also support aio_read/aio_write on platforms like +Solaris and WindowsNT that support it. + +Am I correct that right now, buffers only get written when they get +removed from the free list for reuse? So a released dirty buffer will +sit in the buffer free list until it becomes the Least Recently Used +buffer, and will then cause a backend to block for IO in a call to +BufferAlloc? + +This would explain why we like using the OS buffer cache, and why our +performance is troublesome when we have to do synchronous IO writes, and +why fsync() takes so long to complete. All of the backends block for +each call to BufferAlloc() after a large table update by a single +backend, and then the OS buffers are always full of our "written" data. + +Am I reading the bufmgr code correctly? I already found an imaginary +race condition there once :-) + +;jnield + + +> +> > Well, whether or not there's a cheap way depends on whether you consider +> > fsync to be cheap. :-) +> +> It's never cheap :-( +> +-- +J. R. Nield +jrnield@usol.com + + +From cjs@cynic.net Sun Jun 23 14:15:15 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIFEF01698 + for ; Sun, 23 Jun 2002 14:15:15 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 796E6F820; Sun, 23 Jun 2002 18:15:08 +0000 (UTC) +Date: Mon, 24 Jun 2002 03:15:01 +0900 (JST) +From: Curt Sampson +To: "J. R. Nield" +cc: Tom Lane , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: ORr + +On 23 Jun 2002, J. R. Nield wrote: + +> So since we have all this buffering designed especially to meet our +> needs, and since the OS buffering is in the way, can someone explain to +> me why postgresql would ever open a file without the O_DSYNC flag if the +> platform supports it? + +It's more code, if there are platforms out there that don't support +O_DYSNC. (We still have to keep the old fsync code.) On the other hand, +O_DSYNC could save us a disk arm movement over fsync() because it +appears to me that fsync is also going to force a metadata update, which +means that the inode blocks have to be written as well. + +> Maybe fsync would be slower with two files, but I don't see how +> fdatasync would be, and most platforms support that. + +Because, if both files are on the same disk, you still have to move +the disk arm from the cylinder at the current log file write point +to the cylinder at the current ping-pong file write point. And then back +again to the log file write point cylinder. + +In the end, having a ping-pong file as well seems to me unnecessary +complexity, especially when anyone interested in really good +performance is going to buy a disk subsystem that guarantees no +torn pages and thus will want to turn off the ping-pong file writes +entirely, anyway. + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From jrnield@usol.com Sun Jun 23 14:14:51 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIEnF01649 + for ; Sun, 23 Jun 2002 14:14:50 -0400 (EDT) +Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NIEkj19287; + Sun, 23 Jun 2002 14:14:46 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Curt Sampson +cc: Tom Lane , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker + +In-Reply-To: +References: +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 14:15:17 -0400 +Message-ID: <1024856120.3054.418.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: OR + +On Sun, 2002-06-23 at 12:10, Curt Sampson wrote: +> +> So what we would really need to implement, if we wanted to be more +> efficient with trusted disk subsystems, would be the option of writing +> to the log only the changed row or changed part of the row, or writing +> the entire changed page. I don't know how hard this would be.... +> +We already log that stuff. The page images are in addition to the +"Logical Changes", so we could just stop logging the page images. + +-- +J. R. Nield +jrnield@usol.com + + + + +From pgsql-hackers-owner+M24100@postgresql.org Mon Jun 24 13:13:41 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OHDeF08564 + for ; Mon, 24 Jun 2002 13:13:40 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 05602475CBE; Mon, 24 Jun 2002 13:11:10 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 13:11:10 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 929A247633B; Mon, 24 Jun 2002 09:26:54 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 962C147631A + for ; Mon, 24 Jun 2002 08:31:43 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:31:43 2002 +Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) + by postgresql.org (Postfix) with ESMTP id C112D475C3C + for ; Sun, 23 Jun 2002 15:35:20 -0400 (EDT) +Received: (from pgman@localhost) + by candle.pha.pa.us (8.11.6/8.10.1) id g5NJYtL07449; + Sun, 23 Jun 2002 15:34:55 -0400 (EDT) +From: Bruce Momjian +Message-ID: <200206231934.g5NJYtL07449@candle.pha.pa.us> +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain> +To: "J. R. Nield" +Date: Sun, 23 Jun 2002 15:34:55 -0400 (EDT) +cc: Tom Lane , Curt Sampson , + Michael Loftis , mlw , + PostgreSQL Hacker +X-Mailer: ELM [version 2.4ME+ PL97 (25)] +MIME-Version: 1.0 +Content-Transfer-Encoding: 7bit +Content-Type: text/plain; charset=US-ASCII +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-3.4 required=5.0 + tests=IN_REP_TO + version=2.30 +Status: OR + +J. R. Nield wrote: +> So since we have all this buffering designed especially to meet our +> needs, and since the OS buffering is in the way, can someone explain to +> me why postgresql would ever open a file without the O_DSYNC flag if the +> platform supports it? + +We sync only WAL, not the other pages, except for the sync() call we do +during checkpoint when we discard old WAL files. + +> > I concur with Bruce: the reason we keep page images in WAL is to +> > minimize the number of places we have to fsync, and thus the amount of +> > head movement required for a commit. Putting the page images elsewhere +> > cannot be a win AFAICS. +> +> +> Why not put all the page images in a single pre-allocated file and treat +> it as a ring? How could this be any worse than flushing them in the WAL +> log? +> +> Maybe fsync would be slower with two files, but I don't see how +> fdatasync would be, and most platforms support that. + +We have fdatasync option for WAL in postgresql.conf. + +-- + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 + + + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster + + + +From pgsql-hackers-owner+M24091@postgresql.org Mon Jun 24 12:54:22 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGsMF07208 + for ; Mon, 24 Jun 2002 12:54:22 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 7DB7947679D; Mon, 24 Jun 2002 09:48:51 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 09:48:51 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 3FD37476491; Mon, 24 Jun 2002 08:55:34 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 2769E4762E3 + for ; Mon, 24 Jun 2002 08:27:39 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:27:39 2002 +Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) + by postgresql.org (Postfix) with ESMTP id ED459475C61 + for ; Sun, 23 Jun 2002 15:37:08 -0400 (EDT) +Received: (from pgman@localhost) + by candle.pha.pa.us (8.11.6/8.10.1) id g5NJasa07642; + Sun, 23 Jun 2002 15:36:54 -0400 (EDT) +From: Bruce Momjian +Message-ID: <200206231936.g5NJasa07642@candle.pha.pa.us> +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: +To: Curt Sampson +Date: Sun, 23 Jun 2002 15:36:54 -0400 (EDT) +cc: "J. R. Nield" , Tom Lane , + Michael Loftis , mlw , + PostgreSQL Hacker +X-Mailer: ELM [version 2.4ME+ PL97 (25)] +MIME-Version: 1.0 +Content-Transfer-Encoding: 7bit +Content-Type: text/plain; charset=US-ASCII +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-3.4 required=5.0 + tests=IN_REP_TO + version=2.30 +Status: OR + +Curt Sampson wrote: +> On 23 Jun 2002, J. R. Nield wrote: +> +> > So since we have all this buffering designed especially to meet our +> > needs, and since the OS buffering is in the way, can someone explain to +> > me why postgresql would ever open a file without the O_DSYNC flag if the +> > platform supports it? +> +> It's more code, if there are platforms out there that don't support +> O_DYSNC. (We still have to keep the old fsync code.) On the other hand, +> O_DSYNC could save us a disk arm movement over fsync() because it +> appears to me that fsync is also going to force a metadata update, which +> means that the inode blocks have to be written as well. + +Again, see postgresql.conf: + +#wal_sync_method = fsync # the default varies across platforms: +# # fsync, fdatasync, open_sync, or open_datasync + +> +> > Maybe fsync would be slower with two files, but I don't see how +> > fdatasync would be, and most platforms support that. +> +> Because, if both files are on the same disk, you still have to move +> the disk arm from the cylinder at the current log file write point +> to the cylinder at the current ping-pong file write point. And then back +> again to the log file write point cylinder. +> +> In the end, having a ping-pong file as well seems to me unnecessary +> complexity, especially when anyone interested in really good +> performance is going to buy a disk subsystem that guarantees no +> torn pages and thus will want to turn off the ping-pong file writes +> entirely, anyway. + +Yes, I don't see writing to two files vs. one to be any win, especially +when we need to fsync both of them. What I would really like is to +avoid the double I/O of writing to WAL and to the data file; improving +that would be a huge win. + +-- + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 + + + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster + + + +From cjs@cynic.net Sun Jun 23 20:09:44 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O09hF00630 + for ; Sun, 23 Jun 2002 20:09:43 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 6F45AF820; Mon, 24 Jun 2002 00:09:38 +0000 (UTC) +Date: Mon, 24 Jun 2002 09:09:30 +0900 (JST) +From: Curt Sampson +To: Bruce Momjian +cc: "J. R. Nield" , Tom Lane , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On Sun, 23 Jun 2002, Bruce Momjian wrote: + +> Yes, I don't see writing to two files vs. one to be any win, especially +> when we need to fsync both of them. What I would really like is to +> avoid the double I/O of writing to WAL and to the data file; improving +> that would be a huge win. + +You mean, the double I/O of writing the block to the WAL and data file? +(We'd still have to write the changed columns or whatever to the WAL, +right?) + +I'd just add an option to turn it off. If you need it, you need it; +there's no way around that except to buy hardware that is really going +to guarantee your writes (which then means you don't need it). + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From jrnield@usol.com Sun Jun 23 21:28:58 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O1SuF06381 + for ; Sun, 23 Jun 2002 21:28:57 -0400 (EDT) +Received: from 01-072.024.popsite.net (01-072.024.popsite.net [216.126.160.72]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5O1Ssj09303; + Sun, 23 Jun 2002 21:28:55 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Bruce Momjian +cc: Curt Sampson , Tom Lane , + Michael Loftis + , mlw , + PostgreSQL Hacker + +In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us> +References: <200206231936.g5NJasa07642@candle.pha.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 21:29:23 -0400 +Message-ID: <1024882167.1793.733.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: ORr + +On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: +> Yes, I don't see writing to two files vs. one to be any win, especially +> when we need to fsync both of them. What I would really like is to +> avoid the double I/O of writing to WAL and to the data file; improving +> that would be a huge win. +> + +If is impossible to do what you want. You can not protect against +partial writes without writing pages twice and calling fdatasync between +them while going through a generic filesystem. The best disk array will +not protect you if the operating system does not align block writes to +the structure of the underlying device. Even with raw devices, you need +special support or knowledge of the operating system and/or the disk +device to ensure that each write request will be atomic to the +underlying hardware. + +All other systems rely on the fact that you can recover a damaged file +using the log archive. This means downtime in the rare case, but no data +loss. Until PostgreSQL can do this, then it will not be acceptable for +real critical production use. This is not to knock PostgreSQL, because +it is a very good database system, and clearly the best open-source one. +It even has feature advantages over the commercial systems. But at the +end of the day, unless you have complete understanding of the I/O system +from write(2) through to the disk system, the only sure ways to protect +against partial writes are by "careful writes" (in the WAL log or +elsewhere, writing pages twice), or by requiring (and allowing) users to +do log-replay recovery when a file is corrupted by a partial write. As +long as there is a UPS, and the operating system doesn't crash, then +there still should be no partial writes. + +If we log pages to WAL, they are useless when archived (after a +checkpoint). So either we have a separate "log" for them (the ping-pong +file), or we should at least remove them when archived, which makes log +archiving more complex but is perfectly doable. + +Finally, I would love to hear why we are using the operating system +buffer manager at all. The OS is acting as a secondary buffer manager +for us. Why is that? What flaw in our I/O system does this reveal? I +know that: + +>We sync only WAL, not the other pages, except for the sync() call we do +> during checkpoint when we discard old WAL files. + +But this is probably not a good thing. We should only be writing blocks +when they need to be on disk. We should not be expecting the OS to write +them "sometime later" and avoid blocking (as long) for the write. If we +need that, then our buffer management is wrong and we need to fix it. +The reason we are doing this is because we expect the OS buffer manager +to do asynchronous I/O for us, but then we don't control the order. That +is the reason why we have to call fdatasync(), to create "sequence +points". + +The reason we have performance problems with either D_OSYNC or fdatasync +on the normal relations is because we have no dbflush process. This +causes an unacceptable amount of I/O blocking by other transactions. + +The ORACLE people were not kidding when they said that they could not +certify Linux for production use until it supported O_DSYNC. Can you +explain why that was the case? + +Finally, let me apologize if the above comes across as somewhat +belligerent. I know very well that I can't compete with you guys for +knowledge of the PosgreSQL system. I am still at a loss when I look at +the optimizer and executor modules, and it will take some time before I +can follow discussion of that area. Even then, I doubt my ability to +compare with people like Mr. Lane and Mr. Momjian in experience and +general intelligence, or in the field of database programming and +software development in particular. However, this discussion and a +search of the pgsql-hackers archives reveals this problem to be the KEY +area of PostgreSQL's failing, and general misunderstanding, when +compared to its commercial competitors. + +Sincerely, + + J. R. Nield + +-- +J. R. Nield +jrnield@usol.com + + + + +From pgsql-hackers-owner+M24090@postgresql.org Mon Jun 24 12:38:04 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGc3F05962 + for ; Mon, 24 Jun 2002 12:38:03 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 81B9F4768DF; Mon, 24 Jun 2002 10:18:05 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 10:18:05 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 81F08476473; Mon, 24 Jun 2002 08:55:28 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id CDDFA475CC3 + for ; Mon, 24 Jun 2002 08:37:44 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:37:44 2002 +Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) + by postgresql.org (Postfix) with ESMTP id 5C971475858 + for ; Sun, 23 Jun 2002 22:47:12 -0400 (EDT) +Received: (from pgman@localhost) + by candle.pha.pa.us (8.11.6/8.10.1) id g5O2ki712992; + Sun, 23 Jun 2002 22:46:44 -0400 (EDT) +From: Bruce Momjian +Message-ID: <200206240246.g5O2ki712992@candle.pha.pa.us> +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain> +To: "J. R. Nield" +Date: Sun, 23 Jun 2002 22:46:44 -0400 (EDT) +cc: Curt Sampson , Tom Lane , + Michael Loftis , mlw , + PostgreSQL Hacker +X-Mailer: ELM [version 2.4ME+ PL97 (25)] +MIME-Version: 1.0 +Content-Transfer-Encoding: 7bit +Content-Type: text/plain; charset=US-ASCII +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-3.4 required=5.0 + tests=IN_REP_TO + version=2.30 +Status: OR + +J. R. Nield wrote: +> On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: +> > Yes, I don't see writing to two files vs. one to be any win, especially +> > when we need to fsync both of them. What I would really like is to +> > avoid the double I/O of writing to WAL and to the data file; improving +> > that would be a huge win. +> > +> +> If is impossible to do what you want. You can not protect against +> partial writes without writing pages twice and calling fdatasync between +> them while going through a generic filesystem. The best disk array will +> not protect you if the operating system does not align block writes to +> the structure of the underlying device. Even with raw devices, you need +> special support or knowledge of the operating system and/or the disk +> device to ensure that each write request will be atomic to the +> underlying hardware. + +Yes, I suspected it was impossible, but that doesn't mean I want it any +less. ;-) + +> All other systems rely on the fact that you can recover a damaged file +> using the log archive. This means downtime in the rare case, but no data +> loss. Until PostgreSQL can do this, then it will not be acceptable for +> real critical production use. This is not to knock PostgreSQL, because +> it is a very good database system, and clearly the best open-source one. +> It even has feature advantages over the commercial systems. But at the +> end of the day, unless you have complete understanding of the I/O system +> from write(2) through to the disk system, the only sure ways to protect +> against partial writes are by "careful writes" (in the WAL log or +> elsewhere, writing pages twice), or by requiring (and allowing) users to +> do log-replay recovery when a file is corrupted by a partial write. As +> long as there is a UPS, and the operating system doesn't crash, then +> there still should be no partial writes. + +You are talking point-in-time recovery, a major missing feature right +next to replication, and I agree it makes PostgreSQL unacceptable for +some applications. Point taken. + +And the interesting thing you are saying is that with point-in-time +recovery, we don't need to write pre-write images of pages because if we +detect a partial page write, we then abort the database and tell the +user to do a point-in-time recovery, basically meaning we are using the +previous full backup as our pre-write page image and roll forward using +the logical logs. This is clearly a nice thing to be able to do because +it let's you take a pre-write image of the page once during full backup, +keep it offline, and bring it back in the rare case of a full page write +failure. I now can see how the MSSQL tearoff-bits would be used, not +for recovery, but to detect a partial write and force a point-in-time +recovery from the administrator. + + +> If we log pages to WAL, they are useless when archived (after a +> checkpoint). So either we have a separate "log" for them (the ping-pong +> file), or we should at least remove them when archived, which makes log +> archiving more complex but is perfectly doable. + +Yes, that is how we will do point-in-time recovery; remove the +pre-write page images and archive the rest. It is more complex, but +having the fsync all in one file is too big a win. + +> Finally, I would love to hear why we are using the operating system +> buffer manager at all. The OS is acting as a secondary buffer manager +> for us. Why is that? What flaw in our I/O system does this reveal? I +> know that: +> +> >We sync only WAL, not the other pages, except for the sync() call we do +> > during checkpoint when we discard old WAL files. +> +> But this is probably not a good thing. We should only be writing blocks +> when they need to be on disk. We should not be expecting the OS to write +> them "sometime later" and avoid blocking (as long) for the write. If we +> need that, then our buffer management is wrong and we need to fix it. +> The reason we are doing this is because we expect the OS buffer manager +> to do asynchronous I/O for us, but then we don't control the order. That +> is the reason why we have to call fdatasync(), to create "sequence +> points". + +Yes. I think I understand. It is true we have to fsync WAL because we +can't control the individual writes by the OS. + +> The reason we have performance problems with either D_OSYNC or fdatasync +> on the normal relations is because we have no dbflush process. This +> causes an unacceptable amount of I/O blocking by other transactions. + +Uh, that would force writes all over the disk. Why do we really care how +the OS writes them? If we are going to fsync, let's just do the one +file and be done with it. What would a separate flusher process really +buy us if it has to use fsync too. The main backend doesn't have to +wait for the fsync, but then again, we can't say the transaction is +committed until it hits the disk, so how does a flusher help? + +> The ORACLE people were not kidding when they said that they could not +> certify Linux for production use until it supported O_DSYNC. Can you +> explain why that was the case? + +I don't see O_DSYNC as very different from write/fsync(or fdatasync). + +> Finally, let me apologize if the above comes across as somewhat +> belligerent. I know very well that I can't compete with you guys for +> knowledge of the PostgreSQL system. I am still at a loss when I look at +> the optimizer and executor modules, and it will take some time before I +> can follow discussion of that area. Even then, I doubt my ability to +> compare with people like Mr. Lane and Mr. Momjian in experience and +> general intelligence, or in the field of database programming and +> software development in particular. However, this discussion and a +> search of the pgsql-hackers archives reveals this problem to be the KEY +> area of PostgreSQL's failing, and general misunderstanding, when +> compared to its commercial competitors. + +We appreciate your ideas. Few of us are professional db folks so we are +always looking for good ideas. + +-- + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 + + + +---------------------------(end of broadcast)--------------------------- +TIP 6: Have you searched our list archives? + +http://archives.postgresql.org + + + +From cjs@cynic.net Sun Jun 23 23:40:59 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O3evF17903 + for ; Sun, 23 Jun 2002 23:40:58 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 37F36F820; Mon, 24 Jun 2002 03:40:54 +0000 (UTC) +Date: Mon, 24 Jun 2002 12:40:51 +0900 (JST) +From: Curt Sampson +To: "J. R. Nield" +cc: Bruce Momjian , Tom Lane , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On 23 Jun 2002, J. R. Nield wrote: + +> If is impossible to do what you want. You can not protect against +> partial writes without writing pages twice and calling fdatasync +> between them while going through a generic filesystem. + +I agree with this. + +> The best disk array will not protect you if the operating system does +> not align block writes to the structure of the underlying device. + +This I don't quite understand. Assuming you're using a SCSI drive +(and this mostly applies to ATAPI/IDE, too), you can do naught but +align block writes to the structure of the underlying device. When you +initiate a SCSI WRITE command, you start by telling the device at which +block to start writing and how many blocks you intend to write. Then you +start passing the data. + +(See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter +details for the SCSI WRITE(10) command. You may find the SCSI 2 +specification, at http://www.danbbs.dk/~dino/SCSI/ to be a useful +reference here.) + +> Even with raw devices, you need special support or knowledge of the +> operating system and/or the disk device to ensure that each write +> request will be atomic to the underlying hardware. + +Well, so here I guess you're talking about two things: + + 1. When you request, say, an 8K block write, will the OS really + write it to disk in a single 8K or multiple of 8K SCSI write + command? + + 2. Does the SCSI device you're writing to consider these writes to + be transactional. That is, if the write is interrupted before being + completed, does the SCSI device guarantee that the partially-sent + data is not written, and the old data is maintained? And of course, + does it guarantee that, when it acknowledges a write, that write is + now in stable storage and will never go away? + +Both of these are not hard to guarantee, actually. For a BSD-based OS, +for example, just make sure that your filesystem block size is the +same as or a multiple of the database block size. BSD will never write +anything other than a block or a sequence of blocks to a disk in a +single SCSI transaction (unless you've got a really odd SCSI driver). +And for your disk, buy a Baydel or Clarion disk array, or something +similar. + +Given that it's not hard to set up a system that meets these criteria, +and this is in fact commonly done for database servers, it would seem a +good idea for postgres to have the option to take advantage of the time +and money spent and adjust its performance upward appropriately. + +> All other systems rely on the fact that you can recover a damaged file +> using the log archive. + +Not exactly. For MS SQL Server, at any rate, if it detects a page tear +you cannot restore based on the log file alone. You need a full or +partial backup that includes that entire torn block. + +> This means downtime in the rare case, but no data loss. Until +> PostgreSQL can do this, then it will not be acceptable for real +> critical production use. + +It seems to me that it is doing this right now. In fact, it's more +reliable than some commerial systems (such as SQL Server) because it can +recover from a torn block with just the logfile. + +> But at the end of the day, unless you have complete understanding of +> the I/O system from write(2) through to the disk system, the only sure +> ways to protect against partial writes are by "careful writes" (in +> the WAL log or elsewhere, writing pages twice), or by requiring (and +> allowing) users to do log-replay recovery when a file is corrupted by +> a partial write. + +I don't understand how, without a copy of the old data that was in the +torn block, you can restore that block from just log file entries. Can +you explain this to me? Take, as an example, a block with ten tuples, +only one of which has been changed "recently." (I.e., only that change +is in the log files.) + +> If we log pages to WAL, they are useless when archived (after a +> checkpoint). So either we have a separate "log" for them (the +> ping-pong file), or we should at least remove them when archived, +> which makes log archiving more complex but is perfectly doable. + +Right. That seems to me a better option, since we've now got only one +write point on the disk rather than two. + +> Finally, I would love to hear why we are using the operating system +> buffer manager at all. The OS is acting as a secondary buffer manager +> for us. Why is that? What flaw in our I/O system does this reveal? + +It's acting as a "second-level" buffer manager, yes, but to say it's +"secondary" may be a bit misleading. On most of the systems I've set +up, the OS buffer cache is doing the vast majority of the work, and the +postgres buffering is fairly minimal. + +There are some good (and some perhaps not-so-good) reasons to do it this +way. I'll list them more or less in the order of best to worst: + + 1. The OS knows where the blocks physically reside on disk, and + postgres does not. Therefore it's in the interest of postgresql to + dispatch write responsibility back to the OS as quickly as possible + so that the OS can prioritize requests appropriately. Most operating + systems use an "elevator" algorithm to minimize disk head movement; + but if the OS does not have a block that it could write while the + head is "on the way" to another request, it can't write it in that + head pass. + + 2. Postgres does not know about any "bank-switching" tricks for + mapping more physical memory than it has address space. Thus, on + 32-bit machines, postgres might be limited to mapping 2 or 3 GB of + memory, even though the machine has, say, 6 GB of physical RAM. The + OS can use all of the available memory for caching; postgres cannot. + + 3. A lot of work has been put into the seek algorithms, read-ahead + algorithms, block allocation algorithms, etc. in the OS. Why + duplicate all that work again in postgres? + +When you say things like the following: + +> We should only be writing blocks when they need to be on disk. We +> should not be expecting the OS to write them "sometime later" and +> avoid blocking (as long) for the write. If we need that, then our +> buffer management is wrong and we need to fix it. + +you appear to be making the arugment that we should take the route of +other database systems, and use raw devices and our own management of +disk block allocation. If so, you might want first to look back through +the archives at the discussion I and several others had about this a +month or two ago. After looking in detail at what NetBSD, at least, does +in terms of its disk I/O algorithms and buffering, I've pretty much come +around, at least for the moment, to the attitude that we should stick +with using the OS. I wouldn't mind seeing postgres be able to manage all +of this stuff, but it's a *lot* of work for not all that much benefit +that I can see. + +> The ORACLE people were not kidding when they said that they could not +> certify Linux for production use until it supported O_DSYNC. Can you +> explain why that was the case? + +I'm suspecting it's because Linux at the time had no raw devices, so +O_DSYNC was the only other possible method of making sure that disk +writes actually got to disk. + +You certainly don't want to use O_DSYNC if you can use another method, +because O_DSYNC still goes through the the operating system's buffer +cache, wasting memory and double-caching things. If you're doing your +own management, you need either to use a raw device or open files with +the flag that indicates that the buffer cache should not be used at all +for reads from and writes to that file. + +> However, this discussion and a search of the pgsql-hackers archives +> reveals this problem to be the KEY area of PostgreSQL's failing, and +> general misunderstanding, when compared to its commercial competitors. + +No, I think it's just that you're under a few minor misapprehensions +here about what postgres and the OS are actually doing. As I said, I +went through this whole exact argument a month or two ago, on this very +list, and I came around to the idea that what postgres is doing now +works quite well, at least on NetBSD. (Most other OSes have disk I/O +algorithms that are pretty much as good or better.) There might be a +very slight advantage to doing all one's own I/O management, but it's +a huge amount of work, and I think that much effort could be much more +usefully applied to other areas. + +Just as a side note, I've been a NetBSD developer since about '96, +and have been delving into the details of OS design since well before +that time, so I'm coming to this with what I hope is reasonably good +knowledge of how disks work and how operating systems use them. (Not +that this should stop you from pointing out holes in my arguments. :-)) + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From pgsql-hackers-owner+M24112@postgresql.org Mon Jun 24 18:16:36 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OMGaF00910 + for ; Mon, 24 Jun 2002 18:16:36 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id A2EF1476475; Mon, 24 Jun 2002 16:43:38 -0400 (EDT) +Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 16:43:38 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id BA57D476148; Mon, 24 Jun 2002 14:14:00 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 93D6A477214 + for ; Mon, 24 Jun 2002 13:59:17 -0400 (EDT) +Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 13:59:17 2002 +Received: from sss.pgh.pa.us (unknown [192.204.191.242]) + by postgresql.org (Postfix) with ESMTP id D70AA476401 + for ; Mon, 24 Jun 2002 10:06:26 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OE6J117666; + Mon, 24 Jun 2002 10:06:19 -0400 (EDT) +To: Curt Sampson +cc: Bruce Momjian , "J. R. Nield" , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: +References: +Comments: In-reply-to Curt Sampson + message dated "Mon, 24 Jun 2002 09:09:30 +0900" +Date: Mon, 24 Jun 2002 10:06:19 -0400 +Message-ID: <17663.1024927579@sss.pgh.pa.us> +From: Tom Lane +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-5.3 required=5.0 + tests=IN_REP_TO,X_NOT_PRESENT + version=2.30 +Status: OR + +> On Sun, 23 Jun 2002, Bruce Momjian wrote: +>> Yes, I don't see writing to two files vs. one to be any win, especially +>> when we need to fsync both of them. What I would really like is to +>> avoid the double I/O of writing to WAL and to the data file; improving +>> that would be a huge win. + +I don't believe it's possible to eliminate the double I/O. Keep in mind +though that in the ideal case (plenty of shared buffers) you are only +paying two writes per modified block per checkpoint interval --- one to +the WAL during the first write of the interval, and then a write to the +real datafile issued by the checkpoint process. Anything that requires +transaction commits to write data blocks will likely result in more I/O +not less, at least for blocks that are modified by several successive +transactions. + +The only thing I've been able to think of that seems like it might +improve matters is to make the WAL writing logic aware of the layout +of buffer pages --- specifically, to know that our pages generally +contain an uninteresting "hole" in the middle, and not write the hole. +Optimistically this might reduce the WAL data volume by something +approaching 50%; though pessimistically (if most pages are near full) +it wouldn't help much. + +This was not very feasible when the WAL code was designed because the +buffer manager needed to cope with both normal pages and pg_log pages, +but as of 7.2 I think it'd be safe to assume that all pages have the +standard layout. + + regards, tom lane + + + +---------------------------(end of broadcast)--------------------------- +TIP 2: you can get off all lists at once with the unregister command + (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) + + + +From pgsql-hackers-owner+M24116@postgresql.org Mon Jun 24 20:32:07 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P0W7F10985 + for ; Mon, 24 Jun 2002 20:32:07 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id EBCE547632E; Mon, 24 Jun 2002 18:54:34 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 18:54:34 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 3EB93476D85; Mon, 24 Jun 2002 17:12:18 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id EBC20476E2E + for ; Mon, 24 Jun 2002 14:54:40 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 14:54:40 2002 +Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) + by postgresql.org (Postfix) with ESMTP id 1C8874760C2 + for ; Mon, 24 Jun 2002 12:40:53 -0400 (EDT) +Received: (from pgman@localhost) + by candle.pha.pa.us (8.11.6/8.10.1) id g5OGeVY06116; + Mon, 24 Jun 2002 12:40:31 -0400 (EDT) +From: Bruce Momjian +Message-ID: <200206241640.g5OGeVY06116@candle.pha.pa.us> +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <17663.1024927579@sss.pgh.pa.us> +To: Tom Lane +Date: Mon, 24 Jun 2002 12:40:31 -0400 (EDT) +cc: Curt Sampson , "J. R. Nield" , + Michael Loftis , mlw , + PostgreSQL Hacker +X-Mailer: ELM [version 2.4ME+ PL97 (25)] +MIME-Version: 1.0 +Content-Transfer-Encoding: 7bit +Content-Type: text/plain; charset=US-ASCII +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-3.4 required=5.0 + tests=IN_REP_TO + version=2.30 +Status: OR + +Tom Lane wrote: +> > On Sun, 23 Jun 2002, Bruce Momjian wrote: +> >> Yes, I don't see writing to two files vs. one to be any win, especially +> >> when we need to fsync both of them. What I would really like is to +> >> avoid the double I/O of writing to WAL and to the data file; improving +> >> that would be a huge win. +> +> I don't believe it's possible to eliminate the double I/O. Keep in mind +> though that in the ideal case (plenty of shared buffers) you are only +> paying two writes per modified block per checkpoint interval --- one to +> the WAL during the first write of the interval, and then a write to the +> real datafile issued by the checkpoint process. Anything that requires +> transaction commits to write data blocks will likely result in more I/O +> not less, at least for blocks that are modified by several successive +> transactions. +> +> The only thing I've been able to think of that seems like it might +> improve matters is to make the WAL writing logic aware of the layout +> of buffer pages --- specifically, to know that our pages generally +> contain an uninteresting "hole" in the middle, and not write the hole. +> Optimistically this might reduce the WAL data volume by something +> approaching 50%; though pessimistically (if most pages are near full) +> it wouldn't help much. + +Good idea. How about putting the page through or TOAST compression +routine before writing it to WAL? Should be pretty easy and fast and +doesn't require any knowledge of the page format. + +-- + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 + + + +---------------------------(end of broadcast)--------------------------- +TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org + + + +From pgsql-hackers-owner+M24114@postgresql.org Mon Jun 24 17:54:35 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLsZF28642 + for ; Mon, 24 Jun 2002 17:54:35 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id BD68F47683C; Mon, 24 Jun 2002 16:46:24 -0400 (EDT) +Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 16:46:24 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id B2719476B31; Mon, 24 Jun 2002 16:01:51 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 950004770BC + for ; Mon, 24 Jun 2002 14:59:46 -0400 (EDT) +Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 14:59:46 2002 +Received: from sss.pgh.pa.us (unknown [192.204.191.242]) + by postgresql.org (Postfix) with ESMTP id A0756475BB7 + for ; Mon, 24 Jun 2002 13:11:41 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OHB1119826; + Mon, 24 Jun 2002 13:11:02 -0400 (EDT) +To: Bruce Momjian +cc: Curt Sampson , "J. R. Nield" , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <200206241640.g5OGeVY06116@candle.pha.pa.us> +References: <200206241640.g5OGeVY06116@candle.pha.pa.us> +Comments: In-reply-to Bruce Momjian + message dated "Mon, 24 Jun 2002 12:40:31 -0400" +Date: Mon, 24 Jun 2002 13:11:01 -0400 +Message-ID: <19823.1024938661@sss.pgh.pa.us> +From: Tom Lane +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-5.3 required=5.0 + tests=IN_REP_TO,X_NOT_PRESENT + version=2.30 +Status: OR + +Bruce Momjian writes: +>> The only thing I've been able to think of that seems like it might +>> improve matters is to make the WAL writing logic aware of the layout +>> of buffer pages --- specifically, to know that our pages generally +>> contain an uninteresting "hole" in the middle, and not write the hole. +>> Optimistically this might reduce the WAL data volume by something +>> approaching 50%; though pessimistically (if most pages are near full) +>> it wouldn't help much. + +> Good idea. How about putting the page through or TOAST compression +> routine before writing it to WAL? Should be pretty easy and fast and +> doesn't require any knowledge of the page format. + +Easy, maybe, but fast definitely NOT. The compressor is not speedy. +Given that we have to be holding various locks while we build WAL +records, I do not think it's a good idea to add CPU time there. + +Also, compressing already-compressed data is not a win ... + + regards, tom lane + + + +---------------------------(end of broadcast)--------------------------- +TIP 3: if posting/reading through Usenet, please send an appropriate +subscribe-nomail command to majordomo@postgresql.org so that your +message can get through to the mailing list cleanly + + + +From jrnield@usol.com Mon Jun 24 16:49:25 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OKnNF23393 + for ; Mon, 24 Jun 2002 16:49:24 -0400 (EDT) +Received: from 08-113.024.popsite.net (08-113.024.popsite.net [66.19.4.113]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5OKnHV19100; + Mon, 24 Jun 2002 16:49:18 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Curt Sampson +cc: Bruce Momjian , Tom Lane , + PostgreSQL Hacker +In-Reply-To: +References: +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 24 Jun 2002 16:49:42 -0400 +Message-ID: <1024951786.1793.865.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: ORr + +On Sun, 2002-06-23 at 23:40, Curt Sampson wrote: +> On 23 Jun 2002, J. R. Nield wrote: +> +> > If is impossible to do what you want. You can not protect against +> > partial writes without writing pages twice and calling fdatasync +> > between them while going through a generic filesystem. +> +> I agree with this. +> +> > The best disk array will not protect you if the operating system does +> > not align block writes to the structure of the underlying device. +> +> This I don't quite understand. Assuming you're using a SCSI drive +> (and this mostly applies to ATAPI/IDE, too), you can do naught but +> align block writes to the structure of the underlying device. When you +> initiate a SCSI WRITE command, you start by telling the device at which +> block to start writing and how many blocks you intend to write. Then you +> start passing the data. +> + +All I'm saying is that the entire postgresql block write must be +converted into exactly one SCSI write command in all cases, and I don't +know a portable way to ensure this. + +> > Even with raw devices, you need special support or knowledge of the +> > operating system and/or the disk device to ensure that each write +> > request will be atomic to the underlying hardware. +> +> Well, so here I guess you're talking about two things: +> +> 1. When you request, say, an 8K block write, will the OS really +> write it to disk in a single 8K or multiple of 8K SCSI write +> command? +> +> 2. Does the SCSI device you're writing to consider these writes to +> be transactional. That is, if the write is interrupted before being +> completed, does the SCSI device guarantee that the partially-sent +> data is not written, and the old data is maintained? And of course, +> does it guarantee that, when it acknowledges a write, that write is +> now in stable storage and will never go away? +> +> Both of these are not hard to guarantee, actually. For a BSD-based OS, +> for example, just make sure that your filesystem block size is the +> same as or a multiple of the database block size. BSD will never write +> anything other than a block or a sequence of blocks to a disk in a +> single SCSI transaction (unless you've got a really odd SCSI driver). +> And for your disk, buy a Baydel or Clarion disk array, or something +> similar. +> +> Given that it's not hard to set up a system that meets these criteria, +> and this is in fact commonly done for database servers, it would seem a +> good idea for postgres to have the option to take advantage of the time +> and money spent and adjust its performance upward appropriately. + +I agree with this. My point was only that you need to know what +guarantees your operating system/hardware combination provides on a +case-by-case basis, and there is no standard way for a program to +discover this. Most system administrators are not going to know this +either, unless databases are their main responsibility. + +> +> > All other systems rely on the fact that you can recover a damaged file +> > using the log archive. +> +> Not exactly. For MS SQL Server, at any rate, if it detects a page tear +> you cannot restore based on the log file alone. You need a full or +> partial backup that includes that entire torn block. +> + +I should have been more specific: you need a backup of the file from +some time ago, plus all the archived logs from then until the current +log sequence number. + +> > This means downtime in the rare case, but no data loss. Until +> > PostgreSQL can do this, then it will not be acceptable for real +> > critical production use. +> +> It seems to me that it is doing this right now. In fact, it's more +> reliable than some commerial systems (such as SQL Server) because it can +> recover from a torn block with just the logfile. + +Again, what I meant to say is that the commercial systems can recover +with an old file backup + logs. How old the backup can be depends only +on how much time you are willing to spend playing the logs forward. So +if you do a full backup once a week, and multiplex and backup the logs, +then even if a backup tape gets destroyed you can still survive. It just +takes longer. + +Also, postgreSQL can't recover from any other type of block corruption, +while the commercial systems can. That's what I meant by the "critical +production use" comment, which was sort-of unfair. + +So I would say they are equally reliable for torn pages (but not bad +blocks), and the commercial systems let you trade potential recovery +time for not having to write the blocks twice. You do need to back-up +the log archives though. + +> +> > But at the end of the day, unless you have complete understanding of +> > the I/O system from write(2) through to the disk system, the only sure +> > ways to protect against partial writes are by "careful writes" (in +> > the WAL log or elsewhere, writing pages twice), or by requiring (and +> > allowing) users to do log-replay recovery when a file is corrupted by +> > a partial write. +> +> I don't understand how, without a copy of the old data that was in the +> torn block, you can restore that block from just log file entries. Can +> you explain this to me? Take, as an example, a block with ten tuples, +> only one of which has been changed "recently." (I.e., only that change +> is in the log files.) +> +> +> > If we log pages to WAL, they are useless when archived (after a +> > checkpoint). So either we have a separate "log" for them (the +> > ping-pong file), or we should at least remove them when archived, +> > which makes log archiving more complex but is perfectly doable. +> +> Right. That seems to me a better option, since we've now got only one +> write point on the disk rather than two. + +OK. I agree with this now. + +> +> > Finally, I would love to hear why we are using the operating system +> > buffer manager at all. The OS is acting as a secondary buffer manager +> > for us. Why is that? What flaw in our I/O system does this reveal? +> +> It's acting as a "second-level" buffer manager, yes, but to say it's +> "secondary" may be a bit misleading. On most of the systems I've set +> up, the OS buffer cache is doing the vast majority of the work, and the +> postgres buffering is fairly minimal. +> +> There are some good (and some perhaps not-so-good) reasons to do it this +> way. I'll list them more or less in the order of best to worst: +> +> 1. The OS knows where the blocks physically reside on disk, and +> postgres does not. Therefore it's in the interest of postgresql to +> dispatch write responsibility back to the OS as quickly as possible +> so that the OS can prioritize requests appropriately. Most operating +> systems use an "elevator" algorithm to minimize disk head movement; +> but if the OS does not have a block that it could write while the +> head is "on the way" to another request, it can't write it in that +> head pass. +> +> 2. Postgres does not know about any "bank-switching" tricks for +> mapping more physical memory than it has address space. Thus, on +> 32-bit machines, postgres might be limited to mapping 2 or 3 GB of +> memory, even though the machine has, say, 6 GB of physical RAM. The +> OS can use all of the available memory for caching; postgres cannot. +> +> 3. A lot of work has been put into the seek algorithms, read-ahead +> algorithms, block allocation algorithms, etc. in the OS. Why +> duplicate all that work again in postgres? +> +> When you say things like the following: +> +> > We should only be writing blocks when they need to be on disk. We +> > should not be expecting the OS to write them "sometime later" and +> > avoid blocking (as long) for the write. If we need that, then our +> > buffer management is wrong and we need to fix it. +> +> you appear to be making the arugment that we should take the route of +> other database systems, and use raw devices and our own management of +> disk block allocation. If so, you might want first to look back through +> the archives at the discussion I and several others had about this a +> month or two ago. After looking in detail at what NetBSD, at least, does +> in terms of its disk I/O algorithms and buffering, I've pretty much come +> around, at least for the moment, to the attitude that we should stick +> with using the OS. I wouldn't mind seeing postgres be able to manage all +> of this stuff, but it's a *lot* of work for not all that much benefit +> that I can see. + +I'll back off on that. I don't know if we want to use the OS buffer +manager, but shouldn't we try to have our buffer manager group writes +together by files, and pro-actively get them out to disk? Right now, it +looks like all our write requests are delayed as long as possible and +the order in which they are written is pretty-much random, as is the +backend that writes the block, so there is no locality of reference even +when the blocks are adjacent on disk, and the write calls are spread-out +over all the backends. + +Would it not be the case that things like read-ahead, grouping writes, +and caching written data are probably best done by PostgreSQL, because +only our buffer manager can understand when they will be useful or when +they will thrash the cache? + +I may likely be wrong on this, and I haven't done any performance +testing. I shouldn't have brought this up alongside the logging issues, +but there seemed to be some question about whether the OS was actually +doing all these things behind the scene. + + +> +> > The ORACLE people were not kidding when they said that they could not +> > certify Linux for production use until it supported O_DSYNC. Can you +> > explain why that was the case? +> +> I'm suspecting it's because Linux at the time had no raw devices, so +> O_DSYNC was the only other possible method of making sure that disk +> writes actually got to disk. +> +> You certainly don't want to use O_DSYNC if you can use another method, +> because O_DSYNC still goes through the the operating system's buffer +> cache, wasting memory and double-caching things. If you're doing your +> own management, you need either to use a raw device or open files with +> the flag that indicates that the buffer cache should not be used at all +> for reads from and writes to that file. + +Would O_DSYNC|O_RSYNC turn off the cache? + +> +> > However, this discussion and a search of the pgsql-hackers archives +> > reveals this problem to be the KEY area of PostgreSQL's failing, and +> > general misunderstanding, when compared to its commercial competitors. +> +> No, I think it's just that you're under a few minor misapprehensions +> here about what postgres and the OS are actually doing. As I said, I +> went through this whole exact argument a month or two ago, on this very +> list, and I came around to the idea that what postgres is doing now +> works quite well, at least on NetBSD. (Most other OSes have disk I/O +> algorithms that are pretty much as good or better.) There might be a +> very slight advantage to doing all one's own I/O management, but it's +> a huge amount of work, and I think that much effort could be much more +> usefully applied to other areas. + +I will look for that discussion in the archives. + +The logging issue is a key one I think. At least I would be very nervous +as a DBA if I were running a system where any damaged file would cause +data loss. + +Does anyone know what the major barriers to infinite log replay are in +PostgreSQL? I'm trying to look for everything that might need to be +changed outside xlog.c, but surely this has come up before. Searching +the archives hasn't revealed much. + + + +As to the I/O issue: + +Since you know a lot about NetBSD internals, I'd be interested in +hearing about what postgresql looks like to the NetBSD buffer manager. +Am I right that strings of successive writes get randomized? What do our +cache-hit percentages look like? I'm going to do some experimenting with +this. + +> +> Just as a side note, I've been a NetBSD developer since about '96, +> and have been delving into the details of OS design since well before +> that time, so I'm coming to this with what I hope is reasonably good +> knowledge of how disks work and how operating systems use them. (Not +> that this should stop you from pointing out holes in my arguments. :-)) +> + +This stuff is very difficult to get right. Glad to know you follow this +list. + + +> cjs +> -- +> Curt Sampson +81 90 7737 2974 http://www.netbsd.org +> Don't you know, in this new Dark Age, we're all light. --XTC +> +-- +J. R. Nield +jrnield@usol.com + + + + +From tgl@sss.pgh.pa.us Mon Jun 24 17:16:06 2002 +Return-path: +Received: from sss.pgh.pa.us (root@[192.204.191.242]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLG5F25284 + for ; Mon, 24 Jun 2002 17:16:05 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLG2121379; + Mon, 24 Jun 2002 17:16:02 -0400 (EDT) +To: "J. R. Nield" +cc: Curt Sampson , Bruce Momjian , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> +References: <1024951786.1793.865.camel@localhost.localdomain> +Comments: In-reply-to "J. R. Nield" + message dated "24 Jun 2002 16:49:42 -0400" +Date: Mon, 24 Jun 2002 17:16:01 -0400 +Message-ID: <21376.1024953361@sss.pgh.pa.us> +From: Tom Lane +Status: OR + +"J. R. Nield" writes: +> Also, postgreSQL can't recover from any other type of block corruption, +> while the commercial systems can. + +Say again? + +> Would it not be the case that things like read-ahead, grouping writes, +> and caching written data are probably best done by PostgreSQL, because +> only our buffer manager can understand when they will be useful or when +> they will thrash the cache? + +I think you have been missing the point. No one denies that there will +be some incremental gain if we do all that. However, the conclusion of +everyone who has thought much about it (and I see Curt has joined that +group) is that the effort would be far out of proportion to the probable +gain. There are a lot of other things we desperately need to spend time +on that would not amount to re-engineering large quantities of OS-level +code. Given that most Unixen have perfectly respectable disk management +subsystems, we prefer to tune our code to make use of that stuff, rather +than follow the "conventional wisdom" that databases need to bypass it. + +Oracle can afford to do that sort of thing because they have umpteen +thousand developers available. Postgres does not. + + regards, tom lane + +From pgsql-hackers-owner+M24128@postgresql.org Mon Jun 24 22:01:58 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P21vF19918 + for ; Mon, 24 Jun 2002 22:01:57 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 540B8475B33; Mon, 24 Jun 2002 21:34:40 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 21:34:40 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 0A13F476965; Mon, 24 Jun 2002 19:30:14 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id B4F62476E4A + for ; Mon, 24 Jun 2002 18:53:59 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 18:53:59 2002 +Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) + by postgresql.org (Postfix) with ESMTP id 36043475BF6 + for ; Mon, 24 Jun 2002 17:25:28 -0400 (EDT) +Received: (from pgman@localhost) + by candle.pha.pa.us (8.11.6/8.10.1) id g5OLPFG26140; + Mon, 24 Jun 2002 17:25:15 -0400 (EDT) +From: Bruce Momjian +Message-ID: <200206242125.g5OLPFG26140@candle.pha.pa.us> +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> +To: "J. R. Nield" +Date: Mon, 24 Jun 2002 17:25:14 -0400 (EDT) +cc: Curt Sampson , Tom Lane , + PostgreSQL Hacker +X-Mailer: ELM [version 2.4ME+ PL97 (25)] +MIME-Version: 1.0 +Content-Transfer-Encoding: 7bit +Content-Type: text/plain; charset=US-ASCII +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-3.4 required=5.0 + tests=IN_REP_TO + version=2.30 +Status: OR + +J. R. Nield wrote: +> > This I don't quite understand. Assuming you're using a SCSI drive +> > (and this mostly applies to ATAPI/IDE, too), you can do naught but +> > align block writes to the structure of the underlying device. When you +> > initiate a SCSI WRITE command, you start by telling the device at which +> > block to start writing and how many blocks you intend to write. Then you +> > start passing the data. +> > +> +> All I'm saying is that the entire postgresql block write must be +> converted into exactly one SCSI write command in all cases, and I don't +> know a portable way to ensure this. + +... + +> I agree with this. My point was only that you need to know what +> guarantees your operating system/hardware combination provides on a +> case-by-case basis, and there is no standard way for a program to +> discover this. Most system administrators are not going to know this +> either, unless databases are their main responsibility. + +Yes, agreed. >1% are going to know the answer to this question so we +have to assume worst case. + +> > It seems to me that it is doing this right now. In fact, it's more +> > reliable than some commerial systems (such as SQL Server) because it can +> > recover from a torn block with just the logfile. +> +> Again, what I meant to say is that the commercial systems can recover +> with an old file backup + logs. How old the backup can be depends only +> on how much time you are willing to spend playing the logs forward. So +> if you do a full backup once a week, and multiplex and backup the logs, +> then even if a backup tape gets destroyed you can still survive. It just +> takes longer. +> +> Also, postgreSQL can't recover from any other type of block corruption, +> while the commercial systems can. That's what I meant by the "critical +> production use" comment, which was sort-of unfair. +> +> So I would say they are equally reliable for torn pages (but not bad +> blocks), and the commercial systems let you trade potential recovery +> time for not having to write the blocks twice. You do need to back-up +> the log archives though. + +Yes, good tradeoff analysis. We recover from partial writes quicker, +and don't require saving of log files, _but_ we don't recover from bad +disk blocks. Good summary. + +> I'll back off on that. I don't know if we want to use the OS buffer +> manager, but shouldn't we try to have our buffer manager group writes +> together by files, and pro-actively get them out to disk? Right now, it +> looks like all our write requests are delayed as long as possible and +> the order in which they are written is pretty-much random, as is the +> backend that writes the block, so there is no locality of reference even +> when the blocks are adjacent on disk, and the write calls are spread-out +> over all the backends. +> +> Would it not be the case that things like read-ahead, grouping writes, +> and caching written data are probably best done by PostgreSQL, because +> only our buffer manager can understand when they will be useful or when +> they will thrash the cache? + +The OS should handle all of this. We are doing main table writes but no +sync until checkpoint, so the OS can keep those blocks around and write +them at its convenience. It knows the size of the buffer cache and when +stuff is forced to disk. We can't second-guess that. + +> I may likely be wrong on this, and I haven't done any performance +> testing. I shouldn't have brought this up alongside the logging issues, +> but there seemed to be some question about whether the OS was actually +> doing all these things behind the scene. + +It had better. Looking at the kernel source is the way to know. + +> Does anyone know what the major barriers to infinite log replay are in +> PostgreSQL? I'm trying to look for everything that might need to be +> changed outside xlog.c, but surely this has come up before. Searching +> the archives hasn't revealed much. + +This has been brought up. Could we just save WAL files and get replay? +I believe some things have to be added to WAL to allow this, but it +seems possible. However, the pg_dump is just a data dump and does not +have the file offsets and things. Somehow you would need a tar-type +backup of the database, and with a running db, it is hard to get a valid +snapshot of that. + +-- + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 + + + +---------------------------(end of broadcast)--------------------------- +TIP 3: if posting/reading through Usenet, please send an appropriate +subscribe-nomail command to majordomo@postgresql.org so that your +message can get through to the mailing list cleanly + + + +From tgl@sss.pgh.pa.us Mon Jun 24 17:31:57 2002 +Return-path: +Received: from sss.pgh.pa.us (root@[192.204.191.242]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLVuF26684 + for ; Mon, 24 Jun 2002 17:31:56 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLVu121485; + Mon, 24 Jun 2002 17:31:56 -0400 (EDT) +To: Bruce Momjian +cc: "J. R. Nield" , Curt Sampson , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <200206242125.g5OLPFG26140@candle.pha.pa.us> +References: <200206242125.g5OLPFG26140@candle.pha.pa.us> +Comments: In-reply-to Bruce Momjian + message dated "Mon, 24 Jun 2002 17:25:14 -0400" +Date: Mon, 24 Jun 2002 17:31:56 -0400 +Message-ID: <21482.1024954316@sss.pgh.pa.us> +From: Tom Lane +Status: ORr + +Bruce Momjian writes: +>> Does anyone know what the major barriers to infinite log replay are in +>> PostgreSQL? I'm trying to look for everything that might need to be +>> changed outside xlog.c, but surely this has come up before. Searching +>> the archives hasn't revealed much. + +> This has been brought up. Could we just save WAL files and get replay? +> I believe some things have to be added to WAL to allow this, but it +> seems possible. + +The Red Hat group has been looking at this somewhat; so far there seem +to be some minor tweaks that would be needed, but no showstoppers. + +> Somehow you would need a tar-type +> backup of the database, and with a running db, it is hard to get a valid +> snapshot of that. + +But you don't *need* a "valid snapshot", only a correct copy of +every block older than the first checkpoint in your WAL log series. +Any inconsistencies in your tar dump will look like repairable damage; +replaying the WAL log will fix 'em. + + regards, tom lane + +From pgsql-hackers-owner+M24131@postgresql.org Mon Jun 24 21:15:06 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P1F5F15390 + for ; Mon, 24 Jun 2002 21:15:05 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id B76174768CC; Mon, 24 Jun 2002 20:59:56 -0400 (EDT) +Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 20:59:56 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 8724C47742E; Mon, 24 Jun 2002 20:17:44 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 4E472476875 + for ; Mon, 24 Jun 2002 18:37:46 -0400 (EDT) +Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 18:37:46 2002 +Received: from sss.pgh.pa.us (unknown [192.204.191.242]) + by postgresql.org (Postfix) with ESMTP id CFCC9476A7A + for ; Mon, 24 Jun 2002 17:32:02 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLVu121485; + Mon, 24 Jun 2002 17:31:56 -0400 (EDT) +To: Bruce Momjian +cc: "J. R. Nield" , Curt Sampson , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <200206242125.g5OLPFG26140@candle.pha.pa.us> +References: <200206242125.g5OLPFG26140@candle.pha.pa.us> +Comments: In-reply-to Bruce Momjian + message dated "Mon, 24 Jun 2002 17:25:14 -0400" +Date: Mon, 24 Jun 2002 17:31:56 -0400 +Message-ID: <21482.1024954316@sss.pgh.pa.us> +From: Tom Lane +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-5.3 required=5.0 + tests=IN_REP_TO,X_NOT_PRESENT + version=2.30 +Status: OR + +Bruce Momjian writes: +>> Does anyone know what the major barriers to infinite log replay are in +>> PostgreSQL? I'm trying to look for everything that might need to be +>> changed outside xlog.c, but surely this has come up before. Searching +>> the archives hasn't revealed much. + +> This has been brought up. Could we just save WAL files and get replay? +> I believe some things have to be added to WAL to allow this, but it +> seems possible. + +The Red Hat group has been looking at this somewhat; so far there seem +to be some minor tweaks that would be needed, but no showstoppers. + +> Somehow you would need a tar-type +> backup of the database, and with a running db, it is hard to get a valid +> snapshot of that. + +But you don't *need* a "valid snapshot", only a correct copy of +every block older than the first checkpoint in your WAL log series. +Any inconsistencies in your tar dump will look like repairable damage; +replaying the WAL log will fix 'em. + + regards, tom lane + + + +---------------------------(end of broadcast)--------------------------- +TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org + + + +From pgsql-hackers-owner+M24133@postgresql.org Mon Jun 24 22:19:55 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P2JsF21543 + for ; Mon, 24 Jun 2002 22:19:54 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 42391476E53; Mon, 24 Jun 2002 22:09:49 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 22:09:49 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 191654774EB; Mon, 24 Jun 2002 20:26:08 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 8EB90476101 + for ; Mon, 24 Jun 2002 19:43:19 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 19:43:19 2002 +Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) + by postgresql.org (Postfix) with ESMTP id 08018476931 + for ; Mon, 24 Jun 2002 17:33:53 -0400 (EDT) +Received: (from pgman@localhost) + by candle.pha.pa.us (8.11.6/8.10.1) id g5OLXhl26908; + Mon, 24 Jun 2002 17:33:43 -0400 (EDT) +From: Bruce Momjian +Message-ID: <200206242133.g5OLXhl26908@candle.pha.pa.us> +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <21482.1024954316@sss.pgh.pa.us> +To: Tom Lane +Date: Mon, 24 Jun 2002 17:33:43 -0400 (EDT) +cc: "J. R. Nield" , Curt Sampson , + PostgreSQL Hacker +X-Mailer: ELM [version 2.4ME+ PL97 (25)] +MIME-Version: 1.0 +Content-Transfer-Encoding: 7bit +Content-Type: text/plain; charset=US-ASCII +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-3.4 required=5.0 + tests=IN_REP_TO + version=2.30 +Status: OR + +Tom Lane wrote: +> Bruce Momjian writes: +> >> Does anyone know what the major barriers to infinite log replay are in +> >> PostgreSQL? I'm trying to look for everything that might need to be +> >> changed outside xlog.c, but surely this has come up before. Searching +> >> the archives hasn't revealed much. +> +> > This has been brought up. Could we just save WAL files and get replay? +> > I believe some things have to be added to WAL to allow this, but it +> > seems possible. +> +> The Red Hat group has been looking at this somewhat; so far there seem +> to be some minor tweaks that would be needed, but no showstoppers. + + +Good. + +> > Somehow you would need a tar-type +> > backup of the database, and with a running db, it is hard to get a valid +> > snapshot of that. +> +> But you don't *need* a "valid snapshot", only a correct copy of +> every block older than the first checkpoint in your WAL log series. +> Any inconsistencies in your tar dump will look like repairable damage; +> replaying the WAL log will fix 'em. + +Yes, my point was that you need physical file backups, not pg_dump, and +you have to be tricky about the files changing during the backup. You +_can_ work around changes to the files during backup. + +-- + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 +/usr/local/bin/mime: cannot create /dev/ttyp3: permission denied + + + +---------------------------(end of broadcast)--------------------------- +TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org + + + +From pgsql-hackers-owner+M24139@postgresql.org Tue Jun 25 00:00:22 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P40LF00838 + for ; Tue, 25 Jun 2002 00:00:21 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id CBAE8476E94; Mon, 24 Jun 2002 23:44:51 -0400 (EDT) +Mailbox-Line: From jrnield@usol.com Mon Jun 24 23:44:51 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id C5076476871; Mon, 24 Jun 2002 22:25:46 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 8DF57476979 + for ; Mon, 24 Jun 2002 22:08:31 -0400 (EDT) +Mailbox-Line: From jrnield@usol.com Mon Jun 24 22:08:31 2002 +Received: from hades.usol.com (hades.usol.com [208.232.58.41]) + by postgresql.org (Postfix) with ESMTP id 298D2476101 + for ; Mon, 24 Jun 2002 20:27:46 -0400 (EDT) +Received: from 08-159.024.popsite.net (08-159.024.popsite.net [66.19.4.159]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5P0RbV01261; + Mon, 24 Jun 2002 20:27:37 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Tom Lane +cc: Curt Sampson , Bruce Momjian , + PostgreSQL Hacker +In-Reply-To: <21376.1024953361@sss.pgh.pa.us> +References: + <1024951786.1793.865.camel@localhost.localdomain> + <21376.1024953361@sss.pgh.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 24 Jun 2002 20:28:00 -0400 +Message-ID: <1024964884.3031.876.camel@localhost.localdomain> +MIME-Version: 1.0 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-3.4 required=5.0 + tests=IN_REP_TO + version=2.30 +Status: OR + +On Mon, 2002-06-24 at 17:16, Tom Lane wrote: + +> I think you have been missing the point... +Yes, this appears to be the case. Thanks especially to Curt for clearing +things up for me. + +-- +J. R. Nield +jrnield@usol.com + + + + + + +---------------------------(end of broadcast)--------------------------- +TIP 5: Have you checked our extensive FAQ? + +http://www.postgresql.org/users-lounge/docs/faq.html + + + +From jrnield@usol.com Mon Jun 24 20:27:45 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P0RhF10711 + for ; Mon, 24 Jun 2002 20:27:44 -0400 (EDT) +Received: from 08-159.024.popsite.net (08-159.024.popsite.net [66.19.4.159]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5P0RbV01261; + Mon, 24 Jun 2002 20:27:37 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Tom Lane +cc: Curt Sampson , Bruce Momjian , + PostgreSQL Hacker +In-Reply-To: <21376.1024953361@sss.pgh.pa.us> +References: + <1024951786.1793.865.camel@localhost.localdomain> + <21376.1024953361@sss.pgh.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 24 Jun 2002 20:28:00 -0400 +Message-ID: <1024964884.3031.876.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: OR + +On Mon, 2002-06-24 at 17:16, Tom Lane wrote: + +> I think you have been missing the point... +Yes, this appears to be the case. Thanks especially to Curt for clearing +things up for me. + +-- +J. R. Nield +jrnield@usol.com + + + + +From cjs@cynic.net Mon Jun 24 23:32:23 2002 +Return-path: +Received: from academic.cynic.net ([63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P3WMF28287 + for ; Mon, 24 Jun 2002 23:32:23 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 28AB5F820; Tue, 25 Jun 2002 03:32:08 +0000 (UTC) +Date: Tue, 25 Jun 2002 12:32:05 +0900 (JST) +From: Curt Sampson +To: "J. R. Nield" +cc: Bruce Momjian , Tom Lane , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On 24 Jun 2002, J. R. Nield wrote: + +> All I'm saying is that the entire postgresql block write must be +> converted into exactly one SCSI write command in all cases, and I don't +> know a portable way to ensure this. + +No, there's no portable way. All you can do is give the admin who +is able to set things up safely the ability to turn of the now-unneeded +(and expensive) safety-related stuff that postgres does. + +> I agree with this. My point was only that you need to know what +> guarantees your operating system/hardware combination provides on a +> case-by-case basis, and there is no standard way for a program to +> discover this. Most system administrators are not going to know this +> either, unless databases are their main responsibility. + +Certainly this is true of pretty much every database system out there. + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From cjs@cynic.net Tue Jun 25 01:09:02 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P591F07292 + for ; Tue, 25 Jun 2002 01:09:01 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 517BEF820; Tue, 25 Jun 2002 05:09:02 +0000 (UTC) +Date: Tue, 25 Jun 2002 14:08:59 +0900 (JST) +From: Curt Sampson +To: Tom Lane +cc: "J. R. Nield" , Bruce Momjian , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <21376.1024953361@sss.pgh.pa.us> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: ORr + +On Mon, 24 Jun 2002, Tom Lane wrote: + +> There are a lot of other things we desperately need to spend time +> on that would not amount to re-engineering large quantities of OS-level +> code. Given that most Unixen have perfectly respectable disk management +> subsystems, we prefer to tune our code to make use of that stuff, rather +> than follow the "conventional wisdom" that databases need to bypass it. +> ... +> Oracle can afford to do that sort of thing because they have umpteen +> thousand developers available. Postgres does not. + +Well, Oracle also started out, a long long time ago, on systems without +unified buffer cache and so on, and so they *had* to write this stuff +because otherwise data would not be cached. So Oracle can also afford to +maintain it now because the code already exists. + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From pgsql-hackers-owner+M24154@postgresql.org Tue Jun 25 09:22:38 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PDMbF03932 + for ; Tue, 25 Jun 2002 09:22:37 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id C12C3475E4A; Tue, 25 Jun 2002 09:22:32 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 09:22:32 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 65471475C7A; Tue, 25 Jun 2002 09:22:23 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 97C8C475A7C + for ; Tue, 25 Jun 2002 09:22:20 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 09:22:20 2002 +Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) + by postgresql.org (Postfix) with ESMTP id 42C0B475A64 + for ; Tue, 25 Jun 2002 09:22:19 -0400 (EDT) +Received: (from pgman@localhost) + by candle.pha.pa.us (8.11.6/8.10.1) id g5PDM5B03772; + Tue, 25 Jun 2002 09:22:05 -0400 (EDT) +From: Bruce Momjian +Message-ID: <200206251322.g5PDM5B03772@candle.pha.pa.us> +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: +To: Curt Sampson +Date: Tue, 25 Jun 2002 09:22:05 -0400 (EDT) +cc: Tom Lane , "J. R. Nield" , + PostgreSQL Hacker +X-Mailer: ELM [version 2.4ME+ PL97 (25)] +MIME-Version: 1.0 +Content-Transfer-Encoding: 7bit +Content-Type: text/plain; charset=US-ASCII +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-3.4 required=5.0 + tests=IN_REP_TO + version=2.30 +Status: OR + +Curt Sampson wrote: +> On Mon, 24 Jun 2002, Tom Lane wrote: +> +> > There are a lot of other things we desperately need to spend time +> > on that would not amount to re-engineering large quantities of OS-level +> > code. Given that most Unixen have perfectly respectable disk management +> > subsystems, we prefer to tune our code to make use of that stuff, rather +> > than follow the "conventional wisdom" that databases need to bypass it. +> > ... +> > Oracle can afford to do that sort of thing because they have umpteen +> > thousand developers available. Postgres does not. +> +> Well, Oracle also started out, a long long time ago, on systems without +> unified buffer cache and so on, and so they *had* to write this stuff +> because otherwise data would not be cached. So Oracle can also afford to +> maintain it now because the code already exists. + +Well, actually, it isn't unified buffer cache that is the issue, but +rather the older SysV file system had pretty poor performance so +bypassing it was a bigger win that it is today. + +-- + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 + + + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster + + +