From aoki@postgres.Berkeley.EDU Sun Jun 22 19:31:06 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id TAA19488 for ; Sun, 22 Jun 1997 19:31:03 -0400 (EDT) Received: from faerie.CS.Berkeley.EDU (faerie.CS.Berkeley.EDU [128.32.37.53]) by renoir.op.net ($ Revision: 1.12 $) with SMTP id TAA18795 for ; Sun, 22 Jun 1997 19:18:06 -0400 (EDT) Received: from localhost.Berkeley.EDU (localhost.Berkeley.EDU [127.0.0.1]) by faerie.CS.Berkeley.EDU (8.6.10/8.6.3) with SMTP id QAA07816 for maillist@candle.pha.pa.us; Sun, 22 Jun 1997 16:16:44 -0700 Message-Id: <199706222316.QAA07816@faerie.CS.Berkeley.EDU> X-Authentication-Warning: faerie.CS.Berkeley.EDU: Host localhost.Berkeley.EDU didn't use HELO protocol From: aoki@CS.Berkeley.EDU (Paul M. Aoki) To: Bruce Momjian Subject: Re: PostgreSQL psort() function performance Reply-To: aoki@CS.Berkeley.EDU (Paul M. Aoki) In-reply-to: Your message of Sun, 22 Jun 1997 09:45:31 -0400 (EDT) <199706221345.JAA11476@candle.pha.pa.us> Date: Sun, 22 Jun 97 16:16:43 -0700 Sender: aoki@postgres.Berkeley.EDU X-Mts: smtp Status: OR the mariposa distribution (http://mariposa.cs.berkeley.edu/) contains some hacks to nodeSort.c and psort.c that - make psort read directly from the executor node below it (instead of an input relation) - makes the Sort node read directly from the last set of psort runs (instead of an output relation) speeds things up quite a bit. kind of ruins psort for other purposes, though (which is why nbtsort.c exists). i'd merge these in first and see how far that gets you. -- Paul M. Aoki | University of California at Berkeley aoki@CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776 | Berkeley, CA 94720-1776 From owner-pgsql-hackers@hub.org Mon Nov 3 09:31:04 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id JAA01676 for ; Mon, 3 Nov 1997 09:31:02 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id JAA07345 for ; Mon, 3 Nov 1997 09:13:20 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id IAA13315; Mon, 3 Nov 1997 08:50:26 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 08:48:07 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id IAA11722 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 08:48:02 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id IAA11539 for ; Mon, 3 Nov 1997 08:47:34 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id UAA19066; Mon, 3 Nov 1997 20:48:04 +0700 (KRS) Message-ID: <345DD614.345BF651@sable.krasnoyarsk.su> Date: Mon, 03 Nov 1997 20:48:04 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Marc Howard Zuckman CC: Bruce Momjian , hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR Marc Howard Zuckman wrote: > > On Mon, 3 Nov 1997, Bruce Momjian wrote: > > > With fsync off, I just did an insert of 1000 integers into a table > > containing a single int4 column and no indexes, and it completed in 2.3 > > seconds. This is on the new source tree.. That is 434 inserts/second. > > Pretty major performance, or 2.3 ms/insert. This is on a idle PP200 > > with UltraSCSI drives. > > > > With fsync on, the time goes to 51 seconds. Wow, big difference. > > If better alternative error recovery methods were available, perhaps > a facility to replay an interval transactions log from a prior dump, > it would be reasonable to run the backend without fsync and > take advantage of the performance gains. ??? > > I don't know the answer, but I suspect that the commercial databases > don't "fsync" the way pgsql does. Could someone try 1000 int4 inserts using postgres and some commercial database (on the same machine) ? Vadim From owner-pgsql-hackers@hub.org Mon Nov 3 09:01:02 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id JAA01183 for ; Mon, 3 Nov 1997 09:01:00 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id IAA06632 for ; Mon, 3 Nov 1997 08:51:58 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id IAA05964; Mon, 3 Nov 1997 08:39:39 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 08:37:32 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id IAA04729 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 08:37:26 -0500 (EST) Received: from fallon.classyad.com (root@classyad.com [152.160.43.1]) by hub.org (8.8.5/8.7.5) with ESMTP id IAA04614 for ; Mon, 3 Nov 1997 08:37:16 -0500 (EST) Received: from fallon.classyad.com (marc@fallon.classyad.com [152.160.43.1]) by fallon.classyad.com (8.8.5/8.7.3) with SMTP id JAA22108; Mon, 3 Nov 1997 09:11:09 -0500 Date: Mon, 3 Nov 1997 09:11:09 -0500 (EST) From: Marc Howard Zuckman To: Bruce Momjian cc: "Vadim B. Mikheev" , hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! In-Reply-To: <199711030513.AAA23474@candle.pha.pa.us> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-hackers@hub.org Precedence: bulk Status: OR On Mon, 3 Nov 1997, Bruce Momjian wrote: > > > > Removed... > > > > Also, ItemPointerData t_chain (6 bytes) removed from HeapTupleHeader. > > CommandId is uint32 now (up to the 2^32 - 1 commands per transaction). > > DOUBLEALIGN(Sizeof(HeapTupleHeader)) is 40 bytes now. > > > > 1000 inserts (into table with single int4 column, 1 insert per transaction) > > takes 70 - 80 sec now (12.5 - 14 transactions/sec). > > This is hardware/OS limitation: > > > > fd = open ("t", O_RDWR); > > for (i = 1; i <= 1000; i++) > > { > > lseek(fd, 0, SEEK_END); > > write(fd, buf, 56); > > fsync(fd); > > } > > close (fd); > > > > takes 33 - 39 sec and so it's not possible to be faster > > having 2 fsync-s per transaction. > > > > The same test on 6.2.1: 92 - 107 sec > > With fsync off, I just did an insert of 1000 integers into a table > containing a single int4 column and no indexes, and it completed in 2.3 > seconds. This is on the new source tree.. That is 434 inserts/second. > Pretty major performance, or 2.3 ms/insert. This is on a idle PP200 > with UltraSCSI drives. > > With fsync on, the time goes to 51 seconds. Wow, big difference. If better alternative error recovery methods were available, perhaps a facility to replay an interval transactions log from a prior dump, it would be reasonable to run the backend without fsync and take advantage of the performance gains. I don't know the answer, but I suspect that the commercial databases don't "fsync" the way pgsql does. Marc Zuckman marc@fallon.classyad.com _\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ _ Visit The Home and Condo MarketPlace _ _ http://www.ClassyAd.com _ _ _ _ FREE basic property listings/advertisements and searches. _ _ _ _ Try our premium, yet inexpensive services for a real _ _ selling or buying edge! _ _\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ From owner-pgsql-hackers@hub.org Mon Nov 3 11:31:03 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA04080 for ; Mon, 3 Nov 1997 11:31:00 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA13680 for ; Mon, 3 Nov 1997 11:21:30 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id LAA07566; Mon, 3 Nov 1997 11:04:52 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 11:02:59 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id LAA07372 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 11:02:52 -0500 (EST) Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id LAA07196 for ; Mon, 3 Nov 1997 11:02:22 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id KAA02525; Mon, 3 Nov 1997 10:42:03 -0500 (EST) From: Bruce Momjian Message-Id: <199711031542.KAA02525@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Mon, 3 Nov 1997 10:42:03 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <345DD614.345BF651@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 3, 97 08:48:04 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > I don't know the answer, but I suspect that the commercial databases > > don't "fsync" the way pgsql does. > > Could someone try 1000 int4 inserts using postgres and > some commercial database (on the same machine) ? I have been thinking about this since seeing the performance change with/without fsync. Commerical databases usually do a log write every 5 or 15 minutes, and guarantee the logs will contain everything up to this time interval. Couldn't we have some such mechanism? Usually they have raw space, so they can control when the data is hitting the disk. Using a file system, some of it may be getting to the disk without our knowing it. What exactly is a scenario where lack of doing explicit fsync's will cause data corruption, rather than just lost data from the past few minutes? I think Vadim has gotten fsync's down to fsync'ing the modified data page, and pg_log. Let's suppose we did not fsync. There could be cases where pg_log was fsync'ed by the OS, and some of the modified data pages are fyncs'ed by the OS, but not others. This would leave us with a partial transaction. However, let's suppose we prevent pg_log from being fsync'ed somehow. Then, because we have a no-overwrite database, we could keep control of this, and write of some data pages, but not others would not cause us problems because the pg_log would show all such transactions, which had not had all their modified data pages fsync'ed, as non-committed. Perhaps we can even set a flag in pg_log every five minutes to indicate whether all buffers for the page have been flushed? That way we could not have to worry about preventing flushing of pg_log. Comments? -- Bruce Momjian maillist@candle.pha.pa.us From owner-pgsql-hackers@hub.org Mon Nov 3 12:00:42 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA04456 for ; Mon, 3 Nov 1997 12:00:40 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id LAA26054; Mon, 3 Nov 1997 11:46:49 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 11:46:33 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id LAA25932 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 11:46:30 -0500 (EST) Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by hub.org (8.8.5/8.7.5) with SMTP id LAA25750 for ; Mon, 3 Nov 1997 11:45:53 -0500 (EST) Received: by orion.SAPserv.Hamburg.dsh.de (Linux Smail3.1.29.1 #1)} id m0xSPfE-000BGZC; Mon, 3 Nov 97 17:47 MET Message-Id: From: wieck@sapserv.debis.de Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: maillist@candle.pha.pa.us (Bruce Momjian) Date: Mon, 3 Nov 1997 17:47:43 +0100 (MET) Cc: vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com, hackers@postgreSQL.org Reply-To: wieck@sapserv.debis.de (Jan Wieck) In-Reply-To: <199711031542.KAA02525@candle.pha.pa.us> from "Bruce Momjian" at Nov 3, 97 10:42:03 am X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > > > I don't know the answer, but I suspect that the commercial databases > > > don't "fsync" the way pgsql does. > > > > Could someone try 1000 int4 inserts using postgres and > > some commercial database (on the same machine) ? > > I have been thinking about this since seeing the performance change > with/without fsync. > > Commerical databases usually do a log write every 5 or 15 minutes, and > guarantee the logs will contain everything up to this time interval. > Without fsync PostgreSQL would only loose data if the OS crashes between the last write operation of a backend and the next regular update sync. This is seldom but if it happens it really hurts. A database can omit fsync on data files (e.g. tablespaces) if it writes a redo log. With that redo log, a backup can be restored and than all transactions since the backup redone. PostgreSQL doesn't write such a redo log. So an OS crash after the fsync of pg_log could corrupt the database without a chance to recover. Isn't it time to get an (optional) redo log. I don't exactly know all the places where our datafiles can get modified, but I hope this is only done in the heap access methods and vacuum. So these are the places from where the redo log data comes from (plus transaction commit/rollback). Until later, Jan -- #define OPINIONS "they are all mine - not those of debis or daimler-benz" #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================== wieck@sapserv.debis.de (Jan Wieck) # From owner-pgsql-hackers@hub.org Mon Nov 3 14:01:06 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA06775 for ; Mon, 3 Nov 1997 14:01:04 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id NAA22235 for ; Mon, 3 Nov 1997 13:43:15 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id NAA11482; Mon, 3 Nov 1997 13:32:40 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 13:32:02 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id NAA11204 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 13:31:58 -0500 (EST) Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id NAA11119 for ; Mon, 3 Nov 1997 13:31:44 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id MAA05464; Mon, 3 Nov 1997 12:59:01 -0500 (EST) From: Bruce Momjian Message-Id: <199711031759.MAA05464@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: wieck@sapserv.debis.de Date: Mon, 3 Nov 1997 12:59:01 -0500 (EST) Cc: vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: from "wieck@sapserv.debis.de" at Nov 3, 97 05:47:43 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > > > > > > I don't know the answer, but I suspect that the commercial databases > > > > don't "fsync" the way pgsql does. > > > > > > Could someone try 1000 int4 inserts using postgres and > > > some commercial database (on the same machine) ? > > > > I have been thinking about this since seeing the performance change > > with/without fsync. > > > > Commerical databases usually do a log write every 5 or 15 minutes, and > > guarantee the logs will contain everything up to this time interval. > > > > Without fsync PostgreSQL would only loose data if the OS > crashes between the last write operation of a backend and the > next regular update sync. This is seldom but if it happens it > really hurts. > > A database can omit fsync on data files (e.g. tablespaces) if > it writes a redo log. With that redo log, a backup can be > restored and than all transactions since the backup redone. > > PostgreSQL doesn't write such a redo log. So an OS crash > after the fsync of pg_log could corrupt the database without > a chance to recover. > > Isn't it time to get an (optional) redo log. I don't exactly > know all the places where our datafiles can get modified, but > I hope this is only done in the heap access methods and > vacuum. So these are the places from where the redo log data > comes from (plus transaction commit/rollback). > Yes, but because we are a non-over-write database, I don't see why we can't just do this without a redo log. Every five minutes, we fsync() all dirty pages, mark all completed transactions as fsync'ed in pg_log, and fsync() pg_log. On postmaster startup, any transaction marked as completed, but not marked as fsync'ed gets marked as aborted. Of course, all vacuum operations would have to be fsync'ed. Comments? -- Bruce Momjian maillist@candle.pha.pa.us From owner-pgsql-hackers@hub.org Mon Nov 3 16:46:01 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id QAA10292 for ; Mon, 3 Nov 1997 16:45:59 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id QAA02040 for ; Mon, 3 Nov 1997 16:42:40 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id QAA17422; Mon, 3 Nov 1997 16:34:28 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 16:34:10 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id QAA17210 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 16:34:06 -0500 (EST) Received: from fallon.classyad.com (root@classyad.com [152.160.43.1]) by hub.org (8.8.5/8.7.5) with ESMTP id QAA16690 for ; Mon, 3 Nov 1997 16:33:27 -0500 (EST) Received: from fallon.classyad.com (marc@fallon.classyad.com [152.160.43.1]) by fallon.classyad.com (8.8.5/8.7.3) with SMTP id RAA32498; Mon, 3 Nov 1997 17:33:42 -0500 Date: Mon, 3 Nov 1997 17:33:42 -0500 (EST) From: Marc Howard Zuckman To: Bruce Momjian cc: wieck@sapserv.debis.de, vadim@sable.krasnoyarsk.su, hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! In-Reply-To: <199711031759.MAA05464@candle.pha.pa.us> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-hackers@hub.org Precedence: bulk Status: OR On Mon, 3 Nov 1997, Bruce Momjian wrote: > > > > > > > > > > I don't know the answer, but I suspect that the commercial databases > > > > > don't "fsync" the way pgsql does. > > > > > > > > Could someone try 1000 int4 inserts using postgres and > > > > some commercial database (on the same machine) ? > > > > > > I have been thinking about this since seeing the performance change > > > with/without fsync. > > > > > > Commerical databases usually do a log write every 5 or 15 minutes, and > > > guarantee the logs will contain everything up to this time interval. > > > > > > > Without fsync PostgreSQL would only loose data if the OS > > crashes between the last write operation of a backend and the > > next regular update sync. This is seldom but if it happens it > > really hurts. > > > > A database can omit fsync on data files (e.g. tablespaces) if > > it writes a redo log. With that redo log, a backup can be > > restored and than all transactions since the backup redone. > > > > PostgreSQL doesn't write such a redo log. So an OS crash > > after the fsync of pg_log could corrupt the database without > > a chance to recover. > > > > Isn't it time to get an (optional) redo log. I don't exactly > > know all the places where our datafiles can get modified, but > > I hope this is only done in the heap access methods and > > vacuum. So these are the places from where the redo log data > > comes from (plus transaction commit/rollback). > > > > Yes, but because we are a non-over-write database, I don't see why we > can't just do this without a redo log. Because if the hard drive is the reason for the failure (instead of power out, OS bites dust, etc), the database won't be of much help. The redo log should be on a device different than the database. Marc Zuckman marc@fallon.classyad.com _\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ _ Visit The Home and Condo MarketPlace _ _ http://www.ClassyAd.com _ _ _ _ FREE basic property listings/advertisements and searches. _ _ _ _ Try our premium, yet inexpensive services for a real _ _ selling or buying edge! _ _\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ From maillist Mon Nov 3 22:59:31 1997 Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id WAA16264; Mon, 3 Nov 1997 22:59:31 -0500 (EST) From: Bruce Momjian Message-Id: <199711040359.WAA16264@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: maillist@candle.pha.pa.us (Bruce Momjian) Date: Mon, 3 Nov 1997 22:59:30 -0500 (EST) Cc: vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <199711031542.KAA02525@candle.pha.pa.us> from "Bruce Momjian" at Nov 3, 97 10:42:03 am X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Status: OR > > > > I don't know the answer, but I suspect that the commercial databases > > > don't "fsync" the way pgsql does. > > > > Could someone try 1000 int4 inserts using postgres and > > some commercial database (on the same machine) ? > > I have been thinking about this since seeing the performance change > with/without fsync. > > Commercial databases usually do a log write every 5 or 15 minutes, and > guarantee the logs will contain everything up to this time interval. > > Couldn't we have some such mechanism? Usually they have raw space, so > they can control when the data is hitting the disk. Using a file > system, some of it may be getting to the disk without our knowing it. > > What exactly is a scenario where lack of doing explicit fsync's will > cause data corruption, rather than just lost data from the past few > minutes? > > I think Vadim has gotten fsync's down to fsync'ing the modified data > page, and pg_log. > > Let's suppose we did not fsync. There could be cases where pg_log was > fsync'ed by the OS, and some of the modified data pages are fyncs'ed by > the OS, but not others. This would leave us with a partial transaction. > > However, let's suppose we prevent pg_log from being fsync'ed somehow. > Then, because we have a no-overwrite database, we could keep control of > this, and write of some data pages, but not others would not cause us > problems because the pg_log would show all such transactions, which had > not had all their modified data pages fsync'ed, as non-committed. > > Perhaps we can even set a flag in pg_log every five minutes to indicate > whether all buffers for the page have been flushed? That way we could > not have to worry about preventing flushing of pg_log. > > Comments? OK, here is a more formal description of what I am suggesting. It will give us commercial dbms reliability with no-fsync performance. Commercial dbms's usually only give restore up to 5 minutes before the crash, and this is what I am suggesting. If we can do this, we can remove the no-fsync option. First, lets suppose there exists a shared queue that is visible to all backends and the postmaster that allows transaction id's to be added to the queue. We also add a bit to the pg_log record called 'been_synced' that is initially false. OK, once a backend starts a transaction, it puts a transaction id in pg_log. Once the transaction is finished, it is marked as committed. At the same time, we now put the transaction id on the shared queue. Every five minutes, or as defined by the administrator, the postmaster does a sync() call. On my OS, anyone use can call sync, and I think this is typical. update/pagecleaner does this every 30 seconds anyway, so it is no big deal for the postmaster to call it every 5 minutes. The nice thing about this is that the OS does the syncing of all the dirty pages for us. (An alarm() call can set up this 5 minute timing.) The postmaster then locks the shared transaction id queue, makes a copy of the entries in the queue, clears the queue, and unlocks the queue. It does this so no one else modifies the queue while it is being cleared. The postmaster then goes through pg_log, and marks each transaction as 'been_synced'. The postmaster also performs this on shutdown. On postmaster startup, all transactions are checked and any transaction that is marked as committed but not 'been_synced' is marked as not committed. In this way, we prevent non-synced or partially synced transactions from being used. Of course, vacuum would have to do normal fsyncs because it is removing the transaction log. We need the shared transaction id queue because there is no way to find the newly committed transactions since the last sync. A transaction can last for hours. -- Bruce Momjian maillist@candle.pha.pa.us From owner-pgsql-hackers@hub.org Tue Nov 4 02:13:08 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id CAA17544 for ; Tue, 4 Nov 1997 02:13:06 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id CAA14126; Tue, 4 Nov 1997 02:07:55 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 02:04:59 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id CAA12859 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 02:04:51 -0500 (EST) Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by hub.org (8.8.5/8.7.5) with SMTP id CAA12625 for ; Tue, 4 Nov 1997 02:04:12 -0500 (EST) Received: by orion.SAPserv.Hamburg.dsh.de (Linux Smail3.1.29.1 #1)} id m0xSd44-000BFQC; Tue, 4 Nov 97 08:06 MET Message-Id: From: wieck@sapserv.debis.de Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: maillist@candle.pha.pa.us (Bruce Momjian) Date: Tue, 4 Nov 1997 08:06:16 +0100 (MET) Cc: maillist@candle.pha.pa.us, vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com, hackers@postgreSQL.org Reply-To: wieck@sapserv.debis.de (Jan Wieck) In-Reply-To: <199711040359.WAA16264@candle.pha.pa.us> from "Bruce Momjian" at Nov 3, 97 10:59:30 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > OK, here is a more formal description of what I am suggesting. It will > give us commercial dbms reliability with no-fsync performance. > Commercial dbms's usually only give restore up to 5 minutes before the > crash, and this is what I am suggesting. If we can do this, we can > remove the no-fsync option. I'm not 100% sure but as far as I know Oracle, it can recover up to the last committed transaction using the online redo logs. And even if commercial dbms's aren't able to do that, it should be our target. > [description about transaction queue] This all depends on the fact that PostgreSQL is a no overwrite dbms. Otherwise the space of deleted tuples might get overwritten by later transactions and the information is finally lost. Another issue: All we up to now though of are crashes where the database files are still usable after restart. But take the simple case of a write error. A new bad block or track will get remapped (in some way) but the data in it is lost. So we end up with one or more totally corrupted database files. And I don't trust mirrored disks farer than I can throw them. A bug in the OS or a memory failure (many new PeeCee boards don't support parity and even with parity a two bit failure is still the wrong data but with a valid parity bit) can also currupt the data. I still prefer redo logs. They should reside on a different disk and the possibility of loosing the database files along with the redo log is very small. Until later, Jan -- #define OPINIONS "they are all mine - not those of debis or daimler-benz" #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================== wieck@sapserv.debis.de (Jan Wieck) # From vadim@sable.krasnoyarsk.su Tue Nov 4 04:12:50 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id EAA18487 for ; Tue, 4 Nov 1997 04:12:48 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id EAA03152 for ; Tue, 4 Nov 1997 04:12:06 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id QAA20591; Tue, 4 Nov 1997 16:14:06 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <345EE75D.398A68D@sable.krasnoyarsk.su> Date: Tue, 04 Nov 1997 16:14:05 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Bruce Momjian CC: marc@fallon.classyad.com, hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711040359.WAA16264@candle.pha.pa.us> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR Bruce Momjian wrote: > > OK, here is a more formal description of what I am suggesting. It will > give us commercial dbms reliability with no-fsync performance. > Commercial dbms's usually only give restore up to 5 minutes before the ^^^^^^^^^^^^^^^^^^^^^^^ I'm sure that this is not true! If on-line redo_file is damaged then you have single ability: restore your last backup. In all other cases database will be recovered up to the last committed transaction automatically! DBMS-s using WAL have to fsync only redo file on commit (and they do it!), non-overwriting systems have to fsync data files and transaction log. We could optimize fsync-s for multi-user environment: do not fsync when we're ensured that our changes flushed to disk by another backend. > crash, and this is what I am suggesting. If we can do this, we can > remove the no-fsync option. > ... > > On postmaster startup, all transactions are checked and any transaction > that is marked as committed but not 'been_synced' is marked as not > committed. In this way, we prevent non-synced or partially synced > transactions from being used. And what should users (ensured that their transaction are committed) do in this case ? Vadim From owner-pgsql-hackers@hub.org Tue Nov 4 04:21:04 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id EAA18536 for ; Tue, 4 Nov 1997 04:21:01 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id EAA15551; Tue, 4 Nov 1997 04:15:15 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 04:14:23 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id EAA14464 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 04:14:18 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id EAA13437 for ; Tue, 4 Nov 1997 04:13:33 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id QAA20591; Tue, 4 Nov 1997 16:14:06 +0700 (KRS) Message-ID: <345EE75D.398A68D@sable.krasnoyarsk.su> Date: Tue, 04 Nov 1997 16:14:05 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Bruce Momjian CC: marc@fallon.classyad.com, hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711040359.WAA16264@candle.pha.pa.us> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR Bruce Momjian wrote: > > OK, here is a more formal description of what I am suggesting. It will > give us commercial dbms reliability with no-fsync performance. > Commercial dbms's usually only give restore up to 5 minutes before the ^^^^^^^^^^^^^^^^^^^^^^^ I'm sure that this is not true! If on-line redo_file is damaged then you have single ability: restore your last backup. In all other cases database will be recovered up to the last committed transaction automatically! DBMS-s using WAL have to fsync only redo file on commit (and they do it!), non-overwriting systems have to fsync data files and transaction log. We could optimize fsync-s for multi-user environment: do not fsync when we're ensured that our changes flushed to disk by another backend. > crash, and this is what I am suggesting. If we can do this, we can > remove the no-fsync option. > ... > > On postmaster startup, all transactions are checked and any transaction > that is marked as committed but not 'been_synced' is marked as not > committed. In this way, we prevent non-synced or partially synced > transactions from being used. And what should users (ensured that their transaction are committed) do in this case ? Vadim From owner-pgsql-hackers@hub.org Tue Nov 4 06:43:00 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id GAA19743 for ; Tue, 4 Nov 1997 06:42:57 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id GAA10352; Tue, 4 Nov 1997 06:36:08 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 06:35:42 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id GAA10158 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 06:35:37 -0500 (EST) Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id GAA10096 for ; Tue, 4 Nov 1997 06:35:27 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id GAA19665; Tue, 4 Nov 1997 06:35:10 -0500 (EST) From: Bruce Momjian Message-Id: <199711041135.GAA19665@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: wieck@sapserv.debis.de Date: Tue, 4 Nov 1997 06:35:10 -0500 (EST) Cc: hackers@postgreSQL.org (PostgreSQL-development) In-Reply-To: from "wieck@sapserv.debis.de" at Nov 4, 97 08:06:16 am X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > > OK, here is a more formal description of what I am suggesting. It will > > give us commercial dbms reliability with no-fsync performance. > > Commercial dbms's usually only give restore up to 5 minutes before the > > crash, and this is what I am suggesting. If we can do this, we can > > remove the no-fsync option. > > I'm not 100% sure but as far as I know Oracle, it can recover > up to the last committed transaction using the online redo > logs. And even if commercial dbms's aren't able to do that, > it should be our target. > > > [description about transaction queue] > > This all depends on the fact that PostgreSQL is a no > overwrite dbms. Otherwise the space of deleted tuples might > get overwritten by later transactions and the information is > finally lost. > > Another issue: All we up to now though of are crashes where > the database files are still usable after restart. But take > the simple case of a write error. A new bad block or track > will get remapped (in some way) but the data in it is lost. > So we end up with one or more totally corrupted database > files. And I don't trust mirrored disks farer than I can > throw them. A bug in the OS or a memory failure (many new > PeeCee boards don't support parity and even with parity a two > bit failure is still the wrong data but with a valid parity > bit) can also currupt the data. > > I still prefer redo logs. They should reside on a different > disk and the possibility of loosing the database files along > with the redo log is very small. I have been thinking about re-do logs, and I think it is a good idea. It would not be hard to have the queries spit out to a separate file configurable by the user. -- Bruce Momjian maillist@candle.pha.pa.us From owner-pgsql-hackers@hub.org Tue Nov 4 07:31:01 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22051 for ; Tue, 4 Nov 1997 07:30:59 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA07444 for ; Tue, 4 Nov 1997 07:25:14 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id HAA08818; Tue, 4 Nov 1997 07:03:30 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 07:02:44 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id HAA08418 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 07:02:29 -0500 (EST) Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id HAA08331 for ; Tue, 4 Nov 1997 07:02:07 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id GAA21484; Tue, 4 Nov 1997 06:50:24 -0500 (EST) From: Bruce Momjian Message-Id: <199711041150.GAA21484@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Tue, 4 Nov 1997 06:50:24 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <345EE75D.398A68D@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 4, 97 04:14:05 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > Bruce Momjian wrote: > > > > OK, here is a more formal description of what I am suggesting. It will > > give us commercial dbms reliability with no-fsync performance. > > Commercial dbms's usually only give restore up to 5 minutes before the > ^^^^^^^^^^^^^^^^^^^^^^^ > I'm sure that this is not true! You may be right. This five minute figure is when you restore from your previous backup, then restore from the log file. Can't we do something like sync every 5 seconds, rather than after every transaction? It just seems like such overkill. Actually, I found a problem with my description. Because pg_log is not fsync'ed, after a crash, pages with new transactions could have been flushed to disk, but not the pg_log table that contains the transaction ids. The problem is that the new backend could assign a transaction id that is already in use. We could set a flag upon successful shutdown, and if it is not set on reboot, either do a vacuum to find the max transaction id, and invalidate all them not in pg_log as synced, or increase the next transaction id to some huge number and invalidate all them in between. > If on-line redo_file is damaged then you have > single ability: restore your last backup. > In all other cases database will be recovered up to the last > committed transaction automatically! > > DBMS-s using WAL have to fsync only redo file on commit > (and they do it!), non-overwriting systems have to > fsync data files and transaction log. > > We could optimize fsync-s for multi-user environment: do not > fsync when we're ensured that our changes flushed to disk by > another backend. > > > crash, and this is what I am suggesting. If we can do this, we can > > remove the no-fsync option. > > > ... > > > > On postmaster startup, all transactions are checked and any transaction > > that is marked as committed but not 'been_synced' is marked as not > > committed. In this way, we prevent non-synced or partially synced > > transactions from being used. > > And what should users (ensured that their transaction are > committed) do in this case ? > > Vadim > > -- Bruce Momjian maillist@candle.pha.pa.us From wieck@sapserv.debis.de Tue Nov 4 07:01:00 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA21697 for ; Tue, 4 Nov 1997 07:00:58 -0500 (EST) From: wieck@sapserv.debis.de Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by renoir.op.net (o1/$ Revision: 1.14 $) with SMTP id GAA06401 for ; Tue, 4 Nov 1997 06:48:25 -0500 (EST) Received: by orion.SAPserv.Hamburg.dsh.de (Linux Smail3.1.29.1 #1)} id m0xShVQ-000BGZC; Tue, 4 Nov 97 12:50 MET Message-Id: Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: maillist@candle.pha.pa.us (Bruce Momjian) Date: Tue, 4 Nov 1997 12:50:45 +0100 (MET) Cc: wieck@sapserv.debis.de, hackers@postgreSQL.org Reply-To: wieck@sapserv.debis.de (Jan Wieck) In-Reply-To: <199711041135.GAA19665@candle.pha.pa.us> from "Bruce Momjian" at Nov 4, 97 06:35:10 am X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Status: OR Bruce Momjian wrote: > I have been thinking about re-do logs, and I think it is a good idea. > It would not be hard to have the queries spit out to a separate file > configurable by the user. This way the recovery process will be very complicated. When multiple backends run concurrently, there are multiple transactions active at the same time. And what tuples are affected by an update e.g. depends much on the timing. I had something different in mind. The redo log contains the information from the executor (e.g. the transactionId, the tupleId and the new tuple values when calling ExecReplace()) and the information which transactions commit and which not. When recovering, those operations where the transactions committed are again passed to the executors functions that do the real updates with the values from the logfile. Until later, Jan -- #define OPINIONS "they are all mine - not those of debis or daimler-benz" #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================== wieck@sapserv.debis.de (Jan Wieck) # From owner-pgsql-hackers@hub.org Tue Nov 4 07:30:59 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22048 for ; Tue, 4 Nov 1997 07:30:57 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA07189 for ; Tue, 4 Nov 1997 07:18:02 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id HAA08856; Tue, 4 Nov 1997 07:03:37 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 07:03:03 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id HAA08487 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 07:02:46 -0500 (EST) Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id HAA08192 for ; Tue, 4 Nov 1997 07:02:02 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id HAA21653; Tue, 4 Nov 1997 07:00:20 -0500 (EST) From: Bruce Momjian Message-Id: <199711041200.HAA21653@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!u To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Tue, 4 Nov 1997 07:00:19 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <345EE75D.398A68D@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 4, 97 04:14:05 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > Bruce Momjian wrote: > > > > OK, here is a more formal description of what I am suggesting. It will > > give us commercial dbms reliability with no-fsync performance. > > Commercial dbms's usually only give restore up to 5 minutes before the > ^^^^^^^^^^^^^^^^^^^^^^^ > I'm sure that this is not true! > If on-line redo_file is damaged then you have > single ability: restore your last backup. > In all other cases database will be recovered up to the last > committed transaction automatically! I doubt commercial dbms's sync to disk after every transaction. They pick a time, maybe five seconds, and see all dirty pages get flushed by then. What they do do is to make certain that you are restored to a consistent state, perhaps 15 seconds ago. -- Bruce Momjian maillist@candle.pha.pa.us From vadim@sable.krasnoyarsk.su Tue Nov 4 07:32:45 1997 Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22066 for ; Tue, 4 Nov 1997 07:32:35 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id TAA20889; Tue, 4 Nov 1997 19:35:12 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <345F1680.60E33853@sable.krasnoyarsk.su> Date: Tue, 04 Nov 1997 19:35:12 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Jan Wieck CC: Bruce Momjian , marc@fallon.classyad.com, hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR wieck@sapserv.debis.de wrote: > > I still prefer redo logs. They should reside on a different > disk and the possibility of loosing the database files along > with the redo log is very small. Agreed. This way we could don't fsync data files and fsync both redo and pg_log. This is much faster. Vadim From vadim@sable.krasnoyarsk.su Tue Nov 4 08:00:58 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id IAA22371 for ; Tue, 4 Nov 1997 08:00:56 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA08540 for ; Tue, 4 Nov 1997 07:57:25 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id TAA20935; Tue, 4 Nov 1997 19:59:46 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <345F1C42.1F1A7590@sable.krasnoyarsk.su> Date: Tue, 04 Nov 1997 19:59:46 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Jan Wieck CC: Bruce Momjian , hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR wieck@sapserv.debis.de wrote: > > Bruce Momjian wrote: > > I have been thinking about re-do logs, and I think it is a good idea. > > It would not be hard to have the queries spit out to a separate file > > configurable by the user. > > This way the recovery process will be very complicated. When > multiple backends run concurrently, there are multiple > transactions active at the same time. And what tuples are > affected by an update e.g. depends much on the timing. > > I had something different in mind. The redo log contains the > information from the executor (e.g. the transactionId, the > tupleId and the new tuple values when calling ExecReplace()) > and the information which transactions commit and which not. > When recovering, those operations where the transactions > committed are again passed to the executors functions that do > the real updates with the values from the logfile. It seems that this is what Oracle does, but Sybase writes queries (with transaction ids, of 'course, and before execution) and begin, commit/abort events <-- this is better for non-overwriting system (shorter redo file), but, agreed, recovering is more complicated. Vadim From owner-pgsql-hackers@hub.org Tue Nov 4 22:35:45 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id WAA05060 for ; Tue, 4 Nov 1997 22:35:43 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id WAA26725 for ; Tue, 4 Nov 1997 22:35:10 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id WAA27875; Tue, 4 Nov 1997 22:23:14 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 22:20:55 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id WAA24162 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 22:20:50 -0500 (EST) Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id WAA22727 for ; Tue, 4 Nov 1997 22:20:18 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id WAA04674; Tue, 4 Nov 1997 22:17:52 -0500 (EST) From: Bruce Momjian Message-Id: <199711050317.WAA04674@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Tue, 4 Nov 1997 22:17:52 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <345F14E7.28CC1042@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 4, 97 07:28:23 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > Bruce Momjian wrote: > > > > > > > > Bruce Momjian wrote: > > > > > > > > OK, here is a more formal description of what I am suggesting. It will > > > > give us commercial dbms reliability with no-fsync performance. > > > > Commercial dbms's usually only give restore up to 5 minutes before the > > > ^^^^^^^^^^^^^^^^^^^^^^^ > > > I'm sure that this is not true! > > > > You may be right. This five minute figure is when you restore from your > > previous backup, then restore from the log file. > > > > Can't we do something like sync every 5 seconds, rather than after every > > transaction? It just seems like such overkill. > > Isn't -F and sync in crontab the same ? OK, let me again try to marshall some (any?) support for my suggestion. Informix version 5/7 has three levels of logging: unbuffered logging(our normal fsync mode), buffered logging, and no logging(our no fsync mode). We don't have buffered logging. Buffered logging guarantees you get put back to a consistent state after an os/server crash, usually to within 30/90 seconds. You do not have any partial transactions lying around, but you do have some transactions that you thought were done, but are not. This is faster then non-buffered logging, but not as fast as no logging. Guess what mode everyone uses? The one we don't have, buffered logging! Unbuffered logging performance is terrible. Non-buffered logging is used to load huge chunks of data during off-hours. The problem we have is that we fsync every transaction, which causes a 9-times slowdown in performance on single-integer inserts. That is a pretty heavy cost. But the alternative we give people is no-fsync mode, where we don't sync anything, and in a crash, you could come back with partially committed data in your database, if pg_log was sync'ed by the database, and only some of the data pages were sync'ed, so if any data was changing within 30 seconds of the crash, you have to restore your previous backup. We really need a middle solution, that gives better data integrity, for a smaller price. > > > > > Actually, I found a problem with my description. Because pg_log is not > > fsync'ed, after a crash, pages with new transactions could have been > > flushed to disk, but not the pg_log table that contains the transaction > > ids. The problem is that the new backend could assign a transaction id > > that is already in use. > > Impossible. Backend flushes pg_variable after fetching nex 32 xids. My suggestion is that we don't need to flush pg_variable or pg_log that much. My suggestion would speed up the test you do with 100 inserts inside a single transaction vs. 100 separate inserts. > > > > We could set a flag upon successful shutdown, and if it is not set on > > reboot, either do a vacuum to find the max transaction id, and > > invalidate all them not in pg_log as synced, or increase the next > > transaction id to some huge number and invalidate all them in between. > > I have a fix for the problem stated above, and it doesn't require a vacuum. We decide to fsync pg_variable and pg_log every 10,000 transactions or oids. Then if the database is brought up, and it was not brought down cleanly, you increment oid and transaction_id by 10,000, because you know you couldn't have gotten more than that. All intermediate transactions that are not marked committed/synced are marked aborted. --------------------------------------------------------------------------- The problem we have with the current system is that we sync by action, not by time interval. If you are doing tons of inserts or updates, it is syncing after every one. What people really want is something that will sync not after every action, but after every minute or five minutes, so when the system is busy, the syncing every minutes is just a small amount, and when the system is idle, no one cares if is syncs, and no one has to wait for the sync to complete. -- Bruce Momjian maillist@candle.pha.pa.us From matti@algonet.se Wed Nov 5 11:02:33 1997 Received: from smtp.algonet.se (tomei.algonet.se [194.213.74.114]) by candle.pha.pa.us (8.8.5/8.8.5) with SMTP id LAA02099 for ; Wed, 5 Nov 1997 11:02:28 -0500 (EST) Received: (qmail 6685 invoked from network); 5 Nov 1997 17:01:06 +0100 Received: from du228-6.ppp.algonet.se (HELO gamma) (root@195.100.6.228) by tomei.algonet.se with SMTP; 5 Nov 1997 17:01:06 +0100 Sender: root Message-ID: <34609871.27EED9D@algonet.se> Date: Wed, 05 Nov 1997 17:02:16 +0100 From: Mattias Kregert Organization: Algonet ISP X-Mailer: Mozilla 3.0Gold (X11; I; Linux 2.0.29 i586) MIME-Version: 1.0 To: Bruce Momjian CC: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711050317.WAA04674@candle.pha.pa.us> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR Bruce Momjian wrote: > > We don't have buffered logging. Buffered logging guarantees you get put > back to a consistent state after an os/server crash, usually to within > 30/90 seconds. You do not have any partial transactions lying around, > but you do have some transactions that you thought were done, but are ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > not. ^^^^ > > This is faster then non-buffered logging, but not as fast as no logging. > Guess what mode everyone uses? The one we don't have, buffered logging! Ouch! I would *not* like to use "buffered logging". What's the point in having the wrong data in the database and not knowing what updates, inserts or deletes to do to get the correct data? That's irrecoverable loss of data. Not what *I* want. Do *you* want it? > We really need a middle solution, that gives better data integrity, for > a smaller price. What I would like to have is this: If a backend tells the frontend that a transaction has completed, then that transaction should absolutely not get lost in case of a crash. What is needed is a log of changes since the last backup. This log would preferrably reside on a remote machine or at least another disk. Then, if the power goes in the middle of a disk write, the disk explodes and the computer goes up in flames, you can install Postgresql on a new machine, restore the last backup and re-run the change log. > The problem we have with the current system is that we sync by action, > not by time interval. If you are doing tons of inserts or updates, it > is syncing after every one. What people really want is something that > will sync not after every action, but after every minute or five > minutes, so when the system is busy, the syncing every minutes is just a > small amount, and when the system is idle, no one cares if is syncs, and > no one has to wait for the sync to complete. Yes, but this would only be the first step on the way to better crash-recovery. /* m */ From vadim@sable.krasnoyarsk.su Wed Nov 5 12:20:23 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05156 for ; Wed, 5 Nov 1997 12:20:13 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA24123 for ; Wed, 5 Nov 1997 11:44:49 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id XAA23062; Wed, 5 Nov 1997 23:48:52 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <3460A374.41C67EA6@sable.krasnoyarsk.su> Date: Wed, 05 Nov 1997 23:48:52 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Bruce Momjian CC: marc@fallon.classyad.com, hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711050317.WAA04674@candle.pha.pa.us> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR Bruce Momjian wrote: > > OK, let me again try to marshall some (any?) support for my suggestion. > > Informix version 5/7 has three levels of logging: unbuffered > logging(our normal fsync mode), buffered logging, and no logging(our no > fsync mode). > > We don't have buffered logging. Buffered logging guarantees you get put > back to a consistent state after an os/server crash, usually to within > 30/90 seconds. You do not have any partial transactions lying around, > but you do have some transactions that you thought were done, but are > not. > > This is faster then non-buffered logging, but not as fast as no logging. > Guess what mode everyone uses? The one we don't have, buffered logging! > > Unbuffered logging performance is terrible. Non-buffered logging is > used to load huge chunks of data during off-hours. > > The problem we have is that we fsync every transaction, which causes a > 9-times slowdown in performance on single-integer inserts. > > That is a pretty heavy cost. But the alternative we give people is > no-fsync mode, where we don't sync anything, and in a crash, you could > come back with partially committed data in your database, if pg_log was > sync'ed by the database, and only some of the data pages were sync'ed, > so if any data was changing within 30 seconds of the crash, you have to > restore your previous backup. > > We really need a middle solution, that gives better data integrity, for > a smaller price. There is no fsync synchronization currently. How could we be ensured that all modified data pages are flushed when we decided to flush pg_log ? If backend doesn't fsync data pages & pg_log at the commit time then when he must flush them (data first) ? This is what Oracle does: it uses dedicated DBWR process for writing/flushing modified data pages and LGWR process for writing/flushing redo log (redo log is transaction log also). LGWR always flushes log pages when committing, but durty data pages can be flushed _after_ transaction commit when DBWR decides that it's time to do it (ala checkpoints interval). Using redo log we could implement buffered logging quite easy. We can even don't use dedicated processes (but flush redo before pg_log), though having LGWR could simplify things. Without redo log or without some fsync synchronization we can't implement buffered logging. BTW, shared system cache could help with fsync synchonization, but, imho, redo is better (and faster for un-buffered logging too). > > > Actually, I found a problem with my description. Because pg_log is not > > > fsync'ed, after a crash, pages with new transactions could have been > > > flushed to disk, but not the pg_log table that contains the transaction > > > ids. The problem is that the new backend could assign a transaction id > > > that is already in use. > > > > Impossible. Backend flushes pg_variable after fetching nex 32 xids. > > My suggestion is that we don't need to flush pg_variable or pg_log that > much. My suggestion would speed up the test you do with 100 inserts > inside a single transaction vs. 100 separate inserts. > > > > > > > We could set a flag upon successful shutdown, and if it is not set on > > > reboot, either do a vacuum to find the max transaction id, and > > > invalidate all them not in pg_log as synced, or increase the next > > > transaction id to some huge number and invalidate all them in between. > > > > > I have a fix for the problem stated above, and it doesn't require a > vacuum. > > We decide to fsync pg_variable and pg_log every 10,000 transactions or > oids. Then if the database is brought up, and it was not brought down > cleanly, you increment oid and transaction_id by 10,000, because you > know you couldn't have gotten more than that. All intermediate > transactions that are not marked committed/synced are marked aborted. This is what I suppose to do by placing next available oid/xid in shmem: this allows pre-fetch much more than 32 ids at once without losing them when session closed. > The problem we have with the current system is that we sync by action, > not by time interval. If you are doing tons of inserts or updates, it > is syncing after every one. What people really want is something that > will sync not after every action, but after every minute or five > minutes, so when the system is busy, the syncing every minutes is just a > small amount, and when the system is idle, no one cares if is syncs, and > no one has to wait for the sync to complete. When I'm really doing tons of inserts/updates/deletes I use BEGIN/END. But it doesn't work for multi-user environment, of 'course. As for about what people really want, I remember that recently someone said in user list that if one want to have 10-20 inserts/sec then he should use mysql, but I got 25 inserts/sec on AIC-7880 & WD Enterprise when using one session, 32 inserts/sec with two sessions inserting in two different tables and only 20 inserts/sec with two sessions inserting in the same table. Imho, this difference between 20 and 32 is more important thing to fix, and these results are not so bad in comparison with others. (BTW, we shouldn't forget about using raw devices to speed up things). Vadim From vadim@sable.krasnoyarsk.su Wed Nov 5 12:20:08 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05150 for ; Wed, 5 Nov 1997 12:20:07 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA24889 for ; Wed, 5 Nov 1997 11:59:27 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id AAA23096; Thu, 6 Nov 1997 00:03:19 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <3460A6D7.167EB0E7@sable.krasnoyarsk.su> Date: Thu, 06 Nov 1997 00:03:19 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Mattias Kregert CC: Bruce Momjian , pgsql-hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711050317.WAA04674@candle.pha.pa.us> <34609871.27EED9D@algonet.se> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR Mattias Kregert wrote: > > Bruce Momjian wrote: > > > > We don't have buffered logging. Buffered logging guarantees you get put > > back to a consistent state after an os/server crash, usually to within > > 30/90 seconds. You do not have any partial transactions lying around, > > but you do have some transactions that you thought were done, but are > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > not. > ^^^^ > > > > This is faster then non-buffered logging, but not as fast as no logging. > > Guess what mode everyone uses? The one we don't have, buffered logging! > > Ouch! I would *not* like to use "buffered logging". And I. > What's the point in having the wrong data in the database and not > knowing what updates, inserts or deletes to do to get the correct data? > > That's irrecoverable loss of data. Not what *I* want. Do *you* want it? > > > We really need a middle solution, that gives better data integrity, for > > a smaller price. > > What I would like to have is this: > > If a backend tells the frontend that a transaction has completed, > then that transaction should absolutely not get lost in case of a crash. Agreed. > > What is needed is a log of changes since the last backup. This > log would preferrably reside on a remote machine or at least > another disk. Then, if the power goes in the middle of a disk write, > the disk explodes and the computer goes up in flames, you can > install Postgresql on a new machine, restore the last backup and > re-run the change log. Yes. And as I already said - this will speed up things because redo flushing is faster than flushing NNN tables which can be unflushed for some interval. Vadim From owner-pgsql-hackers@hub.org Wed Nov 5 12:20:39 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05168 for ; Wed, 5 Nov 1997 12:20:38 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id MAA25888 for ; Wed, 5 Nov 1997 12:14:14 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id MAA02259; Wed, 5 Nov 1997 12:02:33 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 12:00:21 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id MAA00750 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 12:00:10 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id LAA00598 for ; Wed, 5 Nov 1997 11:59:45 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id AAA23096; Thu, 6 Nov 1997 00:03:19 +0700 (KRS) Message-ID: <3460A6D7.167EB0E7@sable.krasnoyarsk.su> Date: Thu, 06 Nov 1997 00:03:19 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Mattias Kregert CC: Bruce Momjian , pgsql-hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711050317.WAA04674@candle.pha.pa.us> <34609871.27EED9D@algonet.se> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR Mattias Kregert wrote: > > Bruce Momjian wrote: > > > > We don't have buffered logging. Buffered logging guarantees you get put > > back to a consistent state after an os/server crash, usually to within > > 30/90 seconds. You do not have any partial transactions lying around, > > but you do have some transactions that you thought were done, but are > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > not. > ^^^^ > > > > This is faster then non-buffered logging, but not as fast as no logging. > > Guess what mode everyone uses? The one we don't have, buffered logging! > > Ouch! I would *not* like to use "buffered logging". And I. > What's the point in having the wrong data in the database and not > knowing what updates, inserts or deletes to do to get the correct data? > > That's irrecoverable loss of data. Not what *I* want. Do *you* want it? > > > We really need a middle solution, that gives better data integrity, for > > a smaller price. > > What I would like to have is this: > > If a backend tells the frontend that a transaction has completed, > then that transaction should absolutely not get lost in case of a crash. Agreed. > > What is needed is a log of changes since the last backup. This > log would preferrably reside on a remote machine or at least > another disk. Then, if the power goes in the middle of a disk write, > the disk explodes and the computer goes up in flames, you can > install Postgresql on a new machine, restore the last backup and > re-run the change log. Yes. And as I already said - this will speed up things because redo flushing is faster than flushing NNN tables which can be unflushed for some interval. Vadim From owner-pgsql-hackers@hub.org Wed Nov 5 14:01:02 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA07017 for ; Wed, 5 Nov 1997 14:00:59 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id NAA01759 for ; Wed, 5 Nov 1997 13:52:36 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id NAA03611; Wed, 5 Nov 1997 13:29:43 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 13:27:48 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id NAA03291 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 13:27:41 -0500 (EST) Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id NAA02823 for ; Wed, 5 Nov 1997 13:26:20 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id NAA05863; Wed, 5 Nov 1997 13:16:09 -0500 (EST) From: Bruce Momjian Message-Id: <199711051816.NAA05863@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Wed, 5 Nov 1997 13:16:09 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <3460A374.41C67EA6@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 5, 97 11:48:52 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > There is no fsync synchronization currently. > How could we be ensured that all modified data pages are flushed > when we decided to flush pg_log ? > If backend doesn't fsync data pages & pg_log at the commit time > then when he must flush them (data first) ? My idea was to have the backend do a 'sync' that causes the OS to sync all dirty pages, then mark all committed transactions on pg_log as 'synced'. Then sync pg_log. That way, there is a clear system where we know everything is flushed to disk, and we mark the transactions as synced. The only time that synced flag is used, is when the database starts up, and it sees that the previous shutdown was not clean. What am I missing here? > > This is what Oracle does: > > it uses dedicated DBWR process for writing/flushing modified > data pages and LGWR process for writing/flushing redo log > (redo log is transaction log also). LGWR always flushes log pages > when committing, but durty data pages can be flushed _after_ transaction > commit when DBWR decides that it's time to do it (ala checkpoints interval). > > Using redo log we could implement buffered logging quite easy. > We can even don't use dedicated processes (but flush redo before pg_log), > though having LGWR could simplify things. > > Without redo log or without some fsync synchronization we can't implement > buffered logging. BTW, shared system cache could help with > fsync synchonization, but, imho, redo is better (and faster for > un-buffered logging too). > I suggested my solution because it is clean, does flushing in one central location(postmaster), and does quick restores. > > > > Actually, I found a problem with my description. Because pg_log is not > > > > fsync'ed, after a crash, pages with new transactions could have been > > > > flushed to disk, but not the pg_log table that contains the transaction > > > > ids. The problem is that the new backend could assign a transaction id > > > > that is already in use. > > > > > > Impossible. Backend flushes pg_variable after fetching nex 32 xids. > > > > My suggestion is that we don't need to flush pg_variable or pg_log that > > much. My suggestion would speed up the test you do with 100 inserts > > inside a single transaction vs. 100 separate inserts. > > > > > > > > > > We could set a flag upon successful shutdown, and if it is not set on > > > > reboot, either do a vacuum to find the max transaction id, and > > > > invalidate all them not in pg_log as synced, or increase the next > > > > transaction id to some huge number and invalidate all them in between. > > > > > > > > I have a fix for the problem stated above, and it doesn't require a > > vacuum. > > > > We decide to fsync pg_variable and pg_log every 10,000 transactions or > > oids. Then if the database is brought up, and it was not brought down > > cleanly, you increment oid and transaction_id by 10,000, because you > > know you couldn't have gotten more than that. All intermediate > > transactions that are not marked committed/synced are marked aborted. > > This is what I suppose to do by placing next available oid/xid > in shmem: this allows pre-fetch much more than 32 ids at once > without losing them when session closed. > > > The problem we have with the current system is that we sync by action, > > not by time interval. If you are doing tons of inserts or updates, it > > is syncing after every one. What people really want is something that > > will sync not after every action, but after every minute or five > > minutes, so when the system is busy, the syncing every minutes is just a > > small amount, and when the system is idle, no one cares if is syncs, and > > no one has to wait for the sync to complete. > > When I'm really doing tons of inserts/updates/deletes I use > BEGIN/END. But it doesn't work for multi-user environment, of 'course. > As for about what people really want, I remember that recently someone > said in user list that if one want to have 10-20 inserts/sec then he > should use mysql, but I got 25 inserts/sec on AIC-7880 & WD Enterprise > when using one session, 32 inserts/sec with two sessions inserting > in two different tables and only 20 inserts/sec with two sessions > inserting in the same table. Imho, this difference between 20 and 32 > is more important thing to fix, and these results are not so bad > in comparison with others. > > (BTW, we shouldn't forget about using raw devices to speed up things). > > Vadim > -- Bruce Momjian maillist@candle.pha.pa.us From james@blarg.net Wed Nov 5 13:26:46 1997 Received: from animal.blarg.net (mail@animal.blarg.net [206.114.144.1]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id NAA06130 for ; Wed, 5 Nov 1997 13:26:26 -0500 (EST) Received: from animal.blarg.net (james@animal.blarg.net [206.114.144.1]) by animal.blarg.net (8.8.5/8.8.4) with SMTP id KAA09775; Wed, 5 Nov 1997 10:26:10 -0800 Date: Wed, 5 Nov 1997 10:26:10 -0800 (PST) From: "James A. Hillyerd" To: Bruce Momjian cc: Mattias Kregert , pgsql-hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! In-Reply-To: <199711051615.LAA02260@candle.pha.pa.us> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: OR On Wed, 5 Nov 1997, Bruce Momjian wrote: > > The strange thing I am hearing is that the people who use PostgreSQL are > more worried about data recovery from a crash than million-dollar > companies that use commercial databases. > If I may throw in my 2 cents, I'd prefer to see that database in a consistent state, with the data being up to date as of 1 minute or less before the crash. I'd rather have higher performance than up to the second data. -james [ James A. Hillyerd (JH2162) - james@blarg.net - Web Developer ] [ http://www.blarg.net/~james/ http://www.hyperglyphics.com/ ] [ 1024/B11C3751 CA 1C B3 A9 07 2F 57 C9 91 F4 73 F2 19 A4 C5 88 ] From vadim@sable.krasnoyarsk.su Wed Nov 5 14:24:03 1997 Received: from renoir.op.net (root@renoir.op.net [206.84.208.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA07830 for ; Wed, 5 Nov 1997 14:24:02 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id OAA02778 for ; Wed, 5 Nov 1997 14:13:45 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id CAA23376; Thu, 6 Nov 1997 02:17:51 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <3460C65E.446B9B3D@sable.krasnoyarsk.su> Date: Thu, 06 Nov 1997 02:17:50 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Bruce Momjian CC: marc@fallon.classyad.com, hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711051816.NAA05863@candle.pha.pa.us> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR Bruce Momjian wrote: > > > There is no fsync synchronization currently. > > How could we be ensured that all modified data pages are flushed > > when we decided to flush pg_log ? > > If backend doesn't fsync data pages & pg_log at the commit time > > then when he must flush them (data first) ? > > My idea was to have the backend do a 'sync' that causes the OS to sync > all dirty pages, then mark all committed transactions on pg_log as > 'synced'. Then sync pg_log. That way, there is a clear system where we > know everything is flushed to disk, and we mark the transactions as > synced. > > The only time that synced flag is used, is when the database starts up, > and it sees that the previous shutdown was not clean. > > What am I missing here? Ok, I see. But we can avoid 'synced' flag: we can make (just before sync-ing data pages) in-memory copies of "on-line" durty pg_log pages to being written/fsynced and perform write/fsync from these copies without stopping new commits in "on-line" page(s) (nothing must go to disk from "on-line" log pages). Vadim From owner-pgsql-hackers@hub.org Wed Nov 5 14:32:25 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA08101 for ; Wed, 5 Nov 1997 14:32:21 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id OAA22970; Wed, 5 Nov 1997 14:26:47 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 14:24:59 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id OAA22344 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 14:24:56 -0500 (EST) Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id OAA22319 for ; Wed, 5 Nov 1997 14:24:38 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id OAA07661; Wed, 5 Nov 1997 14:22:46 -0500 (EST) From: Bruce Momjian Message-Id: <199711051922.OAA07661@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Wed, 5 Nov 1997 14:22:45 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <3460A374.41C67EA6@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 5, 97 11:48:52 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR Just a clarification. When I say the postmaster issues a sync, I mean sync(2), not fsync(2). The sync flushes all dirty pages on all file systems. Ordinary users can issue this, and update usually does this every 30 seconds anyway. By using this, we let the kernel figure out which buffers are dirty. We don't have to figure this out in the postmaster. Then we update the pg_log table to mark those transactions as synced. On recovery from a crash, we mark the committed transactions as uncommitted if they do not have the synced flag. -- Bruce Momjian maillist@candle.pha.pa.us From owner-pgsql-hackers@hub.org Wed Nov 5 15:11:07 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA08751 for ; Wed, 5 Nov 1997 15:10:59 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA01986; Wed, 5 Nov 1997 15:01:24 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 14:59:32 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id OAA01414 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 14:59:28 -0500 (EST) Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id OAA01403 for ; Wed, 5 Nov 1997 14:59:14 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id OAA08283; Wed, 5 Nov 1997 14:53:55 -0500 (EST) From: Bruce Momjian Message-Id: <199711051953.OAA08283@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Wed, 5 Nov 1997 14:53:54 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <3460C65E.446B9B3D@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 6, 97 02:17:50 am X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > The only time that synced flag is used, is when the database starts up, > > and it sees that the previous shutdown was not clean. > > > > What am I missing here? > > Ok, I see. But we can avoid 'synced' flag: we can make (just before > sync-ing data pages) in-memory copies of "on-line" durty pg_log pages > to being written/fsynced and perform write/fsync from these copies > without stopping new commits in "on-line" page(s) (nothing must go > to disk from "on-line" log pages). [Working late tonight?] OK, now I am lost. We need the sync'ed flag so when we start the postmaster, and we see the database we not shut down properly, we use the flag to clear the commit flag from comitted transactions that were not sync'ed by the postmaster. In my opinion, we don't need any extra copies of pg_log, we can set those sync'ed flags while others are making changes, because before we did our sync, we gathered a list of committed transaction ids from the shared transaction id queue that I mentioned a while ago. We need this queue so we can find the newly-committed transactions that do not have a sync flag. Another way we could do this would be to scan pg_log before we sync, getting all the committed transaction ids without sync flags. No lock is needed on the table. If we miss some new ones, we will get them next time we scan. The problem I saw is that there is no way to see when to stop scanning the pg_log table for such transactions, so I thought each backend would have to put its newly committed transactions in a separate place. Maybe I am wrong. This syncing method just seems so natural since we have pg_log. That is why I keep bringing it up until people tell me I am stupid. This transaction commit/sync stuff is complicated, and takes a while to hash out in a group. --------------------------------------------------------------------------- I just re-read your description, and I see what you are saying. My idea has pg_log commit flag be real commit flags while the system is running, but on reboot after failure, we remove the commit flags on non-synced stuff before we start up. Your idea is to make pg_log commit flags only appear in in-memory copies of pg_log, and write the commit flags to disk only after the sync is done. Either way will work. The question is, "Which is easier?" The OS is going to sync pg_log on its own. We would almost need a second copy of pg_log, one copy to be used on postmaster startup, and a second to be used by running backends, and the postmaster would make a copy of the running backend pg_log, sync the disks, and copy it to the boot copy. I don't see how the backend is going to figure out which pg_log pages were modified and need to be sent to the boot copy of pg_log. Now that I am thinking, here is a good idea. Instead of a fancy transaction queue, what if we just have the backend record the lowest numbered transaction they commit in a shared memory area. If the current transaction id they commit is greater than the minimum, then change nothing. That way, the backend could copy all pg_log pages containing that minimum pg_log transaction id up to the most recent pg_log page, do the sync, and copy just those to the boot copy of pg_log. This eliminates the transaction id queue. The nice thing about the sync-flag in pg_log is that there is no copying by the backend. But we would have to spin through the file to set those sync bits. Your method just copies whole pages to the boot copy. --------------------------------------------------------------------------- I don't want to force this idea on anyone, or annoy anyone. I just think it needs to be considered. The concepts are unusual, so once people get the full idea, if they don't like it, we can trash it. I still think it holds promise. -- Bruce Momjian maillist@candle.pha.pa.us From hotz@jpl.nasa.gov Wed Nov 5 15:30:18 1997 Received: from hotzsun.jpl.nasa.gov (hotzsun.jpl.nasa.gov [137.79.51.138]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA09500 for ; Wed, 5 Nov 1997 15:30:16 -0500 (EST) Received: from [137.79.51.141] (hotzmac [137.79.51.141]) by hotzsun.jpl.nasa.gov (8.7.6/8.7.3) with SMTP id MAA10100; Wed, 5 Nov 1997 12:29:58 -0800 (PST) X-Sender: hotzmail@hotzsun.jpl.nasa.gov Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 5 Nov 1997 12:29:58 -0800 To: Bruce Momjian , matti@algonet.se (Mattias Kregert) From: hotz@jpl.nasa.gov (Henry B. Hotz) Subject: Re: [HACKERS] My $.02, was: PERFORMANCE and Good Bye, Time Travel! Cc: pgsql-hackers@postgreSQL.org Status: OR At 11:15 AM 11/5/97, Bruce Momjian wrote: >The strange thing I am hearing is that the people who use PostgreSQL are >more worried about data recovery from a crash than million-dollar >companies that use commercial databases. > >I don't get it. I would run PG to make sure that committed transactions were really written to disk because that seems "correct" and I don't have the kind of performance requirements that would push me to do otherwise. That said, I can see a need for varying performance/crash-immunity tradeoffs, and at least *one* option in between "correct" and "unprotected" operation would seem desirable. Signature failed Preliminary Design Review. Feasibility of a new signature is currently being evaluated. h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu From owner-pgsql-hackers@hub.org Thu Nov 6 15:51:23 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA04634 for ; Thu, 6 Nov 1997 15:51:08 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA24783; Thu, 6 Nov 1997 15:36:47 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 06 Nov 1997 15:36:07 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id PAA24514 for pgsql-hackers-outgoing; Thu, 6 Nov 1997 15:36:02 -0500 (EST) Received: from guevara.bildbasen.kiruna.se (guevara.bildbasen.kiruna.se [193.45.225.110]) by hub.org (8.8.5/8.7.5) with SMTP id PAA24319 for ; Thu, 6 Nov 1997 15:35:32 -0500 (EST) Received: (qmail 9764 invoked by uid 129); 6 Nov 1997 20:34:35 -0000 Date: 6 Nov 1997 20:34:35 -0000 Message-ID: <19971106203435.9763.qmail@guevara.bildbasen.kiruna.se> From: Goran Thyni To: pgsql-hackers@postgreSQL.org In-reply-to: <34619E9E.622F563@algonet.se> (message from Mattias Kregert on Thu, 06 Nov 1997 11:40:30 +0100) Subject: [HACKERS] Re: Performance vs. Crash Recovery Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-hackers@hub.org Precedence: bulk Status: OR I am getting quiet bored by this discussion, if someone has a strong opinion about how this should be done go ahead and make a test implementation then we have something to discuss. In the mean time, if you want best possible data protection mount you database disk sync:ed. This is safer than any scheme we could come up with. D*mned slow too, so everybody should be happy. :-) And I see no point implement a periodic sync in postmaster. All unices has cron, why not just use that. Or even a stupid 1-liner (ba)sh-script like: while true; do sleep 20; sync; done best regards, -- --------------------------------------------- Göran Thyni, sysadm, JMS Bildbasen, Kiruna From vadim@sable.krasnoyarsk.su Thu Nov 6 23:31:41 1997 Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA04723 for ; Thu, 6 Nov 1997 23:31:21 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25438; Fri, 7 Nov 1997 11:36:25 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <34629AC9.15FB7483@sable.krasnoyarsk.su> Date: Fri, 07 Nov 1997 11:36:25 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Bruce Momjian CC: marc@fallon.classyad.com, hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711051953.OAA08283@candle.pha.pa.us> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR Bruce Momjian wrote: > > > > The only time that synced flag is used, is when the database starts up, > > > and it sees that the previous shutdown was not clean. > > > > > > What am I missing here? > > > > Ok, I see. But we can avoid 'synced' flag: we can make (just before > > sync-ing data pages) in-memory copies of "on-line" durty pg_log pages > > to being written/fsynced and perform write/fsync from these copies > > without stopping new commits in "on-line" page(s) (nothing must go > > to disk from "on-line" log pages). > > [Working late tonight?] [Yes] > I just re-read your description, and I see what you are saying. My idea > has pg_log commit flag be real commit flags while the system is running, > but on reboot after failure, we remove the commit flags on non-synced > stuff before we start up. > > Your idea is to make pg_log commit flags only appear in in-memory copies > of pg_log, and write the commit flags to disk only after the sync is > done. > > Either way will work. The question is, "Which is easier?" The OS is > going to sync pg_log on its own. We would almost need a second copy of > pg_log, one copy to be used on postmaster startup, and a second to be > used by running backends, and the postmaster would make a copy of the > running backend pg_log, sync the disks, and copy it to the boot copy. > > I don't see how the backend is going to figure out which pg_log pages > were modified and need to be sent to the boot copy of pg_log. > > Now that I am thinking, here is a good idea. Instead of a fancy > transaction queue, what if we just have the backend record the lowest > numbered transaction they commit in a shared memory area. If the > current transaction id they commit is greater than the minimum, then > change nothing. That way, the backend could copy all pg_log pages > containing that minimum pg_log transaction id up to the most recent > pg_log page, do the sync, and copy just those to the boot copy of > pg_log. > > This eliminates the transaction id queue. > > The nice thing about the sync-flag in pg_log is that there is no copying > by the backend. But we would have to spin through the file to set those > sync bits. Your method just copies whole pages to the boot copy. In my plans to re-design transaction system I supposed to keep in shmem two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer to access them is not good idea. Also, we could use spinlock instead of lock manager to synchronize access to these pages (as I see in spin.c spinlock-s could be shared, but only exclusive ones are used) - spinlocks are faster. These two last pg_log pages are "online" ones. Race condition: when one or both of online pages becomes non-online ones, i.e. pg_log has to be expanded when writing commit/abort of "big" xid. This is how we could handle this in "buffered" logging (delayed fsync) mode: When backend want to write commit/abort status he acquires exclusive OnLineLogLock. If xid belongs to online pages then backend writes status and releases spin. If xid is less than least xid on 1st online page then backend releases spin and does exactly the same what he does in normal mode: flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer, update xid status, WriteBuffer, release write lock, flush pg_log. If xid is greater than max xid on 2nd online page then the simplest way is just do sync(); sync() (two times), flush 1st or both online pages, read new page(s) into online pages space, update xid status, release OnLineLogLock spin. We could try other ways but pg_log expanding is rare case (32K xids in one pg_log page)... All what postmaster will have to do is: 1. Get shared OnLineLogLock. 2. Copy 2 x 8K data to private place. 3. Release spinlock. 4. sync(); sync(); (two times!) 5. Flush online pages. We could use -F DELAY_TIME to turn fsync delayed mode ON. And, btw, having two bits for xact status we have only one unused status value (0x11) currently - I would like to use this for nested xactions and savepoints... > I don't want to force this idea on anyone, or annoy anyone. I just > think it needs to be considered. The concepts are unusual, so once > people get the full idea, if they don't like it, we can trash it. I > still think it holds promise. Agreed. Vadim From owner-pgsql-hackers@hub.org Fri Nov 7 01:32:49 1997 Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07651 for ; Fri, 7 Nov 1997 01:32:47 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA23328 for ; Thu, 6 Nov 1997 23:46:08 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id XAA19565; Thu, 6 Nov 1997 23:38:55 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 06 Nov 1997 23:36:53 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id XAA18911 for pgsql-hackers-outgoing; Thu, 6 Nov 1997 23:36:44 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id XAA18779 for ; Thu, 6 Nov 1997 23:36:02 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25448; Fri, 7 Nov 1997 11:40:29 +0700 (KRS) Message-ID: <34629BBD.59E2B600@sable.krasnoyarsk.su> Date: Fri, 07 Nov 1997 11:40:29 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Bruce Momjian CC: Mattias Kregert , pgsql-hackers@postgreSQL.org Subject: Re: Sync:ing data and log (Was: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!) References: <199711061810.NAA02118@candle.pha.pa.us> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR Bruce Momjian wrote: > > > > > Never use sync(). Use fsync(). Other processes should take care of their > > own syncing. If you use sync(), and you have a lot of disks, the sync > > can > > take half a minute if you are unlucky. > > We could use fsync() but then the postmaster has to know what tables > have dirty buffers, and I don't think there is an easy way to do this. There is one way - shared system cache... Vadim From vadim@sable.krasnoyarsk.su Fri Nov 7 01:31:24 1997 Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07639 for ; Fri, 7 Nov 1997 01:31:22 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA23094 for ; Thu, 6 Nov 1997 23:39:00 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25457; Fri, 7 Nov 1997 11:43:52 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <34629C87.3F54BC7E@sable.krasnoyarsk.su> Date: Fri, 07 Nov 1997 11:43:51 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Mattias Kregert CC: Bruce Momjian , pgsql-hackers@postgreSQL.org Subject: Re: Performance vs. Crash Recovery (Was: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!) References: <199711051615.LAA02260@candle.pha.pa.us> <34619E9E.622F563@algonet.se> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR Mattias Kregert wrote: > > > The strange thing I am hearing is that the people who use PostgreSQL are > > more worried about data recovery from a crash than million-dollar > > companies that use commercial databases. > > > > I don't get it. > > Perhaps the million-dollar companies have more sophisticated hardware, > like big expensive disk arrays, big UPS:es and parallell backup > servers? > If so, the risk of harware failure is much smaller for them. More of that - Informix is more stable than postgres: elog(FATAL) occures sometime and in fsync delayed mode this will cause of losing xaction too, not onle hard/OS failure. Vadim From owner-pgsql-hackers@hub.org Fri Nov 7 01:31:26 1997 Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07642 for ; Fri, 7 Nov 1997 01:31:24 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id AAA24358 for ; Fri, 7 Nov 1997 00:09:47 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id AAA00167; Fri, 7 Nov 1997 00:03:17 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 00:01:26 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id AAA29427 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 00:01:19 -0500 (EST) Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id AAA29364 for ; Fri, 7 Nov 1997 00:01:02 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id XAA05565; Thu, 6 Nov 1997 23:54:33 -0500 (EST) From: Bruce Momjian Message-Id: <199711070454.XAA05565@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Thu, 6 Nov 1997 23:54:33 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <34629AC9.15FB7483@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 7, 97 11:36:25 am X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR I was worried when you didn't respond to my last list of ideas. I thought perhaps the idea was getting on your nerves. I haven't dropped the idea because: 1) it offers 2-9 times speedup in database modifications 2) this is how the big commercial system handle it, and I think we need to give users this option. 3) in the way I had it designed, it wouldn't take much work to do it. Anything that promises that much speedup, if it can be done easy, I say lets consider it, even if you loose 60 seconds of changes. > In my plans to re-design transaction system I supposed to keep in shmem > two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer > to access them is not good idea. Also, we could use spinlock instead of > lock manager to synchronize access to these pages (as I see in spin.c > spinlock-s could be shared, but only exclusive ones are used) - spinlocks > are faster. Ah, so you already had the idea of having on-line pages in shared memory as part of a transaction system overhaul? Right now, does each backend lock/read/write/unlock to get at pg_log? Wow, that is bad. Perhaps mmap() would be a good idea. My system has msync() to flush mmap()'ed pages to the underlying file. You would still run fsync() after that. This may give us the best of both worlds: a shared-memory area of variable size, and control of when it get flushed to disk. Do other OS's have this? I have a feeling OS's with unified buffer caches don't have this ability to determine when the underlying mmap'ed file gets sent to the underlying file and disk. > These two last pg_log pages are "online" ones. Race condition: when one or > both of online pages becomes non-online ones, i.e. pg_log has to be expanded > when writing commit/abort of "big" xid. This is how we could handle this > in "buffered" logging (delayed fsync) mode: > > When backend want to write commit/abort status he acquires exclusive > OnLineLogLock. If xid belongs to online pages then backend writes status > and releases spin. If xid is less than least xid on 1st online page then > backend releases spin and does exactly the same what he does in normal mode: > flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer, > update xid status, WriteBuffer, release write lock, flush pg_log. > If xid is greater than max xid on 2nd online page then the simplest way is > just do sync(); sync() (two times), flush 1st or both online pages, > read new page(s) into online pages space, update xid status, > release OnLineLogLock spin. We could try other ways but pg_log expanding > is rare case (32K xids in one pg_log page)... > All what postmaster will have to do is: > 1. Get shared OnLineLogLock. > 2. Copy 2 x 8K data to private place. > 3. Release spinlock. > 4. sync(); sync(); (two times!) > 5. Flush online pages. > > We could use -F DELAY_TIME to turn fsync delayed mode ON. > > And, btw, having two bits for xact status we have only one unused > status value (0x11) currently - I would like to use this for > nested xactions and savepoints... I saw that. By keeping two copies of pg_log, one in memory to be used by all backend, and another that hits the disk, it certainly will work. > > > I don't want to force this idea on anyone, or annoy anyone. I just > > think it needs to be considered. The concepts are unusual, so once > > people get the full idea, if they don't like it, we can trash it. I > > still think it holds promise. > > Agreed. > > Vadim > -- Bruce Momjian maillist@candle.pha.pa.us From owner-pgsql-hackers@hub.org Fri Nov 7 01:03:09 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07314 for ; Fri, 7 Nov 1997 01:03:05 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id AAA07879; Fri, 7 Nov 1997 00:57:42 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 00:55:52 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id AAA03918 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 00:55:46 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id AAA02961 for ; Fri, 7 Nov 1997 00:55:18 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id MAA25567; Fri, 7 Nov 1997 12:59:29 +0700 (KRS) Message-ID: <3462AE40.FF6D5DF@sable.krasnoyarsk.su> Date: Fri, 07 Nov 1997 12:59:28 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Bruce Momjian CC: marc@fallon.classyad.com, hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711070454.XAA05565@candle.pha.pa.us> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR Bruce Momjian wrote: > > I was worried when you didn't respond to my last list of ideas. I > thought perhaps the idea was getting on your nerves. No, I was (and, unfortunately, I still) busy... > > I haven't dropped the idea because: > > 1) it offers 2-9 times speedup in database modifications > 2) this is how the big commercial system handle it, and I think > we need to give users this option. > 3) in the way I had it designed, it wouldn't take much work to > do it. > > Anything that promises that much speedup, if it can be done easy, I say > lets consider it, even if you loose 60 seconds of changes. I agreed with your un-buffered logging idea. This would be excellent feature for un-critical dbase usings (WWW, etc). > > > In my plans to re-design transaction system I supposed to keep in shmem > > two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer > > to access them is not good idea. Also, we could use spinlock instead of > > lock manager to synchronize access to these pages (as I see in spin.c > > spinlock-s could be shared, but only exclusive ones are used) - spinlocks > > are faster. > > Ah, so you already had the idea of having on-line pages in shared memory > as part of a transaction system overhaul? Right now, does each backend Yes. I hope to implement this in the next 1-2 weeks. > lock/read/write/unlock to get at pg_log? Wow, that is bad. Yes, he does. > > Perhaps mmap() would be a good idea. My system has msync() to flush > mmap()'ed pages to the underlying file. You would still run fsync() > after that. This may give us the best of both worlds: a shared-memory ^^^^^^^^^^^^^ > area of variable size, and control of when it get flushed to disk. Do ^^^^^^^^^^^^^^^^^^^^^ I like it. FreeBSD supports MAP_ANON Map anonymous memory not associated with any specific file. It would be nice to use mmap to get more "shared" memory, but I don't see reasons to mmap any particular file to memory. Having two last pg_log pages in memory + xact commit/abort writeback optimization (updation of commit/abort xmin/xmax status in tuples by any scan - we already have this) reduce access to "old" pg_log pages to zero. > other OS's have this? I have a feeling OS's with unified buffer caches > don't have this ability to determine when the underlying mmap'ed file > gets sent to the underlying file and disk. > > > These two last pg_log pages are "online" ones. Race condition: when one or > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded > > when writing commit/abort of "big" xid. This is how we could handle this > > in "buffered" logging (delayed fsync) mode: > > > > When backend want to write commit/abort status he acquires exclusive > > OnLineLogLock. If xid belongs to online pages then backend writes status > > and releases spin. If xid is less than least xid on 1st online page then > > backend releases spin and does exactly the same what he does in normal mode: > > flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer, > > update xid status, WriteBuffer, release write lock, flush pg_log. > > If xid is greater than max xid on 2nd online page then the simplest way is > > just do sync(); sync() (two times), flush 1st or both online pages, > > read new page(s) into online pages space, update xid status, > > release OnLineLogLock spin. We could try other ways but pg_log expanding > > is rare case (32K xids in one pg_log page)... > > All what postmaster will have to do is: > > 1. Get shared OnLineLogLock. > > 2. Copy 2 x 8K data to private place. > > 3. Release spinlock. > > 4. sync(); sync(); (two times!) > > 5. Flush online pages. > > > > We could use -F DELAY_TIME to turn fsync delayed mode ON. > > > > And, btw, having two bits for xact status we have only one unused > > status value (0x11) currently - I would like to use this for > > nested xactions and savepoints... ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ More about this: 0x11 could mean "this _child_ transaction is committed - you have to lookup in pg_xact_child to get parent xid and use pg_log again to get parent xact status". If parent committed then child xact status will be changed to 0x10 (committed) else - to 0x01 (aborted). Using this we could get xact nesting and savepoints by starting new child xaction inside running one... > > I saw that. By keeping two copies of pg_log, one in memory to be used ^^^^^^ Just two pg_log pages... > by all backend, and another that hits the disk, it certainly will work. Vadim From vadim@sable.krasnoyarsk.su Fri Nov 7 01:30:59 1997 Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07599 for ; Fri, 7 Nov 1997 01:30:58 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id BAA26793 for ; Fri, 7 Nov 1997 01:12:33 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id NAA25592; Fri, 7 Nov 1997 13:16:39 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <3462B247.ABD322C@sable.krasnoyarsk.su> Date: Fri, 07 Nov 1997 13:16:39 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Jan Wieck CC: Bruce Momjian , hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR wieck@sapserv.debis.de wrote: > > Bruce wrote: > > > > > > It seems that this is what Oracle does, but Sybase writes queries > > > > (with transaction ids, of 'course, and before execution) and > > > > begin, commit/abort events <-- this is better for non-overwriting > > > > system (shorter redo file), but, agreed, recovering is more complicated. > > > > > > > > Vadim > > > > > > > > > > Writing only the queries (and only those that really modify > > > data - no selects) would be much smarter and the redo files > > > will be shorter. But it wouldn't fit for PostgreSQL as long > > > as someone can submit a query like > > > > > > DELETE FROM xxx WHERE oid = 59337; > > > > Interesting point. Currently, an insert shows the OID as output in > > psql. Perhaps we could do a little oid-manipulating to set the oid of > > the insert. > > Only for simple inserts, not on > > INSERT INTO xxx SELECT any_type_of_merge_join; I don't know how but Sybase handle this and IDENTITY (case of OIDs) too. But I don't object you, Jan, just because I havn't time to do "log queries" redo implementation and so I would like to have "log changes" redo at least. (Actually, "log changes" is good for my production dbase with 1 - 2 thousand updations per day). (BTW, "incrementing" backup could be implemented without redo - I have some thoughts about this, - but having additional recovering is good in any case). Vadim From owner-pgsql-hackers@hub.org Fri Nov 7 15:42:58 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA22341 for ; Fri, 7 Nov 1997 15:42:55 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA02769; Fri, 7 Nov 1997 15:28:54 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 15:24:00 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id PAA01318 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 15:23:52 -0500 (EST) Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id PAA00705 for ; Fri, 7 Nov 1997 15:21:56 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id PAA20010; Fri, 7 Nov 1997 15:20:10 -0500 (EST) From: Bruce Momjian Message-Id: <199711072020.PAA20010@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Fri, 7 Nov 1997 15:20:10 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <3462AE40.FF6D5DF@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 7, 97 12:59:28 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > Anything that promises that much speedup, if it can be done easy, I say > > lets consider it, even if you loose 60 seconds of changes. > > I agreed with your un-buffered logging idea. This would be excellent > feature for un-critical dbase usings (WWW, etc). Actually, it is buffered logging. We currently have unbuffered logging, I think. > > > In my plans to re-design transaction system I supposed to keep in shmem > > > two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer > > > to access them is not good idea. Also, we could use spinlock instead of > > > lock manager to synchronize access to these pages (as I see in spin.c > > > spinlock-s could be shared, but only exclusive ones are used) - spinlocks > > > are faster. > > > > Ah, so you already had the idea of having on-line pages in shared memory > > as part of a transaction system overhaul? Right now, does each backend > > Yes. I hope to implement this in the next 1-2 weeks. > > > lock/read/write/unlock to get at pg_log? Wow, that is bad. > > Yes, he does. > > > > > Perhaps mmap() would be a good idea. My system has msync() to flush > > mmap()'ed pages to the underlying file. You would still run fsync() > > after that. This may give us the best of both worlds: a shared-memory > ^^^^^^^^^^^^^ > > area of variable size, and control of when it get flushed to disk. Do > ^^^^^^^^^^^^^^^^^^^^^ > I like it. FreeBSD supports > > MAP_ANON Map anonymous memory not associated with any specific file. > > It would be nice to use mmap to get more "shared" memory, but I don't see > reasons to mmap any particular file to memory. Having two last pg_log pages > in memory + xact commit/abort writeback optimization (updation of commit/abort > xmin/xmax status in tuples by any scan - we already have this) reduce access > to "old" pg_log pages to zero. I totally agree. There is no advantage to mmap() vs. shared memory for us. I thought if we could control when the mmap() gets flushed to disk, we could let the OS handle the syncing, but I doubt this is going to be portable. Though, we could mmap() pg_log, and that way backends would not have to read/write the blocks, and they could all see the same data. But with the new scheme, they have most transaction ids in shared memory. Interesting you mention the scan updating the transaction status. We would have a problem here. It is possible a backend will update the commit status of a data page, and that data page will make it to disk, but if there is a crash before the update pg_log gets sync'ed, there would be a partial transaction in the system. I don't know any way that a backend would know the transaction has hit disk, and the data commit flag could be set. You don't want to update the commit flag of the data page until entire transaction has been sync'ed. The only way to do that would be to have a 'commit and synced' flag, but you want to save that for nested transactions. Another case this could come in handy is to allow reuse of superceeded data rows. If the transaction is committed and synced, the row space could be reused by another transaction. > > other OS's have this? I have a feeling OS's with unified buffer caches > > don't have this ability to determine when the underlying mmap'ed file > > gets sent to the underlying file and disk. > > > > > These two last pg_log pages are "online" ones. Race condition: when one or > > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded > > > when writing commit/abort of "big" xid. This is how we could handle this > > > in "buffered" logging (delayed fsync) mode: > > > > > > When backend want to write commit/abort status he acquires exclusive > > > OnLineLogLock. If xid belongs to online pages then backend writes status This confuses me. Why does a backend need to lock pg_log to update a transaction status? > > > and releases spin. If xid is less than least xid on 1st online page then > > > backend releases spin and does exactly the same what he does in normal mode: > > > flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer, > > > update xid status, WriteBuffer, release write lock, flush pg_log. > > > If xid is greater than max xid on 2nd online page then the simplest way is > > > just do sync(); sync() (two times), flush 1st or both online pages, > > > read new page(s) into online pages space, update xid status, > > > release OnLineLogLock spin. We could try other ways but pg_log expanding > > > is rare case (32K xids in one pg_log page)... > > > All what postmaster will have to do is: > > > 1. Get shared OnLineLogLock. > > > 2. Copy 2 x 8K data to private place. > > > 3. Release spinlock. > > > 4. sync(); sync(); (two times!) > > > 5. Flush online pages. Great. > > > > > > We could use -F DELAY_TIME to turn fsync delayed mode ON. > > > > > > And, btw, having two bits for xact status we have only one unused > > > status value (0x11) currently - I would like to use this for > > > nested xactions and savepoints... > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > More about this: 0x11 could mean "this _child_ transaction is committed - > you have to lookup in pg_xact_child to get parent xid and use pg_log again > to get parent xact status". If parent committed then child xact status > will be changed to 0x10 (committed) else - to 0x01 (aborted). Using this > we could get xact nesting and savepoints by starting new child xaction > inside running one... OK. > > > > > I saw that. By keeping two copies of pg_log, one in memory to be used > ^^^^^^ > Just two pg_log pages... Got it. -- Bruce Momjian maillist@candle.pha.pa.us From owner-pgsql-hackers@hub.org Sun Nov 9 22:07:36 1997 Received: from hub.org (hub.org [209.47.148.200]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id WAA04655 for ; Sun, 9 Nov 1997 22:07:30 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id VAA07023; Sun, 9 Nov 1997 21:55:54 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 09 Nov 1997 21:52:20 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id VAA06174 for pgsql-hackers-outgoing; Sun, 9 Nov 1997 21:52:13 -0500 (EST) Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id VAA06092 for ; Sun, 9 Nov 1997 21:51:58 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id VAA04150; Sun, 9 Nov 1997 21:50:29 -0500 (EST) From: Bruce Momjian Message-Id: <199711100250.VAA04150@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! (fwd) To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Sun, 9 Nov 1997 21:50:29 -0500 (EST) Cc: hackers@postgreSQL.org (PostgreSQL-development) X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR Forwarded message: > > > Perhaps mmap() would be a good idea. My system has msync() to flush > > > mmap()'ed pages to the underlying file. You would still run fsync() > > > after that. This may give us the best of both worlds: a shared-memory > > ^^^^^^^^^^^^^ > > > area of variable size, and control of when it get flushed to disk. Do > > ^^^^^^^^^^^^^^^^^^^^^ > > I like it. FreeBSD supports > > > > MAP_ANON Map anonymous memory not associated with any specific file. > > > > It would be nice to use mmap to get more "shared" memory, but I don't see > > reasons to mmap any particular file to memory. Having two last pg_log pages > > in memory + xact commit/abort writeback optimization (updation of commit/abort > > xmin/xmax status in tuples by any scan - we already have this) reduce access > > to "old" pg_log pages to zero. > > I totally agree. There is no advantage to mmap() vs. shared memory for > us. I thought if we could control when the mmap() gets flushed to disk, > we could let the OS handle the syncing, but I doubt this is going to be > portable. > > Though, we could mmap() pg_log, and that way backends would not have to > read/write the blocks, and they could all see the same data. But with > the new scheme, they have most transaction ids in shared memory. > > Interesting you mention the scan updating the transaction status. We > would have a problem here. It is possible a backend will update the > commit status of a data page, and that data page will make it to disk, > but if there is a crash before the update pg_log gets sync'ed, there > would be a partial transaction in the system. > > I don't know any way that a backend would know the transaction has hit > disk, and the data commit flag could be set. You don't want to update > the commit flag of the data page until entire transaction has been > sync'ed. The only way to do that would be to have a 'commit and synced' > flag, but you want to save that for nested transactions. > > Another case this could come in handy is to allow reuse of superceeded > data rows. If the transaction is committed and synced, the row space > could be reused by another transaction. > I have been thinking about the mmap() issue, and it seems a natural for pg_log. You can have every backend mmap() pg_log. It becomes a dynamic shared memory area that is auto-initialized to the contents of pg_log, and all changes can be made by all backends. No locking needed. We can also flush the changes to the underlying file. Under bsdi, you can also have the mmap area follow you across exec() calls, so each backend doesn't have to do anything. I want to replace exec with fork also, so the stuff would be auto-loaded in the address space of each backend. This way, you don't have to have two on-line pages and move them around as pg_log grows. The only problem remains how to mark certain transactions as synced or force only synced transactions to hit the pg_log file itself, and data row commit status only should be updated for synced transactions. -- Bruce Momjian maillist@candle.pha.pa.us From vadim@sable.krasnoyarsk.su Sun Nov 9 23:00:58 1997 Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA05394 for ; Sun, 9 Nov 1997 23:00:55 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id WAA25139 for ; Sun, 9 Nov 1997 22:42:33 -0500 (EST) Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id KAA01845; Mon, 10 Nov 1997 10:49:25 +0700 (KRS) Sender: root@www.krasnet.ru Message-ID: <34668444.237C228A@sable.krasnoyarsk.su> Date: Mon, 10 Nov 1997 10:49:24 +0700 From: "Vadim B. Mikheev" Organization: ITTS (Krasnoyarsk) X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386) MIME-Version: 1.0 To: Bruce Momjian CC: marc@fallon.classyad.com, hackers@postgreSQL.org Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! References: <199711072020.PAA20010@candle.pha.pa.us> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: OR Bruce Momjian wrote: > > > > Anything that promises that much speedup, if it can be done easy, I say > > > lets consider it, even if you loose 60 seconds of changes. > > > > I agreed with your un-buffered logging idea. This would be excellent > > feature for un-critical dbase usings (WWW, etc). > > Actually, it is buffered logging. We currently have unbuffered logging, > I think. Sorry - mistyping. > > Interesting you mention the scan updating the transaction status. We > would have a problem here. It is possible a backend will update the > commit status of a data page, and that data page will make it to disk, > but if there is a crash before the update pg_log gets sync'ed, there > would be a partial transaction in the system. You're right! Currently, only system relations can be affected by this: backend releases locks on user tables after syncing data and pg_log. I'll keep this in mind... > > > > These two last pg_log pages are "online" ones. Race condition: when one or > > > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded > > > > when writing commit/abort of "big" xid. This is how we could handle this > > > > in "buffered" logging (delayed fsync) mode: > > > > > > > > When backend want to write commit/abort status he acquires exclusive > > > > OnLineLogLock. If xid belongs to online pages then backend writes status > > This confuses me. Why does a backend need to lock pg_log to update a > transaction status? What if two backends try to change xact statuses in the same byte ? Vadim From owner-pgsql-hackers@hub.org Sun Nov 9 23:59:50 1997 Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA06523 for ; Sun, 9 Nov 1997 23:59:48 -0500 (EST) Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA27105 for ; Sun, 9 Nov 1997 23:41:39 -0500 (EST) Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id XAA08860; Sun, 9 Nov 1997 23:35:42 -0500 (EST) Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 09 Nov 1997 23:31:50 -0500 (EST) Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id XAA07962 for pgsql-hackers-outgoing; Sun, 9 Nov 1997 23:31:43 -0500 (EST) Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id XAA07875 for ; Sun, 9 Nov 1997 23:31:28 -0500 (EST) Received: (from maillist@localhost) by candle.pha.pa.us (8.8.5/8.8.5) id XAA05566; Sun, 9 Nov 1997 23:17:41 -0500 (EST) From: Bruce Momjian Message-Id: <199711100417.XAA05566@candle.pha.pa.us> Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev) Date: Sun, 9 Nov 1997 23:17:41 -0500 (EST) Cc: marc@fallon.classyad.com, hackers@postgreSQL.org In-Reply-To: <34668444.237C228A@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 10, 97 10:49:24 am X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@hub.org Precedence: bulk Status: OR > > > > > These two last pg_log pages are "online" ones. Race condition: when one or > > > > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded > > > > > when writing commit/abort of "big" xid. This is how we could handle this > > > > > in "buffered" logging (delayed fsync) mode: > > > > > > > > > > When backend want to write commit/abort status he acquires exclusive > > > > > OnLineLogLock. If xid belongs to online pages then backend writes status > > > > This confuses me. Why does a backend need to lock pg_log to update a > > transaction status? > > What if two backends try to change xact statuses in the same byte ? Ooo, you got me. I so hoped to prevent locking. It would be nice if: *x |= 3; would be atomic, but I don't think it is. Most RISC machines don't even have an OR against a memory address, I think. -- Bruce Momjian maillist@candle.pha.pa.us