postgresql/doc/TODO.detail/vacuum

From Inoue@tpf.co.jp Tue Jan 18 19:08:30 2000
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA10148
	for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 20:08:27 -0500 (EST)
Received: from cadzone ([126.0.1.40] (may be forged))
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
   id KAA02790; Wed, 19 Jan 2000 10:08:02 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>
Cc: "pgsql-hackers" <pgsql-hackers@postgreSQL.org>
Subject: RE: [HACKERS] Index recreation in vacuum
Date: Wed, 19 Jan 2000 10:13:40 +0900
Message-ID: <000201bf621a$6b9baf20$2801007e@tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Importance: Normal
In-Reply-To: <200001181821.NAA02988@candle.pha.pa.us>
Status: ORr

> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
>
> [Charset iso-8859-1 unsupported, filtering to ASCII...]
> > Hi all,
> >
> > I'm trying to implement REINDEX command.
> >
> > REINDEX operation itself is available everywhere and
> > I've thought about applying it to VACUUM.
>
> That is a good idea.  Vacuuming of indexes can be very slow.
>
> > .
> > My plan is as follows.
> >
> > Add a new option to force index recreation in vacuum
> > and if index recreation is specified.
>
> Couldn't we auto-recreate indexes based on the number of tuples moved by
> vacuum,

Yes,we could probably do it. But I'm not sure the availability of new
vacuum.

New vacuum would give us a big advantage that
1) Much faster than current if vacuum remove/moves many tuples.
2) Does shrink index files

But in case of abort/crash
1) couldn't choose index scan for the table
2) unique constraints of the table would be lost

I don't know how people estimate this disadvantage.

>
> > Now I'm inclined to use relhasindex of pg_class to
> > validate/invalidate indexes of a table at once.
>
> There are a few calls to CatalogIndexInsert() that know the
> system table they
> are using and know it has indexes, so it does not check that field.  You
> could add cases for that.
>

I think there aren't so many places to check.
I would examine it if my idea is OK.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

From owner-pgsql-hackers@hub.org Tue Jan 18 19:15:27 2000
Received: from hub.org (hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA10454
	for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 20:15:26 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id UAA42280;
	Tue, 18 Jan 2000 20:10:35 -0500 (EST)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 20:10:30 -0500
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id UAA42081
	for pgsql-hackers-outgoing; Tue, 18 Jan 2000 20:09:31 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by hub.org (8.9.3/8.9.3) with ESMTP id UAA41943
	for <pgsql-hackers@postgreSQL.org>; Tue, 18 Jan 2000 20:08:39 -0500 (EST)
	(envelope-from Inoue@tpf.co.jp)
Received: from cadzone ([126.0.1.40] (may be forged))
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
   id KAA02790; Wed, 19 Jan 2000 10:08:02 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>
Cc: "pgsql-hackers" <pgsql-hackers@postgreSQL.org>
Subject: RE: [HACKERS] Index recreation in vacuum
Date: Wed, 19 Jan 2000 10:13:40 +0900
Message-ID: <000201bf621a$6b9baf20$2801007e@tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Importance: Normal
In-Reply-To: <200001181821.NAA02988@candle.pha.pa.us>
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
>
> [Charset iso-8859-1 unsupported, filtering to ASCII...]
> > Hi all,
> >
> > I'm trying to implement REINDEX command.
> >
> > REINDEX operation itself is available everywhere and
> > I've thought about applying it to VACUUM.
>
> That is a good idea.  Vacuuming of indexes can be very slow.
>
> > .
> > My plan is as follows.
> >
> > Add a new option to force index recreation in vacuum
> > and if index recreation is specified.
>
> Couldn't we auto-recreate indexes based on the number of tuples moved by
> vacuum,

Yes,we could probably do it. But I'm not sure the availability of new
vacuum.

New vacuum would give us a big advantage that
1) Much faster than current if vacuum remove/moves many tuples.
2) Does shrink index files

But in case of abort/crash
1) couldn't choose index scan for the table
2) unique constraints of the table would be lost

I don't know how people estimate this disadvantage.

>
> > Now I'm inclined to use relhasindex of pg_class to
> > validate/invalidate indexes of a table at once.
>
> There are a few calls to CatalogIndexInsert() that know the
> system table they
> are using and know it has indexes, so it does not check that field.  You
> could add cases for that.
>

I think there aren't so many places to check.
I would examine it if my idea is OK.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

************

From owner-pgsql-hackers@hub.org Tue Jan 18 19:57:21 2000
Received: from hub.org (hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA11764
	for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 20:57:19 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id UAA50653;
	Tue, 18 Jan 2000 20:52:38 -0500 (EST)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 20:52:30 -0500
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id UAA50513
	for pgsql-hackers-outgoing; Tue, 18 Jan 2000 20:51:32 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67])
	by hub.org (8.9.3/8.9.3) with ESMTP id UAA50462
	for <pgsql-hackers@postgreSQL.org>; Tue, 18 Jan 2000 20:51:06 -0500 (EST)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.9.0/8.9.0) id UAA11421;
	Tue, 18 Jan 2000 20:50:50 -0500 (EST)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-Id: <200001190150.UAA11421@candle.pha.pa.us>
Subject: Re: [HACKERS] Index recreation in vacuum
In-Reply-To: <000201bf621a$6b9baf20$2801007e@tpf.co.jp> from Hiroshi Inoue at
	"Jan 19, 2000 10:13:40 am"
To: Hiroshi Inoue <Inoue@tpf.co.jp>
Date: Tue, 18 Jan 2000 20:50:50 -0500 (EST)
CC: pgsql-hackers <pgsql-hackers@postgreSQL.org>
X-Mailer: ELM [version 2.4ME+ PL66 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@postgreSQL.org
Status: ORr

> > > Add a new option to force index recreation in vacuum
> > > and if index recreation is specified.
> >
> > Couldn't we auto-recreate indexes based on the number of tuples moved by
> > vacuum,
>
> Yes,we could probably do it. But I'm not sure the availability of new
> vacuum.
>
> New vacuum would give us a big advantage that
> 1) Much faster than current if vacuum remove/moves many tuples.
> 2) Does shrink index files
>
> But in case of abort/crash
> 1) couldn't choose index scan for the table
> 2) unique constraints of the table would be lost
>
> I don't know how people estimate this disadvantage.

That's why I was recommending rename().  The actual window of
vunerability goes from perhaps hours to fractions of a second.

In fact, if I understand this right, you could make the vulerability
zero by just performing the rename as one operation.

In fact, for REINDEX cases where you don't have a lock on the entire
table as you do in vacuum, you could reindex the table with a simple
read-lock on the base table and index, and move the new index into place
with the users seeing no change.  Only people traversing the index
during the change would have a problem.  You just need an exclusive
access on the index for the duration of the rename() so no one is
traversing the index during the rename().

Destroying the index and recreating opens a large time span that there
is no index, and you have to jury-rig something so people don't try to
use the index.  With rename() you just put the new index in place with
one operation.  Just don't let people traverse the index during the
change.  The pointers to the heap tuples is the same in both indexes.

In fact, with WAL, we will allow multiple physical files for the same
table by appending the table oid to the file name.  In this case, the
old index could be deleted by rename, and people would continue to use
the old index until they deleted the open file pointers.  Not sure how
this works in practice because new tuples would not be inserted into the
old copy of the index.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

************

From pgman Tue Jan 18 20:04:11 2000
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.9.0/8.9.0) id VAA11990;
	Tue, 18 Jan 2000 21:04:11 -0500 (EST)
From: Bruce Momjian <pgman>
Message-Id: <200001190204.VAA11990@candle.pha.pa.us>
Subject: Re: [HACKERS] Index recreation in vacuum
In-Reply-To: <200001190150.UAA11421@candle.pha.pa.us> from Bruce Momjian at "Jan
	18, 2000 08:50:50 pm"
To: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Tue, 18 Jan 2000 21:04:11 -0500 (EST)
CC: Hiroshi Inoue <Inoue@tpf.co.jp>,
        pgsql-hackers <pgsql-hackers@postgreSQL.org>
X-Mailer: ELM [version 2.4ME+ PL66 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: OR

> > I don't know how people estimate this disadvantage.
>
> That's why I was recommending rename().  The actual window of
> vunerability goes from perhaps hours to fractions of a second.
>
> In fact, if I understand this right, you could make the vulerability
> zero by just performing the rename as one operation.
>
> In fact, for REINDEX cases where you don't have a lock on the entire
> table as you do in vacuum, you could reindex the table with a simple
> read-lock on the base table and index, and move the new index into place
> with the users seeing no change.  Only people traversing the index
> during the change would have a problem.  You just need an exclusive
> access on the index for the duration of the rename() so no one is
> traversing the index during the rename().
>
> Destroying the index and recreating opens a large time span that there
> is no index, and you have to jury-rig something so people don't try to
> use the index.  With rename() you just put the new index in place with
> one operation.  Just don't let people traverse the index during the
> change.  The pointers to the heap tuples is the same in both indexes.
>
> In fact, with WAL, we will allow multiple physical files for the same
> table by appending the table oid to the file name.  In this case, the
> old index could be deleted by rename, and people would continue to use
> the old index until they deleted the open file pointers.  Not sure how
> this works in practice because new tuples would not be inserted into the
> old copy of the index.

Maybe I am all wrong here.  Maybe most of the advantage of rename() are
meaningless with reindex using during vacuum, which is the most
important use of reindex.

Let's look at index using during vacuum.  Right now, how does vacuum
handle indexes when it moves a tuple?  Does it do each index update as
it moves a tuple?  Is that why it is so slow?

If we don't do that and vacuum fails, what state is the table left in?
If we don't update the index for every tuple, the index is invalid in a
vacuum failure.  rename() is not going to help us here.  It keeps the
old index around, but the index is invalid anyway, right?


--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

From Inoue@tpf.co.jp Tue Jan 18 20:18:48 2000
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA12437
	for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 21:18:46 -0500 (EST)
Received: from cadzone ([126.0.1.40] (may be forged))
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
   id LAA02845; Wed, 19 Jan 2000 11:18:18 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>
Cc: "pgsql-hackers" <pgsql-hackers@postgreSQL.org>
Subject: RE: [HACKERS] Index recreation in vacuum
Date: Wed, 19 Jan 2000 11:23:55 +0900
Message-ID: <000801bf6224$3bfdd9a0$2801007e@tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Importance: Normal
In-Reply-To: <200001190204.VAA11990@candle.pha.pa.us>
Status: ORr

> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
>
> > > I don't know how people estimate this disadvantage.
> >
> > That's why I was recommending rename().  The actual window of
> > vunerability goes from perhaps hours to fractions of a second.
> >
> > In fact, if I understand this right, you could make the vulerability
> > zero by just performing the rename as one operation.
> >
> > In fact, for REINDEX cases where you don't have a lock on the entire
> > table as you do in vacuum, you could reindex the table with a simple
> > read-lock on the base table and index, and move the new index into place
> > with the users seeing no change.  Only people traversing the index
> > during the change would have a problem.  You just need an exclusive
> > access on the index for the duration of the rename() so no one is
> > traversing the index during the rename().
> >
> > Destroying the index and recreating opens a large time span that there
> > is no index, and you have to jury-rig something so people don't try to
> > use the index.  With rename() you just put the new index in place with
> > one operation.  Just don't let people traverse the index during the
> > change.  The pointers to the heap tuples is the same in both indexes.
> >
> > In fact, with WAL, we will allow multiple physical files for the same
> > table by appending the table oid to the file name.  In this case, the
> > old index could be deleted by rename, and people would continue to use
> > the old index until they deleted the open file pointers.  Not sure how
> > this works in practice because new tuples would not be inserted into the
> > old copy of the index.
>
> Maybe I am all wrong here.  Maybe most of the advantage of rename() are
> meaningless with reindex using during vacuum, which is the most
> important use of reindex.
>
> Let's look at index using during vacuum.  Right now, how does vacuum
> handle indexes when it moves a tuple?  Does it do each index update as
> it moves a tuple?  Is that why it is so slow?
>

Yes,I believe so.  It's necessary to keep consistency between heap
table and indexes even in case of abort/crash.
As far as I see,it has been a big charge for vacuum.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


From owner-pgsql-hackers@hub.org Tue Jan 18 20:53:49 2000
Received: from hub.org (hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA13285
	for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 21:53:47 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id VAA65183;
	Tue, 18 Jan 2000 21:47:47 -0500 (EST)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 21:47:33 -0500
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id VAA65091
	for pgsql-hackers-outgoing; Tue, 18 Jan 2000 21:46:33 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67])
	by hub.org (8.9.3/8.9.3) with ESMTP id VAA65034
	for <pgsql-hackers@postgreSQL.org>; Tue, 18 Jan 2000 21:46:12 -0500 (EST)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.9.0/8.9.0) id VAA13040;
	Tue, 18 Jan 2000 21:45:27 -0500 (EST)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-Id: <200001190245.VAA13040@candle.pha.pa.us>
Subject: Re: [HACKERS] Index recreation in vacuum
In-Reply-To: <000801bf6224$3bfdd9a0$2801007e@tpf.co.jp> from Hiroshi Inoue at
	"Jan 19, 2000 11:23:55 am"
To: Hiroshi Inoue <Inoue@tpf.co.jp>
Date: Tue, 18 Jan 2000 21:45:27 -0500 (EST)
CC: pgsql-hackers <pgsql-hackers@postgreSQL.org>
X-Mailer: ELM [version 2.4ME+ PL66 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

> > > In fact, for REINDEX cases where you don't have a lock on the entire
> > > table as you do in vacuum, you could reindex the table with a simple
> > > read-lock on the base table and index, and move the new index into place
> > > with the users seeing no change.  Only people traversing the index
> > > during the change would have a problem.  You just need an exclusive
> > > access on the index for the duration of the rename() so no one is
> > > traversing the index during the rename().
> > >
> > > Destroying the index and recreating opens a large time span that there
> > > is no index, and you have to jury-rig something so people don't try to
> > > use the index.  With rename() you just put the new index in place with
> > > one operation.  Just don't let people traverse the index during the
> > > change.  The pointers to the heap tuples is the same in both indexes.
> > >
> > > In fact, with WAL, we will allow multiple physical files for the same
> > > table by appending the table oid to the file name.  In this case, the
> > > old index could be deleted by rename, and people would continue to use
> > > the old index until they deleted the open file pointers.  Not sure how
> > > this works in practice because new tuples would not be inserted into the
> > > old copy of the index.
> >
> > Maybe I am all wrong here.  Maybe most of the advantage of rename() are
> > meaningless with reindex using during vacuum, which is the most
> > important use of reindex.
> >
> > Let's look at index using during vacuum.  Right now, how does vacuum
> > handle indexes when it moves a tuple?  Does it do each index update as
> > it moves a tuple?  Is that why it is so slow?
> >
>
> Yes,I believe so.  It's necessary to keep consistency between heap
> table and indexes even in case of abort/crash.
> As far as I see,it has been a big charge for vacuum.

OK, how about making a copy of the heap table before starting vacuum,
moving all the tuples in that copy, create new index, and then move the
new heap and indexes over the old version.  We already have an exclusive
lock on the table.  That would be 100% reliable, with the disadvantage
of using 2x the disk space.  Seems like a big win.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

************

From owner-pgsql-hackers@hub.org Tue Jan 18 21:15:24 2000
Received: from hub.org (hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA14115
	for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 22:15:23 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id WAA72950;
	Tue, 18 Jan 2000 22:10:40 -0500 (EST)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 22:10:32 -0500
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id WAA72644
	for pgsql-hackers-outgoing; Tue, 18 Jan 2000 22:09:36 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67])
	by hub.org (8.9.3/8.9.3) with ESMTP id WAA72504
	for <pgsql-hackers@postgreSQL.org>; Tue, 18 Jan 2000 22:08:40 -0500 (EST)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.9.0/8.9.0) id WAA13965;
	Tue, 18 Jan 2000 22:08:25 -0500 (EST)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-Id: <200001190308.WAA13965@candle.pha.pa.us>
Subject: Re: [HACKERS] Index recreation in vacuum
In-Reply-To: <000f01bf622a$bf423940$2801007e@tpf.co.jp> from Hiroshi Inoue at
	"Jan 19, 2000 12:10:32 pm"
To: Hiroshi Inoue <Inoue@tpf.co.jp>
Date: Tue, 18 Jan 2000 22:08:25 -0500 (EST)
CC: pgsql-hackers <pgsql-hackers@postgreSQL.org>
X-Mailer: ELM [version 2.4ME+ PL66 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=UNKNOWN-8BIT
Content-Transfer-Encoding: 8bit
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

> I heard from someone that old vacuum had been like so.
> Probably 2x disk space for big tables was a big disadvantage.

That's interesting.

>
> In addition,rename(),unlink(),mv aren't preferable for transaction
> control as far as I see. We couldn't avoid inconsistency using
> those OS functions.

I disagree.  Vacuum can't be rolled back anyway in the sense you can
bring back expire tuples, though I have no idea why you would want to.

You have an exclusive lock on the table.  Putting new heap/indexes in
place that match and have no expired tuples seems like it can not fail
in any situation.

Of course, the buffers of the old table have to be marked as invalid,
but with an exclusive lock, that is not a problem.  I am sure we do that
anyway<EFBFBD>in vacuum.

> We have to wait the change of relation file naming if copying
> vacuum is needed.
> Under the spec we need not rename(),mv etc.

Sorry, I don't agree, yet...

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

************

From Inoue@tpf.co.jp Tue Jan 18 21:05:23 2000
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA13858
	for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 22:05:21 -0500 (EST)
Received: from cadzone ([126.0.1.40] (may be forged))
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
   id MAA02870; Wed, 19 Jan 2000 12:04:55 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>
Cc: "pgsql-hackers" <pgsql-hackers@postgreSQL.org>
Subject: RE: [HACKERS] Index recreation in vacuum
Date: Wed, 19 Jan 2000 12:10:32 +0900
Message-ID: <000f01bf622a$bf423940$2801007e@tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Importance: Normal
In-Reply-To: <200001190245.VAA13040@candle.pha.pa.us>
Status: ORr

> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
> > >
> > > Maybe I am all wrong here.  Maybe most of the advantage of
> rename() are
> > > meaningless with reindex using during vacuum, which is the most
> > > important use of reindex.
> > >
> > > Let's look at index using during vacuum.  Right now, how does vacuum
> > > handle indexes when it moves a tuple?  Does it do each index update as
> > > it moves a tuple?  Is that why it is so slow?
> > >
> >
> > Yes,I believe so.  It's necessary to keep consistency between heap
> > table and indexes even in case of abort/crash.
> > As far as I see,it has been a big charge for vacuum.
>
> OK, how about making a copy of the heap table before starting vacuum,
> moving all the tuples in that copy, create new index, and then move the
> new heap and indexes over the old version.  We already have an exclusive
> lock on the table.  That would be 100% reliable, with the disadvantage
> of using 2x the disk space.  Seems like a big win.
>

I heard from someone that old vacuum had been like so.
Probably 2x disk space for big tables was a big disadvantage.

In addition,rename(),unlink(),mv aren't preferable for transaction
control as far as I see. We couldn't avoid inconsistency using
those OS functions.
We have to wait the change of relation file naming if copying
vacuum is needed.
Under the spec we need not rename(),mv etc.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


From dms@wplus.net Wed Jan 19 15:30:40 2000
Received: from relay.wplus.net (relay.wplus.net [195.131.52.179])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id QAA25919
	for <pgman@candle.pha.pa.us>; Wed, 19 Jan 2000 16:30:38 -0500 (EST)
X-Real-To: pgman@candle.pha.pa.us
Received: from wplus.net (ppdms.dialup.wplus.net [195.131.52.71])
	by relay.wplus.net (8.9.1/8.9.1/wplus.2) with ESMTP id AAA64218;
	Thu, 20 Jan 2000 00:26:37 +0300 (MSK)
Message-ID: <38862C9D.C2151E4E@wplus.net>
Date: Thu, 20 Jan 2000 00:29:01 +0300
From: Dmitry Samersoff <dms@wplus.net>
X-Mailer: Mozilla 4.61 [en] (WinNT; I)
X-Accept-Language: ru,en
MIME-Version: 1.0
To: Hiroshi Inoue <Inoue@tpf.co.jp>
CC: Bruce Momjian <pgman@candle.pha.pa.us>,
        pgsql-hackers <pgsql-hackers@postgreSQL.org>
Subject: Re: [HACKERS] Index recreation in vacuum
References: <000f01bf622a$bf423940$2801007e@tpf.co.jp>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Status: ORr

Hiroshi Inoue wrote:
> > > Yes,I believe so.  It's necessary to keep consistency between heap
> > > table and indexes even in case of abort/crash.
> > > As far as I see,it has been a big charge for vacuum.
> >
> > OK, how about making a copy of the heap table before starting vacuum,
> > moving all the tuples in that copy, create new index, and then move the
> > new heap and indexes over the old version.  We already have an exclusive
> > lock on the table.  That would be 100% reliable, with the disadvantage
> > of using 2x the disk space.  Seems like a big win.
> >
>
> I heard from someone that old vacuum had been like so.
> Probably 2x disk space for big tables was a big disadvantage.

Yes, It is critical.

How about sequence like this:

* Drop indices (keeping somewhere index descriptions)
* vacuuming table
* recreate indices

If something crash, user have been noticed
to re-run vacuum or recreate indices by hand
when system restarts.

I use script like described above for vacuuming
 - it really increase vacuum performance for large table.


--
Dmitry Samersoff, DM\S
dms@wplus.net http://devnull.wplus.net
* there will come soft rains

From dms@wplus.net Wed Jan 19 15:42:49 2000
Received: from relay.wplus.net (relay.wplus.net [195.131.52.179])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id QAA26645
	for <pgman@candle.pha.pa.us>; Wed, 19 Jan 2000 16:42:47 -0500 (EST)
X-Real-To: pgman@candle.pha.pa.us
Received: from wplus.net (ppdms.dialup.wplus.net [195.131.52.71])
	by relay.wplus.net (8.9.1/8.9.1/wplus.2) with ESMTP id AAA65264;
	Thu, 20 Jan 2000 00:39:02 +0300 (MSK)
Message-ID: <38862F86.20328BD3@wplus.net>
Date: Thu, 20 Jan 2000 00:41:26 +0300
From: Dmitry Samersoff <dms@wplus.net>
X-Mailer: Mozilla 4.61 [en] (WinNT; I)
X-Accept-Language: ru,en
MIME-Version: 1.0
To: Bruce Momjian <pgman@candle.pha.pa.us>
CC: Hiroshi Inoue <Inoue@tpf.co.jp>,
        pgsql-hackers <pgsql-hackers@postgreSQL.org>
Subject: Re: [HACKERS] Index recreation in vacuum
References: <200001192132.QAA26048@candle.pha.pa.us>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Status: ORr

Bruce Momjian wrote:
>
> We need two things:
>

>         auto-create index on startup

IMHO, It have to be controlled by user, because creating large index
can take a number of hours. Sometimes it's better to live without
indices
at all, and then build it by hand after workday end.


--
Dmitry Samersoff, DM\S
dms@wplus.net http://devnull.wplus.net
* there will come soft rains

From owner-pgsql-hackers@hub.org Thu Jan 20 23:51:34 2000
Received: from hub.org (hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA13891
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 00:51:31 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id AAA91784;
	Fri, 21 Jan 2000 00:47:07 -0500 (EST)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 00:45:38 -0500
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id AAA91495
	for pgsql-hackers-outgoing; Fri, 21 Jan 2000 00:44:40 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67])
	by hub.org (8.9.3/8.9.3) with ESMTP id AAA91378
	for <pgsql-hackers@postgreSQL.org>; Fri, 21 Jan 2000 00:44:04 -0500 (EST)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.9.0/8.9.0) id AAA13592;
	Fri, 21 Jan 2000 00:43:49 -0500 (EST)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-Id: <200001210543.AAA13592@candle.pha.pa.us>
Subject: [HACKERS] vacuum timings
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 21 Jan 2000 00:43:49 -0500 (EST)
CC: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
X-Mailer: ELM [version 2.4ME+ PL66 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER);  Table is
400MB and index is 160MB.

With index on the single in4 column, I got:
	 78 seconds for a vacuum
	121 seconds for vacuum after deleting a single row
	662 seconds for vacuum after deleting the entire table

With no index, I got:
	 43 seconds for a vacuum
	 43 seconds for vacuum after deleting a single row
	 43 seconds for vacuum after deleting the entire table

I find this quite interesting.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

************

From owner-pgsql-hackers@hub.org Fri Jan 21 00:34:56 2000
Received: from hub.org (hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA15559
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 01:34:55 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id BAA06108;
	Fri, 21 Jan 2000 01:32:23 -0500 (EST)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 01:30:38 -0500
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id BAA03704
	for pgsql-hackers-outgoing; Fri, 21 Jan 2000 01:27:53 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sunpine.krs.ru (SunPine.krs.ru [195.161.16.37])
	by hub.org (8.9.3/8.9.3) with ESMTP id BAA01710
	for <pgsql-hackers@postgreSQL.org>; Fri, 21 Jan 2000 01:26:44 -0500 (EST)
	(envelope-from vadim@krs.ru)
Received: from krs.ru (dune.krs.ru [195.161.16.38])
	by sunpine.krs.ru (8.8.8/8.8.8) with ESMTP id NAA01685;
	Fri, 21 Jan 2000 13:26:33 +0700 (KRS)
Message-ID: <3887FC19.80305217@krs.ru>
Date: Fri, 21 Jan 2000 13:26:33 +0700
From: Vadim Mikheev <vadim@krs.ru>
Organization: OJSC Rostelecom (Krasnoyarsk)
X-Mailer: Mozilla 4.5 [en] (X11; I; FreeBSD 3.0-RELEASE i386)
X-Accept-Language: ru, en
MIME-Version: 1.0
To: Bruce Momjian <pgman@candle.pha.pa.us>
CC: Tom Lane <tgl@sss.pgh.pa.us>,
        PostgreSQL-development <pgsql-hackers@postgreSQL.org>
Subject: Re: [HACKERS] vacuum timings
References: <200001210543.AAA13592@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

Bruce Momjian wrote:
>
> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER);  Table is
> 400MB and index is 160MB.
>
> With index on the single in4 column, I got:
>          78 seconds for a vacuum
>         121 seconds for vacuum after deleting a single row
>         662 seconds for vacuum after deleting the entire table
>
> With no index, I got:
>          43 seconds for a vacuum
>          43 seconds for vacuum after deleting a single row
>          43 seconds for vacuum after deleting the entire table

Wi/wo -F ?

Vadim

************

From vadim@krs.ru Fri Jan 21 00:26:33 2000
Received: from sunpine.krs.ru (SunPine.krs.ru [195.161.16.37])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA15239
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 01:26:31 -0500 (EST)
Received: from krs.ru (dune.krs.ru [195.161.16.38])
	by sunpine.krs.ru (8.8.8/8.8.8) with ESMTP id NAA01685;
	Fri, 21 Jan 2000 13:26:33 +0700 (KRS)
Sender: root@sunpine.krs.ru
Message-ID: <3887FC19.80305217@krs.ru>
Date: Fri, 21 Jan 2000 13:26:33 +0700
From: Vadim Mikheev <vadim@krs.ru>
Organization: OJSC Rostelecom (Krasnoyarsk)
X-Mailer: Mozilla 4.5 [en] (X11; I; FreeBSD 3.0-RELEASE i386)
X-Accept-Language: ru, en
MIME-Version: 1.0
To: Bruce Momjian <pgman@candle.pha.pa.us>
CC: Tom Lane <tgl@sss.pgh.pa.us>,
        PostgreSQL-development <pgsql-hackers@postgreSQL.org>
Subject: Re: [HACKERS] vacuum timings
References: <200001210543.AAA13592@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: ORr

Bruce Momjian wrote:
>
> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER);  Table is
> 400MB and index is 160MB.
>
> With index on the single in4 column, I got:
>          78 seconds for a vacuum
>         121 seconds for vacuum after deleting a single row
>         662 seconds for vacuum after deleting the entire table
>
> With no index, I got:
>          43 seconds for a vacuum
>          43 seconds for vacuum after deleting a single row
>          43 seconds for vacuum after deleting the entire table

Wi/wo -F ?

Vadim

From Inoue@tpf.co.jp Fri Jan 21 00:40:35 2000
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA15684
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 01:40:33 -0500 (EST)
Received: from cadzone ([126.0.1.40] (may be forged))
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
   id PAA04316; Fri, 21 Jan 2000 15:40:35 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>
Cc: "PostgreSQL-development" <pgsql-hackers@postgreSQL.org>,
        "Tom Lane" <tgl@sss.pgh.pa.us>
Subject: RE: [HACKERS] vacuum timings
Date: Fri, 21 Jan 2000 15:46:15 +0900
Message-ID: <000201bf63db$36cdae20$2801007e@tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
Importance: Normal
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
In-Reply-To: <200001210543.AAA13592@candle.pha.pa.us>
Status: OR

> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
>
> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER);  Table is
> 400MB and index is 160MB.
>
> With index on the single in4 column, I got:
> 	 78 seconds for a vacuum
		vc_vaconeind() is called once

> 	121 seconds for vacuum after deleting a single row
		vc_vaconeind() is called twice

Hmmm,vc_vaconeind() takes pretty long time even if it does little.

> 	662 seconds for vacuum after deleting the entire table
>

How about half of the rows deleted case ?
It would take longer time.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

From owner-pgsql-hackers@hub.org Fri Jan 21 12:00:49 2000
Received: from hub.org (hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA13329
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 13:00:47 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id MAA96106;
	Fri, 21 Jan 2000 12:55:34 -0500 (EST)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 12:53:53 -0500
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id MAA95775
	for pgsql-hackers-outgoing; Fri, 21 Jan 2000 12:52:54 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from candle.pha.pa.us (root@s5-03.ppp.op.net [209.152.195.67])
	by hub.org (8.9.3/8.9.3) with ESMTP id MAA95720
	for <pgsql-hackers@postgreSQL.org>; Fri, 21 Jan 2000 12:52:39 -0500 (EST)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.9.0/8.9.0) id MAA12106;
	Fri, 21 Jan 2000 12:51:53 -0500 (EST)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-Id: <200001211751.MAA12106@candle.pha.pa.us>
Subject: [HACKERS] Re: vacuum timings
In-Reply-To: <3641.948433911@sss.pgh.pa.us> from Tom Lane at "Jan 21, 2000 00:51:51
	am"
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 21 Jan 2000 12:51:53 -0500 (EST)
CC: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
X-Mailer: ELM [version 2.4ME+ PL66 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER);  Table is
> > 400MB and index is 160MB.
>
> > With index on the single in4 column, I got:
> > 	 78 seconds for a vacuum
> > 	121 seconds for vacuum after deleting a single row
> > 	662 seconds for vacuum after deleting the entire table
>
> > With no index, I got:
> > 	 43 seconds for a vacuum
> > 	 43 seconds for vacuum after deleting a single row
> > 	 43 seconds for vacuum after deleting the entire table
>
> > I find this quite interesting.
>
> How long does it take to create the index on your setup --- ie,
> if vacuum did a drop/create index, would it be competitive?

OK, new timings with -F enabled:

	index	no index
	519	same	load
	247	"	first vacuum
	40	"	other vacuums

	1222	X	index creation
	90	X	first vacuum
	80	X	other vacuums

	<1	90	delete one row
	121	38	vacuum after delete 1 row

	346	344	delete all rows
	440	44	first vacuum
	20	<1	other vacuums(index is still same size)

Conclusions:

	o  indexes never get smaller
	o  drop/recreate index is slower than vacuum of indexes

What other conclusions can be made?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

************

From scrappy@hub.org Fri Jan 21 12:45:38 2000
Received: from thelab.hub.org (nat200.60.mpoweredpc.net [142.177.200.60])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA14380
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 13:45:29 -0500 (EST)
Received: from localhost (scrappy@localhost)
	by thelab.hub.org (8.9.3/8.9.1) with ESMTP id OAA68289;
	Fri, 21 Jan 2000 14:45:35 -0400 (AST)
	(envelope-from scrappy@hub.org)
X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs
Date: Fri, 21 Jan 2000 14:45:34 -0400 (AST)
From: The Hermit Hacker <scrappy@hub.org>
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: Tom Lane <tgl@sss.pgh.pa.us>,
        PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Re: vacuum timings
In-Reply-To: <200001211751.MAA12106@candle.pha.pa.us>
Message-ID: <Pine.BSF.4.21.0001211443480.23487-100000@thelab.hub.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR

On Fri, 21 Jan 2000, Bruce Momjian wrote:

> OK, new timings with -F enabled:
>
> 	index	no index
> 	519	same	load
> 	247	"	first vacuum
> 	40	"	other vacuums
>
> 	1222	X	index creation
> 	90	X	first vacuum
> 	80	X	other vacuums
>
> 	<1	90	delete one row
> 	121	38	vacuum after delete 1 row
>
> 	346	344	delete all rows
> 	440	44	first vacuum
> 	20	<1	other vacuums(index is still same size)
>
> Conclusions:
>
> 	o  indexes never get smaller

this one, I thought, was a known?  if I remember right, Vadim changed it
so that space was reused, but index never shrunk in size ... no?

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


From tgl@sss.pgh.pa.us Fri Jan 21 13:06:35 2000
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA14618
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 14:06:33 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id OAA16501;
	Fri, 21 Jan 2000 14:06:31 -0500 (EST)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
Subject: Re: vacuum timings
In-reply-to: <200001211751.MAA12106@candle.pha.pa.us>
References: <200001211751.MAA12106@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
	message dated "Fri, 21 Jan 2000 12:51:53 -0500"
Date: Fri, 21 Jan 2000 14:06:31 -0500
Message-ID: <16498.948481591@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Conclusions:
> 	o  indexes never get smaller

Which we knew...

> 	o  drop/recreate index is slower than vacuum of indexes

Quite a few people have reported finding the opposite in practice.
You should probably try vacuuming after deleting or updating some
fraction of the rows, rather than just the all or none cases.

			regards, tom lane

From dms@wplus.net Fri Jan 21 13:51:27 2000
Received: from relay.wplus.net (relay.wplus.net [195.131.52.179])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA15623
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 14:51:24 -0500 (EST)
X-Real-To: pgman@candle.pha.pa.us
Received: from wplus.net (ppdms.dialup.wplus.net [195.131.52.71])
	by relay.wplus.net (8.9.1/8.9.1/wplus.2) with ESMTP id WAA89451;
	Fri, 21 Jan 2000 22:46:19 +0300 (MSK)
Message-ID: <3888B822.28F79A1F@wplus.net>
Date: Fri, 21 Jan 2000 22:48:50 +0300
From: Dmitry Samersoff <dms@wplus.net>
X-Mailer: Mozilla 4.7 [en] (WinNT; I)
X-Accept-Language: ru,en
MIME-Version: 1.0
To: Tom Lane <tgl@sss.pgh.pa.us>
CC: Bruce Momjian <pgman@candle.pha.pa.us>,
        PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Re: vacuum timings
References: <200001211751.MAA12106@candle.pha.pa.us> <16498.948481591@sss.pgh.pa.us>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Status: ORr

Tom Lane wrote:
>
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Conclusions:
> >       o  indexes never get smaller
>
> Which we knew...
>
> >       o  drop/recreate index is slower than vacuum of indexes
>
> Quite a few people have reported finding the opposite in practice.

I'm one of them. On 1,5 GB table with three indices it about twice
slowly.
Probably becouse vacuuming indices brakes system cache policy.
(FreeBSD 3.3)


--
Dmitry Samersoff, DM\S
dms@wplus.net http://devnull.wplus.net
* there will come soft rains

From owner-pgsql-hackers@hub.org Fri Jan 21 14:04:08 2000
Received: from hub.org (hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA16140
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 15:04:06 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id OAA34808;
	Fri, 21 Jan 2000 14:59:30 -0500 (EST)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 14:57:48 -0500
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id OAA34320
	for pgsql-hackers-outgoing; Fri, 21 Jan 2000 14:56:50 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67])
	by hub.org (8.9.3/8.9.3) with ESMTP id OAA34255
	for <pgsql-hackers@postgresql.org>; Fri, 21 Jan 2000 14:56:18 -0500 (EST)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.9.0/8.9.0) id OAA15772;
	Fri, 21 Jan 2000 14:54:22 -0500 (EST)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-Id: <200001211954.OAA15772@candle.pha.pa.us>
Subject: Re: [HACKERS] Re: vacuum timings
In-Reply-To: <3888B822.28F79A1F@wplus.net> from Dmitry Samersoff at "Jan 21,
	2000 10:48:50 pm"
To: Dmitry Samersoff <dms@wplus.net>
Date: Fri, 21 Jan 2000 14:54:21 -0500 (EST)
CC: Tom Lane <tgl@sss.pgh.pa.us>,
        PostgreSQL-development <pgsql-hackers@postgreSQL.org>
X-Mailer: ELM [version 2.4ME+ PL66 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@postgreSQL.org
Status: OR

[Charset koi8-r unsupported, filtering to ASCII...]
> Tom Lane wrote:
> >
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Conclusions:
> > >       o  indexes never get smaller
> >
> > Which we knew...
> >
> > >       o  drop/recreate index is slower than vacuum of indexes
> >
> > Quite a few people have reported finding the opposite in practice.
>
> I'm one of them. On 1,5 GB table with three indices it about twice
> slowly.
> Probably becouse vacuuming indices brakes system cache policy.
> (FreeBSD 3.3)

OK, we are researching what things can be done to improve this.  We are
toying with:

	lock table for less duration, or read lock
	creating another copy of heap/indexes, and rename() over old files
	improving heap vacuum speed
	improving index vacuum speed
	moving analyze out of vacuum


--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

************

From scrappy@hub.org Fri Jan 21 14:12:16 2000
Received: from thelab.hub.org (nat200.60.mpoweredpc.net [142.177.200.60])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA16521
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 15:12:13 -0500 (EST)
Received: from localhost (scrappy@localhost)
	by thelab.hub.org (8.9.3/8.9.1) with ESMTP id QAA69039;
	Fri, 21 Jan 2000 16:12:25 -0400 (AST)
	(envelope-from scrappy@hub.org)
X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs
Date: Fri, 21 Jan 2000 16:12:25 -0400 (AST)
From: The Hermit Hacker <scrappy@hub.org>
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: Dmitry Samersoff <dms@wplus.net>, Tom Lane <tgl@sss.pgh.pa.us>,
        PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Re: vacuum timings
In-Reply-To: <200001211954.OAA15772@candle.pha.pa.us>
Message-ID: <Pine.BSF.4.21.0001211607080.23487-100000@thelab.hub.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR

On Fri, 21 Jan 2000, Bruce Momjian wrote:

> [Charset koi8-r unsupported, filtering to ASCII...]
> > Tom Lane wrote:
> > >
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > Conclusions:
> > > >       o  indexes never get smaller
> > >
> > > Which we knew...
> > >
> > > >       o  drop/recreate index is slower than vacuum of indexes
> > >
> > > Quite a few people have reported finding the opposite in practice.
> >
> > I'm one of them. On 1,5 GB table with three indices it about twice
> > slowly.
> > Probably becouse vacuuming indices brakes system cache policy.
> > (FreeBSD 3.3)
>
> OK, we are researching what things can be done to improve this.  We are
> toying with:
>
> 	lock table for less duration, or read lock

if there is some way that we can work around the bug that I believe Tom
found with removing the lock altogether (ie. makig use of MVCC), I think
that would be the best option ... if not possible, at least get things
down to a table lock vs the whole database?

a good example is the udmsearch that we are using on the site ... it uses
multiple tables to store the dictionary, each representing words of X size
... if I'm searching on a 4 letter word, and the whole database is locked
while it is working on the dictionary with 8 letter words, I'm sitting
there idle ... at least if we only locked the 8 letter table, everyone not
doing 8 letter searches can go on their merry way ...

Slightly longer vacuum's, IMHO, are acceptable if, to the end users, its
as transparent as possible ... locking per table would be slightly slower,
I think, because once a table is finished, the next table would need to
have an exclusive lock put on it before starting, so you'd have to
possibly wait for that...?

> 	creating another copy of heap/indexes, and rename() over old files

sounds to me like introducing a large potential for error here ...

> 	moving analyze out of vacuum

I think that should be done anyway ... if we ever get to the point that
we're able to re-use rows in tables, then that would eliminate the
immediate requirement for vacuum, but still retain a requirement for a
periodic analyze ... no?

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


From tgl@sss.pgh.pa.us Fri Jan 21 16:02:07 2000
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA20290
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 17:02:06 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id RAA09697;
	Fri, 21 Jan 2000 17:02:06 -0500 (EST)
To: The Hermit Hacker <scrappy@hub.org>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
        PostgreSQL-development <pgsql-hackers@postgreSQL.org>
Subject: Re: [HACKERS] Re: vacuum timings
In-reply-to: <Pine.BSF.4.21.0001211607080.23487-100000@thelab.hub.org>
References: <Pine.BSF.4.21.0001211607080.23487-100000@thelab.hub.org>
Comments: In-reply-to The Hermit Hacker <scrappy@hub.org>
	message dated "Fri, 21 Jan 2000 16:12:25 -0400"
Date: Fri, 21 Jan 2000 17:02:06 -0500
Message-ID: <9694.948492126@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR

The Hermit Hacker <scrappy@hub.org> writes:
>> lock table for less duration, or read lock

> if there is some way that we can work around the bug that I believe Tom
> found with removing the lock altogether (ie. makig use of MVCC), I think
> that would be the best option ... if not possible, at least get things
> down to a table lock vs the whole database?

Huh?  VACUUM only requires an exclusive lock on the table it is
currently vacuuming; there's no database-wide lock.

Even a single-table exclusive lock is bad, of course, if it's a large
table that's critical to a 24x7 application.  Bruce was talking about
the possibility of having VACUUM get just a write lock on the table;
other backends could still read it, but not write it, during the vacuum
process.  That'd be a considerable step forward for 24x7 applications,
I think.

It looks like that could be done if we rewrote the table as a new file
(instead of compacting-in-place), but there's a problem when it comes
time to rename the new files into place.  At that point you'd need to
get an exclusive lock to ensure all the readers are out of the table too
--- and upgrading from a plain lock to an exclusive lock is a well-known
recipe for deadlocks.  Not sure if this can be solved.

			regards, tom lane

From tgl@sss.pgh.pa.us Fri Jan 21 22:50:34 2000
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id XAA01657
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 23:50:28 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id XAA19681;
	Fri, 21 Jan 2000 23:50:13 -0500 (EST)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
Subject: Re: vacuum timings
In-reply-to: <200001211751.MAA12106@candle.pha.pa.us>
References: <200001211751.MAA12106@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
	message dated "Fri, 21 Jan 2000 12:51:53 -0500"
Date: Fri, 21 Jan 2000 23:50:13 -0500
Message-ID: <19678.948516613@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: ORr

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Conclusions:
> 	o  drop/recreate index is slower than vacuum of indexes

BTW, I did some profiling of CREATE INDEX this evening (quite
unintentionally actually; I was interested in COPY IN, but the pg_dump
script I used as driver happened to create some indexes too).  I was
startled to discover that 60% of the runtime of CREATE INDEX is spent in
_bt_invokestrat (which is called from tuplesort.c's comparetup_index,
and exists only to figure out which specific comparison routine to call).
Of this, a whopping 4% was spent in the useful subroutine, int4gt.  All
the rest went into lookup and validation checks that by rights should be
done once per index creation, not once per comparison.

In short: a fairly straightforward bit of optimization will eliminate
circa 50% of the CPU time consumed by CREATE INDEX.  All we need is to
figure out where to cache the lookup results.  The optimization would
improve insertions and lookups in indexes, as well, if we can cache
the lookup results in those scenarios.

This was for a table small enough that tuplesort.c could do the sort
entirely in memory, so I'm sure the gains would be smaller for a large
table that requires a disk-based sort.  Still, it seems worth looking
into...

			regards, tom lane

From owner-pgsql-hackers@hub.org Sat Jan 22 02:31:03 2000
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA06743
	for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 03:31:02 -0500 (EST)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.5 $) with ESMTP id DAA07529 for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 03:25:13 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id DAA31900;
	Sat, 22 Jan 2000 03:19:53 -0500 (EST)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Sat, 22 Jan 2000 03:17:56 -0500
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id DAA31715
	for pgsql-hackers-outgoing; Sat, 22 Jan 2000 03:16:58 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by hub.org (8.9.3/8.9.3) with ESMTP id DAA31647
	for <pgsql-hackers@postgresql.org>; Sat, 22 Jan 2000 03:16:26 -0500 (EST)
	(envelope-from Inoue@tpf.co.jp)
Received: from mcadnote1 (ppm114.noc.fukui.nsk.ne.jp [210.161.188.33])
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
   id RAA04754; Sat, 22 Jan 2000 17:14:43 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Tom Lane" <tgl@sss.pgh.pa.us>, "Bruce Momjian" <pgman@candle.pha.pa.us>
Cc: "PostgreSQL-development" <pgsql-hackers@postgresql.org>
Subject: RE: [HACKERS] Re: vacuum timings
Date: Sat, 22 Jan 2000 17:15:37 +0900
Message-ID: <NDBBIJLOILGIKBGDINDFIEEACCAA.Inoue@tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-2022-jp"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
In-Reply-To: <16498.948481591@sss.pgh.pa.us>
Importance: Normal
Sender: owner-pgsql-hackers@postgresql.org
Status: OR

> -----Original Message-----
> From: owner-pgsql-hackers@postgresql.org
> [mailto:owner-pgsql-hackers@postgresql.org]On Behalf Of Tom Lane
>
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Conclusions:
> > 	o  indexes never get smaller
>
> Which we knew...
>
> > 	o  drop/recreate index is slower than vacuum of indexes
>
> Quite a few people have reported finding the opposite in practice.
> You should probably try vacuuming after deleting or updating some
> fraction of the rows, rather than just the all or none cases.
>

Vacuum after delelting all rows isn't a worst case.
There's no moving in that case and vacuum doesn't need to call
index_insert() corresponding to the moving of heap tuples.

Vacuum after deleting half of rows may be one of the worst case.
In this case,index_delete() is called as many times as 'delete all'
case and expensive index_insert() is called for moved_in tuples.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

************

From tgl@sss.pgh.pa.us Sat Jan 22 10:31:02 2000
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA20882
	for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 11:31:00 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) by renoir.op.net (o1/$Revision: 1.5 $) with ESMTP id LAA26612 for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 11:12:44 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id LAA20569;
	Sat, 22 Jan 2000 11:11:26 -0500 (EST)
To: "Hiroshi Inoue" <Inoue@tpf.co.jp>
cc: "Bruce Momjian" <pgman@candle.pha.pa.us>,
        "PostgreSQL-development" <pgsql-hackers@postgreSQL.org>
Subject: Re: [HACKERS] Re: vacuum timings
In-reply-to: <NDBBIJLOILGIKBGDINDFIEEACCAA.Inoue@tpf.co.jp>
References: <NDBBIJLOILGIKBGDINDFIEEACCAA.Inoue@tpf.co.jp>
Comments: In-reply-to "Hiroshi Inoue" <Inoue@tpf.co.jp>
	message dated "Sat, 22 Jan 2000 17:15:37 +0900"
Date: Sat, 22 Jan 2000 11:11:25 -0500
Message-ID: <20566.948557485@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> Vacuum after deleting half of rows may be one of the worst case.

Or equivalently, vacuum after updating all the rows.

			regards, tom lane

From tgl@sss.pgh.pa.us Thu Jan 20 23:51:49 2000
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA13919
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 00:51:47 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id AAA03644;
	Fri, 21 Jan 2000 00:51:51 -0500 (EST)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
Subject: Re: vacuum timings
In-reply-to: <200001210543.AAA13592@candle.pha.pa.us>
References: <200001210543.AAA13592@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
	message dated "Fri, 21 Jan 2000 00:43:49 -0500"
Date: Fri, 21 Jan 2000 00:51:51 -0500
Message-ID: <3641.948433911@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: ORr

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER);  Table is
> 400MB and index is 160MB.

> With index on the single in4 column, I got:
> 	 78 seconds for a vacuum
> 	121 seconds for vacuum after deleting a single row
> 	662 seconds for vacuum after deleting the entire table

> With no index, I got:
> 	 43 seconds for a vacuum
> 	 43 seconds for vacuum after deleting a single row
> 	 43 seconds for vacuum after deleting the entire table

> I find this quite interesting.

How long does it take to create the index on your setup --- ie,
if vacuum did a drop/create index, would it be competitive?

			regards, tom lane

From pgsql-hackers-owner+M5909@hub.org Thu Aug 17 20:15:33 2000
Received: from hub.org (root@hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA00644
	for <pgman@candle.pha.pa.us>; Thu, 17 Aug 2000 20:15:32 -0400 (EDT)
Received: from hub.org (majordom@localhost [127.0.0.1])
	by hub.org (8.10.1/8.10.1) with SMTP id e7I0APm69660;
	Thu, 17 Aug 2000 20:10:25 -0400 (EDT)
Received: from fw.wintelcom.net (bright@ns1.wintelcom.net [209.1.153.20])
	by hub.org (8.10.1/8.10.1) with ESMTP id e7I01Jm68072
	for <pgsql-hackers@postgresql.org>; Thu, 17 Aug 2000 20:01:19 -0400 (EDT)
Received: (from bright@localhost)
	by fw.wintelcom.net (8.10.0/8.10.0) id e7I01IA20820
	for pgsql-hackers@postgresql.org; Thu, 17 Aug 2000 17:01:18 -0700 (PDT)
Date: Thu, 17 Aug 2000 17:01:18 -0700
From: Alfred Perlstein <bright@wintelcom.net>
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] VACUUM optimization ideas.
Message-ID: <20000817170118.K4854@fw.wintelcom.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.4i
X-Mailing-List: pgsql-hackers@postgresql.org
Precedence: bulk
Sender: pgsql-hackers-owner@hub.org
Status: ORr

Here's two ideas I had for optimizing vacuum, I apologize in advance
if the ideas presented here are niave and don't take into account
the actual code that makes up postgresql.

================

#1

Reducing the time vacuum must hold an exlusive lock on a table:

The idea is that since rows are marked deleted it's ok for the
vacuum to fill them with data from the tail of the table as
long as no transaction is in progress that has started before
the row was deleted.

This may allow the vacuum process to copyback all the data without
a lock, when all the copying is done it then aquires an exlusive lock
and does this:

Aquire an exclusive lock.
Walk all the deleted data marking it as current.
Truncate the table.
Release the lock.

Since the data is still marked invalid (right?) even if valid data
is copied into the space it should be ignored as long as there's no
transaction occurring that started before the data was invalidated.

================

#2

Reducing the amount of scanning a vaccum must do:

It would make sense that if a value of the earliest deleted chunk
was kept in a table then vacuum would not have to scan the entire
table in order to work, it would only need to start at the 'earliest'
invalidated row.

The utility of this (at least for us) is that we have several tables
that will grow to hundreds of megabytes, however changes will only
happen at the tail end (recently added rows).  If we could reduce the
amount of time spent in a vacuum state it would help us a lot.

================

I'm wondering if these ideas make sense and may help at all.

thanks,
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]

From pgsql-hackers-owner+M5912@hub.org Fri Aug 18 01:36:14 2000
Received: from hub.org (root@hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA07787
	for <pgman@candle.pha.pa.us>; Fri, 18 Aug 2000 01:36:12 -0400 (EDT)
Received: from hub.org (majordom@localhost [127.0.0.1])
	by hub.org (8.10.1/8.10.1) with SMTP id e7I5Q2m38759;
	Fri, 18 Aug 2000 01:26:04 -0400 (EDT)
Received: from courier02.adinet.com.uy (courier02.adinet.com.uy [206.99.44.245])
	by hub.org (8.10.1/8.10.1) with ESMTP id e7I5Bam35785
	for <pgsql-hackers@postgresql.org>; Fri, 18 Aug 2000 01:11:37 -0400 (EDT)
Received: from adinet.com.uy (haroldo@r207-50-240-116.adinet.com.uy [207.50.240.116])
	by courier02.adinet.com.uy (8.9.3/8.9.3) with ESMTP id CAA17259;
	Fri, 18 Aug 2000 02:10:49 -0300 (GMT)
Message-ID: <399CC739.B9B13D18@adinet.com.uy>
Date: Fri, 18 Aug 2000 02:18:49 -0300
From: hstenger@adinet.com.uy
Reply-To: hstenger@ieee.org
Organization: PRISMA, Servicio y Desarrollo
X-Mailer: Mozilla 4.72 [en] (X11; I; Linux 2.2.14 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: Alfred Perlstein <bright@wintelcom.net>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] VACUUM optimization ideas.
References: <20000817170118.K4854@fw.wintelcom.net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailing-List: pgsql-hackers@postgresql.org
Precedence: bulk
Sender: pgsql-hackers-owner@hub.org
Status: ORr

Alfred Perlstein wrote:
> #1
>
> Reducing the time vacuum must hold an exlusive lock on a table:
>
> The idea is that since rows are marked deleted it's ok for the
> vacuum to fill them with data from the tail of the table as
> long as no transaction is in progress that has started before
> the row was deleted.
>
> This may allow the vacuum process to copyback all the data without
> a lock, when all the copying is done it then aquires an exlusive lock
> and does this:
>
> Aquire an exclusive lock.
> Walk all the deleted data marking it as current.
> Truncate the table.
> Release the lock.
>
> Since the data is still marked invalid (right?) even if valid data
> is copied into the space it should be ignored as long as there's no
> transaction occurring that started before the data was invalidated.

Yes, but nothing prevents newer transactions from modifying the _origin_ side of
the copied data _after_ it was copied, but before the Lock-Walk-Truncate-Unlock
cycle takes place, and so it seems unsafe. Maybe locking each record before
copying it up ...

Regards,
Haroldo.

--
----------------------+------------------------
 Haroldo Stenger      | hstenger@ieee.org
 Montevideo, Uruguay. | hstenger@adinet.com.uy
----------------------+------------------------
 Visit UYLUG Web Site: http://www.linux.org.uy
-----------------------------------------------

From pgsql-hackers-owner+M5917@hub.org Fri Aug 18 09:41:33 2000
Received: from hub.org (root@hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA05170
	for <pgman@candle.pha.pa.us>; Fri, 18 Aug 2000 09:41:33 -0400 (EDT)
Received: from hub.org (majordom@localhost [127.0.0.1])
	by hub.org (8.10.1/8.10.1) with SMTP id e7IDVjm75143;
	Fri, 18 Aug 2000 09:31:46 -0400 (EDT)
Received: from andie.ip23.net (andie.ip23.net [212.83.32.23])
	by hub.org (8.10.1/8.10.1) with ESMTP id e7IDPIm73296
	for <pgsql-hackers@postgresql.org>; Fri, 18 Aug 2000 09:25:18 -0400 (EDT)
Received: from imap1.ip23.net (imap1.ip23.net [212.83.32.35])
	by andie.ip23.net (8.9.3/8.9.3) with ESMTP id PAA58387;
	Fri, 18 Aug 2000 15:25:12 +0200 (CEST)
Received: from ip23.net (spc.ip23.net [212.83.32.122])
	by imap1.ip23.net (8.9.3/8.9.3) with ESMTP id PAA59177;
	Fri, 18 Aug 2000 15:41:28 +0200 (CEST)
Message-ID: <399D3938.582FDB49@ip23.net>
Date: Fri, 18 Aug 2000 15:25:12 +0200
From: Sevo Stille <sevo@ip23.net>
Organization: IP23
X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.2.10 i686)
X-Accept-Language: en, de
MIME-Version: 1.0
To: Alfred Perlstein <bright@wintelcom.net>
CC: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] VACUUM optimization ideas.
References: <20000817170118.K4854@fw.wintelcom.net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailing-List: pgsql-hackers@postgresql.org
Precedence: bulk
Sender: pgsql-hackers-owner@hub.org
Status: OR

Alfred Perlstein wrote:

> The idea is that since rows are marked deleted it's ok for the
> vacuum to fill them with data from the tail of the table as
> long as no transaction is in progress that has started before
> the row was deleted.

Well, isn't one of the advantages of vacuuming in the reordering it
does? With a "fill deleted chunks" logic, we'd have far less order in
the databases.

> This may allow the vacuum process to copyback all the data without
> a lock,

Nope. Another process might update the values in between move and mark,
if the record is not locked. We'd either have to write-lock the entire
table for that period, write lock every item as it is moved, or lock,
move and mark on a per-record base. The latter would be slow, but it
could be done in a permanent low priority background process, utilizing
empty CPU cycles. Besides, it probably could not only be done simply
filling from the tail, but also moving up the records in a sorted
fashion.

> #2
>
> Reducing the amount of scanning a vaccum must do:
>
> It would make sense that if a value of the earliest deleted chunk
> was kept in a table then vacuum would not have to scan the entire
> table in order to work, it would only need to start at the 'earliest'
> invalidated row.

Trivial to do. But of course #1 may imply that the physical ordering is
even less likely to be related to the logical ordering in a way where
this helps.

> The utility of this (at least for us) is that we have several tables
> that will grow to hundreds of megabytes, however changes will only
> happen at the tail end (recently added rows).

The tail is a relative position - except for the case where you add
temporary records to a constant default set, everything in the tail will
move, at least relatively, to the head after some time.

> If we could reduce the
> amount of time spent in a vacuum state it would help us a lot.

Rather: If we can reduce the time spent in a locked state while
vacuuming, it would help a lot. Being in a vacuum is not the issue -
even permanent vacuuming need not be an issue, if the locks it uses are
suitably  short-time.

Sevo

--
sevo@ip23.net

From pgsql-hackers-owner+M5911@hub.org Thu Aug 17 21:11:20 2000
Received: from hub.org (root@hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA01882
	for <pgman@candle.pha.pa.us>; Thu, 17 Aug 2000 21:11:20 -0400 (EDT)
Received: from hub.org (majordom@localhost [127.0.0.1])
	by hub.org (8.10.1/8.10.1) with SMTP id e7I119m80626;
	Thu, 17 Aug 2000 21:01:09 -0400 (EDT)
Received: from acheron.rime.com.au (root@albatr.lnk.telstra.net [139.130.54.222])
	by hub.org (8.10.1/8.10.1) with ESMTP id e7I0wMm79870
	for <pgsql-hackers@postgresql.org>; Thu, 17 Aug 2000 20:58:22 -0400 (EDT)
Received: from oberon (Oberon.rime.com.au [203.8.195.100])
	by acheron.rime.com.au (8.9.3/8.9.3) with SMTP id KAA03215;
	Fri, 18 Aug 2000 10:58:25 +1000
Message-Id: <3.0.5.32.20000818105835.0280ade0@mail.rhyme.com.au>
X-Sender: pjw@mail.rhyme.com.au
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Fri, 18 Aug 2000 10:58:35 +1000
To: Chris Bitmead <chrisb@nimrod.itg.telstra.com.au>,
        Ben Adida <ben@openforce.net>
From: Philip Warner <pjw@rhyme.com.au>
Subject: Re: [HACKERS] Inserting a select statement result into another
  table
Cc: Andrew Selle <aselle@upl.cs.wisc.edu>, pgsql-hackers@postgresql.org
In-Reply-To: <399C7689.2DDDAD1D@nimrod.itg.telecom.com.au>
References: <20000817130517.A10909@upl.cs.wisc.edu>
	<399BF555.43FB70C8@openforce.net>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-Mailing-List: pgsql-hackers@postgresql.org
Precedence: bulk
Sender: pgsql-hackers-owner@hub.org
Status: O

At 09:34 18/08/00 +1000, Chris Bitmead wrote:
>
>He does ask a legitimate question though. If you are going to have a
>LIMIT feature (which of course is not pure SQL), there seems no reason
>you shouldn't be able to insert the result into a table.

This feature is supported by two commercial DBs: Dec/RDB and SQL/Server. I
have no idea if Oracle supports it, but it is such a *useful* feature that
I would be very surprised if it didn't.


>Ben Adida wrote:
>>
>> What is the purpose you're trying to accomplish with this order by? No
matter what, all the
>> rows where done='f' will be inserted, and you will not be left with any
indication of that
>> order once the rows are in the todolist table.

I don't know what his *purpose* was, but the query should only insert the
first two rows from the select bacause of the limit).

>> Andrew Selle wrote:
>>
>> > Alright.  My situation is this.  I have a list of things that need to
be done
>> > in a table called tasks.  I have a list of users who will complete
these tasks.
>> > I want these users to be able to come in and "claim" the top 2 most
recent tasks
>> > that have been added.  These tasks then get stored in a table called
todolist
>> > which stores who claimed the task, the taskid, and when the task was
claimed.
>> > For each time someone wants to claim some number of tasks, I want to
do something
>> > like
>> >
>> > INSERT INTO todolist
>> >         SELECT taskid,'1',now()
>> >         FROM tasks
>> >         WHERE done='f'
>> >         ORDER BY submit DESC
>> >         LIMIT 2;

----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

From pgsql-hackers-owner+M8931@postgresql.org Thu May 17 19:14:23 2001
Return-path: <pgsql-hackers-owner+M8931@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4HNEMd04329
	for <pgman@candle.pha.pa.us>; Thu, 17 May 2001 19:14:22 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4HNBbA24259;
	Thu, 17 May 2001 19:11:37 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M8931@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4HN5SA22678
	for <pgsql-hackers@postgreSQL.org>; Thu, 17 May 2001 19:05:28 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4HN5OR12836
	for <pgsql-hackers@postgreSQL.org>; Thu, 17 May 2001 19:05:24 -0400 (EDT)
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Plans for solving the VACUUM problem
Date: Thu, 17 May 2001 19:05:24 -0400
Message-ID: <12833.990140724@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr

I have been thinking about the problem of VACUUM and how we might fix it
for 7.2.  Vadim has suggested that we should attack this by implementing
an overwriting storage manager and transaction UNDO, but I'm not totally
comfortable with that approach: it seems to me that it's an awfully large
change in the way Postgres works.  Instead, here is a sketch of an attack
that I think fits better into the existing system structure.

First point: I don't think we need to get rid of VACUUM, exactly.  What
we want for 24x7 operation is to be able to do whatever housekeeping we
need without locking out normal transaction processing for long intervals.
We could live with routine VACUUMs if they could run in parallel with
reads and writes of the table being vacuumed.  They don't even have to run
in parallel with schema updates of the target table (CREATE/DROP INDEX,
ALTER TABLE, etc).  Schema updates aren't things you do lightly for big
tables anyhow.  So what we want is more of a "background VACUUM" than a
"no VACUUM" solution.

Second: if VACUUM can run in the background, then there's no reason not
to run it fairly frequently.  In fact, it could become an automatically
scheduled activity like CHECKPOINT is now, or perhaps even a continuously
running daemon (which was the original conception of it at Berkeley, BTW).
This is important because it means that VACUUM doesn't have to be perfect.
The existing VACUUM code goes to huge lengths to ensure that it compacts
the table as much as possible.  We don't need that; if we miss some free
space this time around, but we can expect to get it the next time (or
eventually), we can be happy.  This leads to thinking of space management
in terms of steady-state behavior, rather than the periodic "big bang"
approach that VACUUM represents now.

But having said that, there's no reason to remove the existing VACUUM
code: we can keep it around for situations where you need to crunch a
table as much as possible and you can afford to lock the table while
you do it.  The new code would be a new command, maybe "VACUUM LAZY"
(or some other name entirely).

Enough handwaving, what about specifics?

1. Forget moving tuples from one page to another.  Doing that in a
transaction-safe way is hugely expensive and complicated.  Lazy VACUUM
will only delete dead tuples and coalesce the free space thus made
available within each page of a relation.

2. This does no good unless there's a provision to re-use that free space.
To do that, I propose a free space map (FSM) kept in shared memory, which
will tell backends which pages of a relation have free space.  Only if the
FSM shows no free space available will the relation be extended to insert
a new or updated tuple.

3. Lazy VACUUM processes a table in five stages:
   A. Scan relation looking for dead tuples; accumulate a list of their
      TIDs, as well as info about existing free space.  (This pass is
      completely read-only and so incurs no WAL traffic.)
   B. Remove index entries for the dead tuples.  (See below for details.)
   C. Physically delete dead tuples and compact free space on their pages.
   D. Truncate any completely-empty pages at relation's end.  (Optional,
      see below.)
   E. Create/update FSM entry for the table.
Note that this is crash-safe as long as the individual update operations
are atomic (which can be guaranteed by WAL entries for them).  If a tuple
is dead, we care not whether its index entries are still around or not;
so there's no risk to logical consistency.

4. Observe that lazy VACUUM need not really be a transaction at all, since
there's nothing it does that needs to be cancelled or undone if it is
aborted.  This means that its WAL entries do not have to hang around past
the next checkpoint, which solves the huge-WAL-space-usage problem that
people have noticed while VACUUMing large tables under 7.1.

5. Also note that there's nothing saying that lazy VACUUM must do the
entire table in one go; once it's accumulated a big enough batch of dead
tuples, it can proceed through steps B,C,D,E even though it's not scanned
the whole table.  This avoids a rather nasty problem that VACUUM has
always had with running out of memory on huge tables.


Free space map details
----------------------

I envision the FSM as a shared hash table keyed by table ID, with each
entry containing a list of page numbers and free space in each such page.

The FSM is empty at system startup and is filled by lazy VACUUM as it
processes each table.  Backends then decrement/remove page entries as they
use free space.

Critical point: the FSM is only a hint and does not have to be perfectly
accurate.  It can omit space that's actually available without harm, and
if it claims there's more space available on a page than there actually
is, we haven't lost much except a wasted ReadBuffer cycle.  This allows
us to take shortcuts in maintaining it.  In particular, we can constrain
the FSM to a prespecified size, which is critical for keeping it in shared
memory.  We just discard entries (pages or whole relations) as necessary
to keep it under budget.  Obviously, we'd not bother to make entries in
the first place for pages with only a little free space.  Relation entries
might be discarded on a least-recently-used basis.

Accesses to the FSM could create contention problems if we're not careful.
I think this can be dealt with by having each backend remember (in its
relcache entry for a table) the page number of the last page it chose from
the FSM to insert into.  That backend will keep inserting new tuples into
that same page, without touching the FSM, as long as there's room there.
Only then does it go back to the FSM, update or remove that page entry,
and choose another page to start inserting on.  This reduces the access
load on the FSM from once per tuple to once per page.  (Moreover, we can
arrange that successive backends consulting the FSM pick different pages
if possible.  Then, concurrent inserts will tend to go to different pages,
reducing contention for shared buffers; yet any single backend does
sequential inserts in one page, so that a bulk load doesn't cause
disk traffic scattered all over the table.)

The FSM can also cache the overall relation size, saving an lseek kernel
call whenever we do have to extend the relation for lack of internal free
space.  This will help pay for the locking cost of accessing the FSM.


Locking issues
--------------

We will need two extensions to the lock manager:

1. A new lock type that allows concurrent reads and writes
(AccessShareLock, RowShareLock, RowExclusiveLock) but not anything else.
Lazy VACUUM will grab this type of table lock to ensure the table schema
doesn't change under it.  Call it a VacuumLock until we think of a better
name.

2. A "conditional lock" operation that acquires a lock if available, but
doesn't block if not.

The conditional lock will be used by lazy VACUUM to try to upgrade its
VacuumLock to an AccessExclusiveLock at step D (truncate table).  If it's
able to get exclusive lock, it's safe to truncate any unused end pages.
Without exclusive lock, it's not, since there might be concurrent
transactions scanning or inserting into the empty pages.  We do not want
lazy VACUUM to block waiting to do this, since if it does that it will
create a lockout situation (reader/writer transactions will stack up
behind it in the lock queue while everyone waits for the existing
reader/writer transactions to finish).  Better to not do the truncation.

Another place where lazy VACUUM may be unable to do its job completely
is in compaction of space on individual disk pages.  It can physically
move tuples to perform compaction only if there are not currently any
other backends with pointers into that page (which can be tested by
looking to see if the buffer reference count is one).  Again, we punt
and leave the space to be compacted next time if we can't do it right
away.

The fact that inserted/updated tuples might wind up anywhere in the table,
not only at the end, creates no headaches except for heap_update.  That
routine needs buffer locks on both the page containing the old tuple and
the page that will contain the new.  To avoid possible deadlocks between
different backends locking the same two pages in opposite orders, we need
to constrain the lock ordering used by heap_update.  This is doable but
will require slightly more code than is there now.


Index access method improvements
--------------------------------

Presently, VACUUM deletes index tuples by doing a standard index scan
and checking each returned index tuple to see if it points at any of
the tuples to be deleted.  If so, the index AM is called back to delete
the tested index tuple.  This is horribly inefficient: it means one trip
into the index AM (with associated buffer lock/unlock and search overhead)
for each tuple in the index, plus another such trip for each tuple actually
deleted.

This is mainly a problem of a poorly chosen API.  The index AMs should
offer a "bulk delete" call, which is passed a sorted array of main-table
TIDs.  The loop over the index tuples should happen internally to the
index AM.  At least in the case of btree, this could be done by a
sequential scan over the index pages, which avoids the random I/O of an
index-order scan and so should offer additional speedup.

Further out (possibly not for 7.2), we should also look at making the
index AMs responsible for shrinking indexes during deletion, or perhaps
via a separate "vacuum index" API.  This can be done without exclusive
locks on the index --- the original Lehman & Yao concurrent-btrees paper
didn't describe how, but more recent papers show how to do it.  As with
the main tables, I think it's sufficient to recycle freed space within
the index, and not necessarily try to give it back to the OS.

We will also want to look at upgrading the non-btree index types to allow
concurrent operations.  This may be a research problem; I don't expect to
touch that issue for 7.2.  (Hence, lazy VACUUM on tables with non-btree
indexes will still create lockouts until this is addressed.  But note that
the lockout only lasts through step B of the VACUUM, not the whole thing.)


There you have it.  If people like this, I'm prepared to commit to
making it happen for 7.2.  Comments, objections, better ideas?

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From tgl@sss.pgh.pa.us Fri May 18 01:41:34 2001
Return-path: <tgl@sss.pgh.pa.us>
Received: from sss.pgh.pa.us (tgl@sss.pgh.pa.us [216.151.103.158])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4I5fWd18922
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 01:41:32 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4I5fYR14013;
	Fri, 18 May 2001 01:41:34 -0400 (EDT)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <200105180227.f4I2Rpa13258@candle.pha.pa.us>
References: <200105180227.f4I2Rpa13258@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
	message dated "Thu, 17 May 2001 22:27:51 -0400"
Date: Fri, 18 May 2001 01:41:33 -0400
Message-ID: <14010.990164493@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: ORr

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> The only question I have is about the Free Space Map.  It would seem
> better to me if we could get this map closer to the table itself, rather
> than having every table of every database mixed into the same shared
> memory area.  I can just see random table access clearing out most of
> the map cache and perhaps making it less useless.

What random access?  Read transactions will never touch the FSM at all.
As for writes, seems to me the places you are writing are exactly the
places you need info for.

You make a good point, which is that we don't want a schedule-driven
VACUUM to load FSM entries for unused tables into the map at the cost
of throwing out entries that *are* being used.  But it seems to me that
that's easily dealt with if we recognize the risk.

> It would be nice if we could store the map on the first page of the disk
> table, or store it in a flat file per table.  I know both of these ideas
> will not work,

You said it.  What's wrong with shared memory?  You can't get any closer
than shared memory: keeping maps in the files would mean you'd need to
chew up shared-buffer space to get at them.  (And what was that about
random accesses causing your maps to get dropped?  That would happen
for sure if they live in shared buffers.)

Another problem with keeping stuff in the first page: what happens when
the table gets big enough that 8k of map data isn't really enough?
With a shared-memory area, we can fairly easily allocate a variable
amount of space based on total size of a relation vs. total size of
relations under management.

It is true that a shared-memory map would be useless at system startup,
until VACUUM has run and filled in some info.  But I don't see that as
a big drawback.  People who aren't developers like us don't restart
their postmasters every five minutes.

> Another advantage of centralization is that we can record update/delete
> counters per table, helping tell vacuum where to vacuum next.  Vacuum
> roaming around looking for old tuples seems wasteful.

Indeed.  But I thought you were arguing against centralization?

			regards, tom lane

From pgsql-hackers-owner+M8982@postgresql.org Fri May 18 14:13:26 2001
Return-path: <pgsql-hackers-owner+M8982@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4IIDPd08167
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 14:13:25 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4IICbA12956;
	Fri, 18 May 2001 14:12:37 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M8982@postgresql.org)
Received: from ra.sai.msu.su (ra.sai.msu.su [158.250.29.2])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4IFlDA39367
	for <pgsql-hackers@postgresql.org>; Fri, 18 May 2001 11:47:13 -0400 (EDT)
	(envelope-from oleg@sai.msu.su)
Received: from ra (ra [158.250.29.2])
	by ra.sai.msu.su (8.9.3/8.9.3) with ESMTP id SAA17114;
	Fri, 18 May 2001 18:45:46 +0300 (GMT)
Date: Fri, 18 May 2001 18:45:46 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
X-X-Sender: <megera@ra.sai.msu.su>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <12833.990140724@sss.pgh.pa.us>
Message-ID: <Pine.GSO.4.33.0105181830450.12431-100000@ra.sai.msu.su>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

On Thu, 17 May 2001, Tom Lane wrote:

>
> We will also want to look at upgrading the non-btree index types to allow
> concurrent operations.  This may be a research problem; I don't expect to
> touch that issue for 7.2.  (Hence, lazy VACUUM on tables with non-btree
> indexes will still create lockouts until this is addressed.  But note that
> the lockout only lasts through step B of the VACUUM, not the whole thing.)

am I right you plan to work with GiST indexes as well ?
We read a paper "Concurrency and Recovery in Generalized Search Trees"
by Marcel Kornacker, C. Mohan, Joseph Hellerstein
(http://citeseer.nj.nec.com/kornacker97concurrency.html)
and probably we could go in this direction. Right now we're working
on adding of multi-key support to GiST.

btw, I have a question about function gistPageAddItem in gist.c
it just decompress - compress key and calls PageAddItem to
write tuple. We don't understand why do we need this function -
why not use PageAddItem function. Adding multi-key support requires
a lot of work and we don't want to waste our efforts and time.
We already done some tests (gistPageAddItem -> PageAddItem) and
everything is ok.  Bruce, you're enthuasistic in removing unused  code :-)


>
> 			regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

From pgsql-hackers-owner+M8987@postgresql.org Fri May 18 14:54:09 2001
Return-path: <pgsql-hackers-owner+M8987@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4IIs9d11463
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 14:54:09 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4IIrSA32621;
	Fri, 18 May 2001 14:53:28 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M8987@postgresql.org)
Received: from ra.sai.msu.su (ra.sai.msu.su [158.250.29.2])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4IHBIA83136
	for <pgsql-hackers@postgresql.org>; Fri, 18 May 2001 13:11:43 -0400 (EDT)
	(envelope-from oleg@sai.msu.su)
Received: from ra (ra [158.250.29.2])
	by ra.sai.msu.su (8.9.3/8.9.3) with ESMTP id UAA18957;
	Fri, 18 May 2001 20:10:10 +0300 (GMT)
Date: Fri, 18 May 2001 20:10:10 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
X-X-Sender: <megera@ra.sai.msu.su>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <20032.990203902@sss.pgh.pa.us>
Message-ID: <Pine.GSO.4.33.0105181947520.12431-100000@ra.sai.msu.su>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

On Fri, 18 May 2001, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
> > On Thu, 17 May 2001, Tom Lane wrote:
> >> We will also want to look at upgrading the non-btree index types to allow
> >> concurrent operations.
>
> > am I right you plan to work with GiST indexes as well ?
> > We read a paper "Concurrency and Recovery in Generalized Search Trees"
> > by Marcel Kornacker, C. Mohan, Joseph Hellerstein
> > (http://citeseer.nj.nec.com/kornacker97concurrency.html)
> > and probably we could go in this direction. Right now we're working
> > on adding of multi-key support to GiST.

Another paper to read:
"Efficient Concurrency Control in Multidimensional Access Methods"
by Kaushik Chakrabarti
http://www.ics.uci.edu/~kaushik/research/pubs.html

>
> Yes, GIST should be upgraded to do concurrency.  But I have no objection
> if you want to work on multi-key support first.
>
> My feeling is that a few releases from now we will have btree and GIST
> as the preferred/well-supported index types.  Hash and rtree might go
> away altogether --- AFAICS they don't do anything that's not done as
> well or better by btree or GIST, so what's the point of maintaining
> them?

Cool ! We could write rtree (and btree) ops using GiST. We have already
realization of rtree for box ops and there are no problem to write
additional ops for points, polygons etc.

>
> > btw, I have a question about function gistPageAddItem in gist.c
> > it just decompress - compress key and calls PageAddItem to
> > write tuple. We don't understand why do we need this function -
>
> The comment says
>
> ** Take a compressed entry, and install it on a page.  Since we now know
> ** where the entry will live, we decompress it and recompress it using
> ** that knowledge (some compression routines may want to fish around
> ** on the page, for example, or do something special for leaf nodes.)
>
> Are you prepared to say that you will no longer support the ability for
> GIST compression routines to do those things?  That seems shortsighted.
>

No-no !!! we don't intend to lose that (compression) functionality.

there are several reason we want to eliminate gistPageAddItem:
1. It seems there are no examples where compress uses information about
   the page.
2. There is some discrepancy between calculation of free space on page and
   the size of tuple saved on page - calculation of free space on page
   by gistNoSpace uses compressed tuple but tuple itself saved after
   recompression. It's possible that size of tupple could changed
   after recompression.
3. decompress/compress could slowdown insert  because it happens
   for every tuple.
4. Currently gistPageAddItem is broken because it's not toast safe
   (see call gist_tuple_replacekey in gistPageAddItem)

Right now we use  #define GIST_PAGEADDITEM in gist.c and
working with original PageAddItem. If people insist on gistPageAddItem
we'll totally rewrite it. But for now we have enough job to do.


> 			regards, tom lane
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M9001@postgresql.org Fri May 18 20:22:26 2001
Return-path: <pgsql-hackers-owner+M9001@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J0MPd19637
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 20:22:25 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J0LsA39106;
	Fri, 18 May 2001 20:21:54 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9001@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J098A35204
	for <pgsql-hackers@postgreSQL.org>; Fri, 18 May 2001 20:09:08 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 61421 invoked by uid 503); 19 May 2001 00:09:04 -0000
Received: from unknown (HELO sectorbase2.sectorbase.com) (192.168.254.2)
  by 192.168.254.252 with SMTP; 19 May 2001 00:09:04 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CCVAC>; Fri, 18 May 2001 17:08:14 -0700
Message-ID: <3705826352029646A3E91C53F7189E3201662C@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Fri, 18 May 2001 17:08:07 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr

> I have been thinking about the problem of VACUUM and how we
> might fix it for 7.2.  Vadim has suggested that we should
> attack this by implementing an overwriting storage manager
> and transaction UNDO, but I'm not totally comfortable with
> that approach: it seems to me that it's an awfully large
> change in the way Postgres works.

I'm not sure if we should implement overwriting smgr at all.
I was/is going to solve space reusing problem with non-overwriting
one, though I'm sure that we have to reimplement it (> 1 table
per data file, stored on disk FSM etc).

> Second: if VACUUM can run in the background, then there's no
> reason not to run it fairly frequently. In fact, it could become
> an automatically scheduled activity like CHECKPOINT is now,
> or perhaps even a continuously running daemon (which was the
> original conception of it at Berkeley, BTW).

And original authors concluded that daemon was very slow in
reclaiming dead space, BTW.

> 3. Lazy VACUUM processes a table in five stages:
>    A. Scan relation looking for dead tuples;...
>    B. Remove index entries for the dead tuples...
>    C. Physically delete dead tuples and compact free space...
>    D. Truncate any completely-empty pages at relation's end.
>    E. Create/update FSM entry for the table.
...
> If a tuple is dead, we care not whether its index entries are still
> around or not; so there's no risk to logical consistency.

What does this sentence mean? We canNOT remove dead heap tuple untill
we know that there are no index tuples referencing it and your A,B,C
reflect this, so ..?

> Another place where lazy VACUUM may be unable to do its job completely
> is in compaction of space on individual disk pages.  It can physically
> move tuples to perform compaction only if there are not currently any
> other backends with pointers into that page (which can be tested by
> looking to see if the buffer reference count is one).  Again, we punt
> and leave the space to be compacted next time if we can't do it right
> away.

We could keep share buffer lock (or add some other kind of lock)
untill tuple projected - after projection we need not to read data
for fetched tuple from shared buffer and time between fetching
tuple and projection is very short, so keeping lock on buffer will
not impact concurrency significantly.

Or we could register callback cleanup function with buffer so bufmgr
would call it when refcnt drops to 0.

> Presently, VACUUM deletes index tuples by doing a standard index
> scan and checking each returned index tuple to see if it points
> at any of the tuples to be deleted. If so, the index AM is called
> back to delete the tested index tuple. This is horribly inefficient:
...
> This is mainly a problem of a poorly chosen API. The index AMs
> should offer a "bulk delete" call, which is passed a sorted array
> of main-table TIDs. The loop over the index tuples should happen
> internally to the index AM.

I agreed with others who think that the main problem of index cleanup
is reading all index data pages to remove some index tuples. You told
youself about partial heap scanning - so for each scanned part of table
you'll have to read all index pages again and again - very good way to
trash buffer pool with big indices.

Well, probably it's ok for first implementation and you'll win some CPU
with "bulk delete" - I'm not sure how much, though, and there is more
significant issue with index cleanup if table is not locked exclusively:
concurrent index scan returns tuple (and unlock index page), heap_fetch
reads table row and find that it's dead, now index scan *must* find
current index tuple to continue, but background vacuum could already
remove that index tuple => elog(FATAL, "_bt_restscan: my bits moved...");

Two ways: hold index page lock untill heap tuple is checked or (rough
schema)
store info in shmem (just IndexTupleData.t_tid and flag) that an index tuple
is used by some scan so cleaner could change stored TID (get one from prev
index tuple) and set flag to help scan restore its current position on
return.

I'm particularly interested in discussing this issue because of it must be
resolved for UNDO and chosen way will affect in what volume we'll be able
to implement dirty reads (first way doesn't allow to implement them in full
- ie selects with joins, - but good enough to resolve RI constraints
concurrency issue).

> There you have it.  If people like this, I'm prepared to commit to
> making it happen for 7.2.  Comments, objections, better ideas?

Well, my current TODO looks as (ORDER BY PRIORITY DESC):

1. UNDO;
2. New SMGR;
3. Space reusing.

and I cannot commit at this point anything about 3. So, why not to refine
vacuum if you want it. I, personally, was never be able to convince myself
to spend time for this.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From pgsql-hackers-owner+M9006@postgresql.org Fri May 18 21:04:21 2001
Return-path: <pgsql-hackers-owner+M9006@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J14Kd22405
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 21:04:20 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J13gA51252;
	Fri, 18 May 2001 21:03:42 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9006@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4J0w5A49229
	for <pgsql-hackers@postgreSQL.org>; Fri, 18 May 2001 20:58:05 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4J0RFR27251;
	Fri, 18 May 2001 20:27:16 -0400 (EDT)
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <3705826352029646A3E91C53F7189E3201662C@sectorbase2.sectorbase.com>
References: <3705826352029646A3E91C53F7189E3201662C@sectorbase2.sectorbase.com>
Comments: In-reply-to "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
	message dated "Fri, 18 May 2001 17:08:07 -0700"
Date: Fri, 18 May 2001 20:27:15 -0400
Message-ID: <27248.990232035@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> If a tuple is dead, we care not whether its index entries are still
>> around or not; so there's no risk to logical consistency.

> What does this sentence mean? We canNOT remove dead heap tuple untill
> we know that there are no index tuples referencing it and your A,B,C
> reflect this, so ..?

Sorry if it wasn't clear.  I meant that if the vacuum process fails
after removing an index tuple but before removing the (dead) heap tuple
it points to, there's no need to try to undo.  That state is OK, and
when we next get a chance to vacuum we'll still be able to finish
removing the heap tuple.

>> Another place where lazy VACUUM may be unable to do its job completely
>> is in compaction of space on individual disk pages.  It can physically
>> move tuples to perform compaction only if there are not currently any
>> other backends with pointers into that page (which can be tested by
>> looking to see if the buffer reference count is one).  Again, we punt
>> and leave the space to be compacted next time if we can't do it right
>> away.

> We could keep share buffer lock (or add some other kind of lock)
> untill tuple projected - after projection we need not to read data
> for fetched tuple from shared buffer and time between fetching
> tuple and projection is very short, so keeping lock on buffer will
> not impact concurrency significantly.

Or drop the pin on the buffer to show we no longer have a pointer to it.
I'm not sure that the time to do projection is short though --- what
if there are arbitrary user-defined functions in the quals or the
projection targetlist?

> Or we could register callback cleanup function with buffer so bufmgr
> would call it when refcnt drops to 0.

Hmm ... might work.  There's no guarantee that the refcnt would drop to
zero before the current backend exits, however.  Perhaps set a flag in
the shared buffer header, and the last guy to drop his pin is supposed
to do the cleanup?  But then you'd be pushing VACUUM's work into
productive transactions, which is probably not the way to go.

>> This is mainly a problem of a poorly chosen API. The index AMs
>> should offer a "bulk delete" call, which is passed a sorted array
>> of main-table TIDs. The loop over the index tuples should happen
>> internally to the index AM.

> I agreed with others who think that the main problem of index cleanup
> is reading all index data pages to remove some index tuples.

For very small numbers of tuples that might be true.  But I'm not
convinced it's worth worrying about.  If there aren't many tuples to
be freed, perhaps VACUUM shouldn't do anything at all.

> Well, probably it's ok for first implementation and you'll win some CPU
> with "bulk delete" - I'm not sure how much, though, and there is more
> significant issue with index cleanup if table is not locked exclusively:
> concurrent index scan returns tuple (and unlock index page), heap_fetch
> reads table row and find that it's dead, now index scan *must* find
> current index tuple to continue, but background vacuum could already
> remove that index tuple => elog(FATAL, "_bt_restscan: my bits moved...");

Hm.  Good point ...

> Two ways: hold index page lock untill heap tuple is checked or (rough
> schema)
> store info in shmem (just IndexTupleData.t_tid and flag) that an index tuple
> is used by some scan so cleaner could change stored TID (get one from prev
> index tuple) and set flag to help scan restore its current position on
> return.

Another way is to mark the index tuple "gone but not forgotten", so to
speak --- mark it dead without removing it.  (We could know that we need
to do that if we see someone else has a buffer pin on the index page.)
In this state, the index scan coming back to work would still be allowed
to find the index tuple, but no other index scan would stop on the
tuple.  Later passes of vacuum would eventually remove the index tuple,
whenever vacuum happened to pass through at an instant where no one has
a pin on that index page.

None of these seem real clean though.  Needs more thought.

> Well, my current TODO looks as (ORDER BY PRIORITY DESC):

> 1. UNDO;
> 2. New SMGR;
> 3. Space reusing.

> and I cannot commit at this point anything about 3. So, why not to refine
> vacuum if you want it. I, personally, was never be able to convince myself
> to spend time for this.

Okay, good.  I was worried that this idea would conflict with what you
were doing, but it seems it won't.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From vmikheev@SECTORBASE.COM Fri May 18 21:11:10 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4J1B9d22806
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 21:11:09 -0400 (EDT)
Received: (qmail 74783 invoked by uid 503); 19 May 2001 01:11:07 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 19 May 2001 01:11:07 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CCV1R>; Fri, 18 May 2001 18:10:16 -0700
Message-ID: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Bruce Momjian'" <pgman@candle.pha.pa.us>
cc: "'Tom Lane'" <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Fri, 18 May 2001 18:10:10 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR

> Vadim, can you remind me what UNDO is used for?

Ok, last reminder -:))

On transaction abort, read WAL records and undo (rollback)
changes made in storage. Would allow:

1. Reclaim space allocated by aborted transactions.
2. Implement SAVEPOINTs.
   Just to remind -:) - in the event of error discovered by server
   - duplicate key, deadlock, command mistyping, etc, - transaction
   will be rolled back to the nearest implicit savepoint setted
   just before query execution; - or transaction can be aborted by
   ROLLBACK TO <savepoint_name> command to some explicit savepoint
   setted by user. Transaction rolled back to savepoint may be continued.
3. Reuse transaction IDs on postmaster restart.
4. Split pg_log into small files with ability to remove old ones (which
   do not hold statuses for any running transactions).

Vadim

From pgsql-hackers-owner+M9011@postgresql.org Fri May 18 21:44:12 2001
Return-path: <pgsql-hackers-owner+M9011@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J1iBd01588
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 21:44:11 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J1hmA62689;
	Fri, 18 May 2001 21:43:48 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9011@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4J1bmA60941
	for <pgsql-hackers@postgresql.org>; Fri, 18 May 2001 21:37:48 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4J1bbR27748;
	Fri, 18 May 2001 21:37:37 -0400 (EDT)
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
cc: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com>
References: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com>
Comments: In-reply-to "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
	message dated "Fri, 18 May 2001 18:10:10 -0700"
Date: Fri, 18 May 2001 21:37:37 -0400
Message-ID: <27745.990236257@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> Vadim, can you remind me what UNDO is used for?
> Ok, last reminder -:))

> On transaction abort, read WAL records and undo (rollback)
> changes made in storage. Would allow:

> 1. Reclaim space allocated by aborted transactions.
> 2. Implement SAVEPOINTs.
>    Just to remind -:) - in the event of error discovered by server
>    - duplicate key, deadlock, command mistyping, etc, - transaction
>    will be rolled back to the nearest implicit savepoint setted
>    just before query execution; - or transaction can be aborted by
>    ROLLBACK TO <savepoint_name> command to some explicit savepoint
>    setted by user. Transaction rolled back to savepoint may be continued.
> 3. Reuse transaction IDs on postmaster restart.
> 4. Split pg_log into small files with ability to remove old ones (which
>    do not hold statuses for any running transactions).

Hm.  On the other hand, relying on WAL for undo means you cannot drop
old WAL segments that contain records for any open transaction.  We've
already seen several complaints that the WAL logs grow unmanageably huge
when there is a long-running transaction, and I think we'll see a lot
more.

It would be nicer if we could drop WAL records after a checkpoint or two,
even in the presence of long-running transactions.  We could do that if
we were only relying on them for crash recovery and not for UNDO.

Looking at the advantages:

1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
make lightweight VACUUM work well.  (I definitely don't like the idea
that after a very long transaction fails and aborts, I'd have to wait
another very long time for UNDO to do its thing before I could get on
with my work.  Would much rather have the space reclamation happen in
background.)

2. SAVEPOINTs would be awfully nice to have, I agree.

3. Reusing xact IDs would be nice, but there's an answer with a lot less
impact on the system: go to 8-byte xact IDs.  Having to shut down the
postmaster when you approach the 4Gb transaction mark isn't going to
impress people who want a 24x7 commitment, anyway.

4. Recycling pg_log would be nice too, but we've already discussed other
hacks that might allow pg_log to be kept finite without depending on
UNDO (or requiring postmaster restarts, IIRC).

I'm sort of thinking that undoing back to a savepoint is the only real
usefulness of WAL-based UNDO.  Is it practical to preserve the WAL log
just back to the last savepoint in each xact, not the whole xact?

Another thought: do we need WAL UNDO at all to implement savepoints?
Is there some way we could do them like nested transactions, wherein
each savepoint-to-savepoint segment is given its own transaction number?
Committing multiple xact IDs at once might be a little tricky, but it
seems like a narrow, soluble problem.  Implementing UNDO without
creating lots of performance issues looks a lot harder.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From tgl@sss.pgh.pa.us Fri May 18 21:37:41 2001
Return-path: <tgl@sss.pgh.pa.us>
Received: from sss.pgh.pa.us (tgl@sss.pgh.pa.us [216.151.103.158])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J1bdd26573
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 21:37:39 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4J1bbR27748;
	Fri, 18 May 2001 21:37:37 -0400 (EDT)
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
cc: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com>
References: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com>
Comments: In-reply-to "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
	message dated "Fri, 18 May 2001 18:10:10 -0700"
Date: Fri, 18 May 2001 21:37:37 -0400
Message-ID: <27745.990236257@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: ORr

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> Vadim, can you remind me what UNDO is used for?
> Ok, last reminder -:))

> On transaction abort, read WAL records and undo (rollback)
> changes made in storage. Would allow:

> 1. Reclaim space allocated by aborted transactions.
> 2. Implement SAVEPOINTs.
>    Just to remind -:) - in the event of error discovered by server
>    - duplicate key, deadlock, command mistyping, etc, - transaction
>    will be rolled back to the nearest implicit savepoint setted
>    just before query execution; - or transaction can be aborted by
>    ROLLBACK TO <savepoint_name> command to some explicit savepoint
>    setted by user. Transaction rolled back to savepoint may be continued.
> 3. Reuse transaction IDs on postmaster restart.
> 4. Split pg_log into small files with ability to remove old ones (which
>    do not hold statuses for any running transactions).

Hm.  On the other hand, relying on WAL for undo means you cannot drop
old WAL segments that contain records for any open transaction.  We've
already seen several complaints that the WAL logs grow unmanageably huge
when there is a long-running transaction, and I think we'll see a lot
more.

It would be nicer if we could drop WAL records after a checkpoint or two,
even in the presence of long-running transactions.  We could do that if
we were only relying on them for crash recovery and not for UNDO.

Looking at the advantages:

1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
make lightweight VACUUM work well.  (I definitely don't like the idea
that after a very long transaction fails and aborts, I'd have to wait
another very long time for UNDO to do its thing before I could get on
with my work.  Would much rather have the space reclamation happen in
background.)

2. SAVEPOINTs would be awfully nice to have, I agree.

3. Reusing xact IDs would be nice, but there's an answer with a lot less
impact on the system: go to 8-byte xact IDs.  Having to shut down the
postmaster when you approach the 4Gb transaction mark isn't going to
impress people who want a 24x7 commitment, anyway.

4. Recycling pg_log would be nice too, but we've already discussed other
hacks that might allow pg_log to be kept finite without depending on
UNDO (or requiring postmaster restarts, IIRC).

I'm sort of thinking that undoing back to a savepoint is the only real
usefulness of WAL-based UNDO.  Is it practical to preserve the WAL log
just back to the last savepoint in each xact, not the whole xact?

Another thought: do we need WAL UNDO at all to implement savepoints?
Is there some way we could do them like nested transactions, wherein
each savepoint-to-savepoint segment is given its own transaction number?
Committing multiple xact IDs at once might be a little tricky, but it
seems like a narrow, soluble problem.  Implementing UNDO without
creating lots of performance issues looks a lot harder.

			regards, tom lane

From pgsql-hackers-owner+M9012@postgresql.org Fri May 18 22:02:39 2001
Return-path: <pgsql-hackers-owner+M9012@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J22dd03438
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 22:02:39 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J22BA67912;
	Fri, 18 May 2001 22:02:11 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9012@postgresql.org)
Received: from store.z.zembu.com (nat.zembu.com [209.128.96.253])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4J1uRA66065
	for <pgsql-hackers@postgresql.org>; Fri, 18 May 2001 21:56:27 -0400 (EDT)
	(envelope-from ncm@zembu.com)
Received: by store.z.zembu.com (Postfix, from userid 509)
	id A77BEFDFF; Fri, 18 May 2001 18:56:25 -0700 (PDT)
Date: Fri, 18 May 2001 18:56:25 -0700
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
Message-ID: <20010518185625.F18121@store.zembu.com>
Reply-To: pgsql-hackers@postgresql.org
References: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com>; from vmikheev@SECTORBASE.COM on Fri, May 18, 2001 at 06:10:10PM -0700
From: ncm@zembu.com (Nathan Myers)
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

On Fri, May 18, 2001 at 06:10:10PM -0700, Mikheev, Vadim wrote:
> > Vadim, can you remind me what UNDO is used for?
>
> Ok, last reminder -:))
>
> On transaction abort, read WAL records and undo (rollback)
> changes made in storage. Would allow:
>
> 1. Reclaim space allocated by aborted transactions.
> 2. Implement SAVEPOINTs.
>    Just to remind -:) - in the event of error discovered by server
>    - duplicate key, deadlock, command mistyping, etc, - transaction
>    will be rolled back to the nearest implicit savepoint setted
>    just before query execution; - or transaction can be aborted by
>    ROLLBACK TO <savepoint_name> command to some explicit savepoint
>    setted by user. Transaction rolled back to savepoint may be continued.
> 3. Reuse transaction IDs on postmaster restart.
> 4. Split pg_log into small files with ability to remove old ones (which
>    do not hold statuses for any running transactions).

I missed the original discussions; apologies if this has already been
beaten into the ground.  But... mightn't sub-transactions be a
better-structured way to expose this service?

Nathan Myers
ncm@zembu.com

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9016@postgresql.org Fri May 18 23:17:40 2001
Return-path: <pgsql-hackers-owner+M9016@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J3Hed15250
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 23:17:40 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J3HGA88247;
	Fri, 18 May 2001 23:17:16 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9016@postgresql.org)
Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4J3CwA86943
	for <pgsql-hackers@postgresql.org>; Fri, 18 May 2001 23:12:58 -0400 (EDT)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4J3Cfs14576;
	Fri, 18 May 2001 23:12:41 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200105190312.f4J3Cfs14576@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <27745.990236257@sss.pgh.pa.us> "from Tom Lane at May 18, 2001 09:37:37
	pm"
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 18 May 2001 23:12:41 -0400 (EDT)
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Another thought: do we need WAL UNDO at all to implement savepoints?
> Is there some way we could do them like nested transactions, wherein
> each savepoint-to-savepoint segment is given its own transaction number?
> Committing multiple xact IDs at once might be a little tricky, but it
> seems like a narrow, soluble problem.  Implementing UNDO without
> creating lots of performance issues looks a lot harder.

I am confused why we can't implement subtransactions as part of our
command counter?  The counter is already 4 bytes long.  Couldn't we
rollback to counter number X-10?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M9017@postgresql.org Fri May 18 23:20:00 2001
Return-path: <pgsql-hackers-owner+M9017@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J3Jxd15384
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 23:19:59 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J3JcA88917;
	Fri, 18 May 2001 23:19:38 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9017@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4J3FOA87731
	for <pgsql-hackers@postgresql.org>; Fri, 18 May 2001 23:15:24 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4J3FER28239;
	Fri, 18 May 2001 23:15:14 -0400 (EDT)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <200105190312.f4J3Cfs14576@candle.pha.pa.us>
References: <200105190312.f4J3Cfs14576@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
	message dated "Fri, 18 May 2001 23:12:41 -0400"
Date: Fri, 18 May 2001 23:15:13 -0400
Message-ID: <28236.990242113@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am confused why we can't implement subtransactions as part of our
> command counter?  The counter is already 4 bytes long.  Couldn't we
> rollback to counter number X-10?

That'd work within your own transaction, but not from outside it.
After you commit, how will other backends know which command-counter
values of your transaction to believe, and which not?

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From tgl@sss.pgh.pa.us Fri May 18 23:15:13 2001
Return-path: <tgl@sss.pgh.pa.us>
Received: from sss.pgh.pa.us (tgl@sss.pgh.pa.us [216.151.103.158])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J3FCd15028
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 23:15:12 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4J3FER28239;
	Fri, 18 May 2001 23:15:14 -0400 (EDT)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <200105190312.f4J3Cfs14576@candle.pha.pa.us>
References: <200105190312.f4J3Cfs14576@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
	message dated "Fri, 18 May 2001 23:12:41 -0400"
Date: Fri, 18 May 2001 23:15:13 -0400
Message-ID: <28236.990242113@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: ORr

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am confused why we can't implement subtransactions as part of our
> command counter?  The counter is already 4 bytes long.  Couldn't we
> rollback to counter number X-10?

That'd work within your own transaction, but not from outside it.
After you commit, how will other backends know which command-counter
values of your transaction to believe, and which not?

			regards, tom lane

From pgsql-hackers-owner+M9020@postgresql.org Fri May 18 23:44:09 2001
Return-path: <pgsql-hackers-owner+M9020@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J3i8d16942
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 23:44:08 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J3hcA96911;
	Fri, 18 May 2001 23:43:38 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9020@postgresql.org)
Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4J3U8A92747
	for <pgsql-hackers@postgresql.org>; Fri, 18 May 2001 23:30:08 -0400 (EDT)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4J3TtI15796;
	Fri, 18 May 2001 23:29:55 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200105190329.f4J3TtI15796@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <28236.990242113@sss.pgh.pa.us> "from Tom Lane at May 18, 2001 11:15:13
	pm"
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 18 May 2001 23:29:55 -0400 (EDT)
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am confused why we can't implement subtransactions as part of our
> > command counter?  The counter is already 4 bytes long.  Couldn't we
> > rollback to counter number X-10?
>
> That'd work within your own transaction, but not from outside it.
> After you commit, how will other backends know which command-counter
> values of your transaction to believe, and which not?

Seems we would have to store the command counters for the parts of the
transaction that committed, or the ones that were rolled back.  Yuck.

I hate to add UNDO complexity just for subtransactions.

Hey, I have an idea.  Can we do subtransactions as separate transactions
(as Tom mentioned), and put the subtransaction id's in the WAL, so they
an be safely committed/rolledback as a group?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

From pgsql-hackers-owner+M9023@postgresql.org Fri May 18 23:56:09 2001
Return-path: <pgsql-hackers-owner+M9023@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J3u8d17382
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 23:56:08 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J3tgA01116;
	Fri, 18 May 2001 23:55:42 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9023@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4J3hqA97002
	for <pgsql-hackers@postgresql.org>; Fri, 18 May 2001 23:43:52 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4J3hgR28484;
	Fri, 18 May 2001 23:43:42 -0400 (EDT)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <200105190329.f4J3TtI15796@candle.pha.pa.us>
References: <200105190329.f4J3TtI15796@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
	message dated "Fri, 18 May 2001 23:29:55 -0400"
Date: Fri, 18 May 2001 23:43:42 -0400
Message-ID: <28481.990243822@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Hey, I have an idea.  Can we do subtransactions as separate transactions
> (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> an be safely committed/rolledback as a group?

It's not quite that easy: all the subtransactions have to commit at
*the same time* from the point of view of other xacts, or you have
consistency problems.  So there'd need to be more xact-commit mechanism
than there is now.  Snapshots are also interesting; we couldn't use a
single xact ID per backend to show the open-transaction state.

WAL doesn't really enter into it AFAICS...

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From tgl@sss.pgh.pa.us Fri May 18 23:43:41 2001
Return-path: <tgl@sss.pgh.pa.us>
Received: from sss.pgh.pa.us (tgl@sss.pgh.pa.us [216.151.103.158])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J3hed16915
	for <pgman@candle.pha.pa.us>; Fri, 18 May 2001 23:43:40 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4J3hgR28484;
	Fri, 18 May 2001 23:43:42 -0400 (EDT)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <200105190329.f4J3TtI15796@candle.pha.pa.us>
References: <200105190329.f4J3TtI15796@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
	message dated "Fri, 18 May 2001 23:29:55 -0400"
Date: Fri, 18 May 2001 23:43:42 -0400
Message-ID: <28481.990243822@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: ORr

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Hey, I have an idea.  Can we do subtransactions as separate transactions
> (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> an be safely committed/rolledback as a group?

It's not quite that easy: all the subtransactions have to commit at
*the same time* from the point of view of other xacts, or you have
consistency problems.  So there'd need to be more xact-commit mechanism
than there is now.  Snapshots are also interesting; we couldn't use a
single xact ID per backend to show the open-transaction state.

WAL doesn't really enter into it AFAICS...

			regards, tom lane

From pgsql-hackers-owner+M9024@postgresql.org Sat May 19 00:05:43 2001
Return-path: <pgsql-hackers-owner+M9024@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J45hd18105
	for <pgman@candle.pha.pa.us>; Sat, 19 May 2001 00:05:43 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J457A05136;
	Sat, 19 May 2001 00:05:07 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9024@postgresql.org)
Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4J3vEA01609
	for <pgsql-hackers@postgresql.org>; Fri, 18 May 2001 23:57:14 -0400 (EDT)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4J3v1h17419;
	Fri, 18 May 2001 23:57:01 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200105190357.f4J3v1h17419@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <28481.990243822@sss.pgh.pa.us> "from Tom Lane at May 18, 2001 11:43:42
	pm"
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 18 May 2001 23:57:01 -0400 (EDT)
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Hey, I have an idea.  Can we do subtransactions as separate transactions
> > (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> > an be safely committed/rolledback as a group?
>
> It's not quite that easy: all the subtransactions have to commit at
> *the same time* from the point of view of other xacts, or you have
> consistency problems.  So there'd need to be more xact-commit mechanism
> than there is now.  Snapshots are also interesting; we couldn't use a
> single xact ID per backend to show the open-transaction state.

Yes, I knew that was going to come up that you have to add a lock to the
pg_log that is only in affect when someone is commiting a transaction
with subtransactions.  Normal transactions get read/sharedlock, while
subtransaction needs exclusive/writelock.

Seems a lot easier than UNDO.  Vadim you mentioned UNDO would allow
space reuse for rolledback transactions, but in most cases the space
reuse is going to be for old copies of committed transactions, right?
Were you going to use WAL to get free space from old copies too?

Vadim, I think I am missing something.  You mentioned UNDO would be used
for these cases and I don't understand the purpose of adding what would
seem to be a pretty complex capability:

> 1. Reclaim space allocated by aborted transactions.

Is there really a lot to be saved here vs. old tuples of committed
transactions?

> 2. Implement SAVEPOINTs.
>    Just to remind -:) - in the event of error discovered by server
>    - duplicate key, deadlock, command mistyping, etc, - transaction
>    will be rolled back to the nearest implicit savepoint setted
>    just before query execution; - or transaction can be aborted by
>    ROLLBACK TO <savepoint_name> command to some explicit savepoint
>    setted by user. Transaction rolled back to savepoint may be
>    continued.

Discussing, perhaps using multiple transactions.

> 3. Reuse transaction IDs on postmaster restart.

Doesn't seem like a huge win.

> 4. Split pg_log into small files with ability to remove old ones (which
>    do not hold statuses for any running transactions).

That one is interesting.  Seems the only workaround for that would be to
allow a global scan of all databases and tables to set commit flags,
then shrink pg_log and set XID offset as start of file.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From pgsql-hackers-owner+M9028@postgresql.org Sat May 19 05:00:37 2001
Return-path: <pgsql-hackers-owner+M9028@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4J90bd29010
	for <pgman@candle.pha.pa.us>; Sat, 19 May 2001 05:00:37 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4J8sGA10373;
	Sat, 19 May 2001 04:54:16 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9028@postgresql.org)
Received: from bering.webline.dk (83.adsl0.kh.worldonline.dk [213.237.10.83])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4J8coA09586
	for <pgsql-hackers@postgreSQL.org>; Sat, 19 May 2001 04:38:53 -0400 (EDT)
	(envelope-from kar@webline.dk)
Received: from bering (localhost [127.0.0.1])
	by bering.webline.dk (8.11.2/8.10.2/SuSE Linux 8.10.0-0.3) with SMTP id f4J8cUq15144
	for <pgsql-hackers@postgreSQL.org>; Sat, 19 May 2001 10:38:30 +0200
Content-Type: text/plain;
  charset="iso-8859-1"
From: Kaare Rasmussen <kar@webline.dk>
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
Date: Sat, 19 May 2001 10:38:29 +0200
X-Mailer: KMail [version 1.2]
References: <12833.990140724@sss.pgh.pa.us>
In-Reply-To: <12833.990140724@sss.pgh.pa.us>
MIME-Version: 1.0
Message-ID: <01051910382902.14217@bering>
Content-Transfer-Encoding: 8bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Second: if VACUUM can run in the background, then there's no reason not
> to run it fairly frequently.  In fact, it could become an automatically
> scheduled activity like CHECKPOINT is now, or perhaps even a continuously
> running daemon (which was the original conception of it at Berkeley, BTW).

Maybe it's obvious, but I'd like to mention that you need some way of setting
priority. If it's a daemon, or a process, you an nice it. If not, you need to
implement something by yourself.

--
Kaare Rasmussen            --Linux, spil,--        Tlf:        3816 2582
Kaki Data                tshirts, merchandize      Fax:        3816 2501
Howitzvej 75               <20>ben 14.00-18.00        Web:      www.suse.dk
2000 Frederiksberg        L<>rdag 11.00-17.00       Email: kar@webline.dk

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgman Sat May 19 08:12:28 2001
Return-path: <pgman>
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4JCCSc15349;
	Sat, 19 May 2001 08:12:28 -0400 (EDT)
From: Bruce Momjian <pgman>
Message-ID: <200105191212.f4JCCSc15349@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <200105190357.f4J3v1h17419@candle.pha.pa.us> "from Bruce Momjian
	at May 18, 2001 11:57:01 pm"
To: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 19 May 2001 08:12:28 -0400 (EDT)
cc: Tom Lane <tgl@sss.pgh.pa.us>, "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Status: OR

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Hey, I have an idea.  Can we do subtransactions as separate transactions
> > > (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> > > an be safely committed/rolledback as a group?
> >
> > It's not quite that easy: all the subtransactions have to commit at
> > *the same time* from the point of view of other xacts, or you have
> > consistency problems.  So there'd need to be more xact-commit mechanism
> > than there is now.  Snapshots are also interesting; we couldn't use a
> > single xact ID per backend to show the open-transaction state.
>
> Yes, I knew that was going to come up that you have to add a lock to the
> pg_log that is only in affect when someone is commiting a transaction
> with subtransactions.  Normal transactions get read/sharedlock, while
> subtransaction needs exclusive/writelock.

I was wrong here.  Multiple backends can write to pg_log at the same
time, even subtraction ones.  It is just that no backend can read from
pg_log during a subtransaction commit.  Acctually, they can if the are
reading a transaction status that is less than the minium active
transaction id, see GetXmaxRecent().

Doesn't seem too bad.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

From pgsql-hackers-owner+M9034@postgresql.org Sat May 19 08:18:54 2001
Return-path: <pgsql-hackers-owner+M9034@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4JCIsd15698
	for <pgman@candle.pha.pa.us>; Sat, 19 May 2001 08:18:54 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4JCI8A86106;
	Sat, 19 May 2001 08:18:08 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9034@postgresql.org)
Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4JCDoA84410
	for <pgsql-hackers@postgresql.org>; Sat, 19 May 2001 08:13:50 -0400 (EDT)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4JCCSc15349;
	Sat, 19 May 2001 08:12:28 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200105191212.f4JCCSc15349@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <200105190357.f4J3v1h17419@candle.pha.pa.us> "from Bruce Momjian
	at May 18, 2001 11:57:01 pm"
To: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 19 May 2001 08:12:28 -0400 (EDT)
cc: Tom Lane <tgl@sss.pgh.pa.us>, "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Hey, I have an idea.  Can we do subtransactions as separate transactions
> > > (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> > > an be safely committed/rolledback as a group?
> >
> > It's not quite that easy: all the subtransactions have to commit at
> > *the same time* from the point of view of other xacts, or you have
> > consistency problems.  So there'd need to be more xact-commit mechanism
> > than there is now.  Snapshots are also interesting; we couldn't use a
> > single xact ID per backend to show the open-transaction state.
>
> Yes, I knew that was going to come up that you have to add a lock to the
> pg_log that is only in affect when someone is commiting a transaction
> with subtransactions.  Normal transactions get read/sharedlock, while
> subtransaction needs exclusive/writelock.

I was wrong here.  Multiple backends can write to pg_log at the same
time, even subtraction ones.  It is just that no backend can read from
pg_log during a subtransaction commit.  Acctually, they can if the are
reading a transaction status that is less than the minium active
transaction id, see GetXmaxRecent().

Doesn't seem too bad.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M9036@postgresql.org Sat May 19 08:30:41 2001
Return-path: <pgsql-hackers-owner+M9036@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4JCUed16878
	for <pgman@candle.pha.pa.us>; Sat, 19 May 2001 08:30:41 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4JCTRA90288;
	Sat, 19 May 2001 08:29:27 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9036@postgresql.org)
Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4JCOxA88564
	for <pgsql-hackers@postgresql.org>; Sat, 19 May 2001 08:24:59 -0400 (EDT)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4JCNb815894;
	Sat, 19 May 2001 08:23:37 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200105191223.f4JCNb815894@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <28481.990243822@sss.pgh.pa.us> "from Tom Lane at May 18, 2001 11:43:42
	pm"
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 19 May 2001 08:23:37 -0400 (EDT)
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Hey, I have an idea.  Can we do subtransactions as separate transactions
> > (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> > an be safely committed/rolledback as a group?
>
> It's not quite that easy: all the subtransactions have to commit at
> *the same time* from the point of view of other xacts, or you have
> consistency problems.  So there'd need to be more xact-commit mechanism
> than there is now.  Snapshots are also interesting; we couldn't use a
> single xact ID per backend to show the open-transaction state.

OK, I have another idea about subtransactions as multiple transaction
ids.

I realize that the snapshot problem would be an issue, because now
instead of looking at your own transaction id, you have to look at
multiple transaction ids.  We could do this as a List of xid's, but that
will not scale well.

My idea is for a subtransaction backend to have its own pg_log-style
memory area that shows which transactions it owns and has
committed/aborted.  It can have the log start at its start xid, and can
look in pg_log and in there anytime it needs to check the visibility of
a transaction greater than its minium xid.  16k can hold 64k xids, so it
seems it should scale pretty well.  (Each xid is two bits in pg_log.)

In fact, multi-query transactions are just a special case of
subtransactions, where all previous subtransactions are
committed/visible.  We could use the same pg_log-style memory area for
multi-query transactions, eliminating the command counter  and saving 8
bytes overhead per tuple.

Currently, the XMIN/XMAX command counters are used only by the current
transaction, and they are useless once the transaction finishes and take
up 8 bytes on disk.

So, this idea gets us subtransactions and saves 8 bytes overhead.  This
reduces our per-tuple overhead from 36 to 28 bytes, a 22% reduction!

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From tgl@sss.pgh.pa.us Sat May 19 11:13:12 2001
Return-path: <tgl@sss.pgh.pa.us>
Received: from sss.pgh.pa.us (tgl@sss.pgh.pa.us [216.151.103.158])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4JFDBd10204
	for <pgman@candle.pha.pa.us>; Sat, 19 May 2001 11:13:11 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4JFDBR00135;
	Sat, 19 May 2001 11:13:11 -0400 (EDT)
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <200105191223.f4JCNb815894@candle.pha.pa.us>
References: <200105191223.f4JCNb815894@candle.pha.pa.us>
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
	message dated "Sat, 19 May 2001 08:23:37 -0400"
Date: Sat, 19 May 2001 11:13:11 -0400
Message-ID: <132.990285191@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> In fact, multi-query transactions are just a special case of
> subtransactions, where all previous subtransactions are
> committed/visible.  We could use the same pg_log-style memory area for
> multi-query transactions, eliminating the command counter  and saving 8
> bytes overhead per tuple.

Interesting thought, but command IDs don't act the same as transactions;
in particular, visibility of one scan to another doesn't necessarily
depend on whether the scan has finished.

Possibly that could be taken into account by having different rules for
"do we think it's committed" in the local pg_log than the global one.

Also, this distinction would propagate out of the xact status code;
for example, it wouldn't do for heapam to set the "known committed"
bit on a tuple just because it's from a previous subtransaction of the
current xact.  Right now that works because heapam knows the difference
between xacts and commands; it would still have to know the difference.

A much more significant objection is that such a design would eat xact
IDs at a tremendous rate, to no purpose.  CommandCounterIncrement is a
cheap operation now, and we do it with abandon.  It would not be cheap
if it implied allocating a new xact ID that would eventually need to be
marked committed.  I don't mind allocating a new xact ID for each
explicitly-created savepoint, but a new ID per CommandCounterIncrement
is a different story.

			regards, tom lane

From pgsql-hackers-owner+M9081@postgresql.org Sun May 20 02:45:24 2001
Return-path: <pgsql-hackers-owner+M9081@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4K6jNN16792
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 02:45:23 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4K6ihA56252;
	Sun, 20 May 2001 02:44:43 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9081@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4K6aaA53464
	for <pgsql-hackers@postgresql.org>; Sun, 20 May 2001 02:36:36 -0400 (EDT)
	(envelope-from vmikheev@sectorbase.com)
Received: (qmail 46626 invoked by uid 503); 20 May 2001 06:36:34 -0000
Received: from din6.sectorbase.com (HELO dune) (63.88.121.76)
  by gate1.sectorbase.com with SMTP; 20 May 2001 06:36:34 -0000
Message-ID: <002d01c0e0f7$376b59a0$4c79583f@sectorbase.com>
From: "Vadim Mikheev" <vmikheev@sectorbase.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
cc: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, <pgsql-hackers@postgresql.org>
References: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com> <27745.990236257@sss.pgh.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
Date: Sat, 19 May 2001 23:36:34 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="windows-1251"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Hm.  On the other hand, relying on WAL for undo means you cannot drop
> old WAL segments that contain records for any open transaction.  We've
> already seen several complaints that the WAL logs grow unmanageably huge
> when there is a long-running transaction, and I think we'll see a lot
> more.
>
> It would be nicer if we could drop WAL records after a checkpoint or two,
> even in the presence of long-running transactions.  We could do that if
> we were only relying on them for crash recovery and not for UNDO.

As you understand this is old, well-known problem in database practice,
described in books. Two ways - either abort too long running transactions
or (/and) compact old log segments: fetch and save (to use for undo)
records of long-running transactions and remove other records. Neither
way is perfect but nothing is perfect at all -:)

> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
> make lightweight VACUUM work well.  (I definitely don't like the idea

Sorry, but I'm going to consider background vacuum as temporary solution
only. As I've already pointed, original PG authors finally became
disillusioned with the same approach. What is good in using UNDO for 1.
is the fact that WAL records give you *direct* physical access to changes
which should be rolled back.

> that after a very long transaction fails and aborts, I'd have to wait
> another very long time for UNDO to do its thing before I could get on
> with my work.  Would much rather have the space reclamation happen in
> background.)

Understandable, but why other transactions should read dirty data again
and again waiting for background vacuum? I think aborted transaction
should take some responsibility for mess made by them -:)
And keeping in mind 2. very long transactions could be continued -:)

> 2. SAVEPOINTs would be awfully nice to have, I agree.
>
> 3. Reusing xact IDs would be nice, but there's an answer with a lot less
> impact on the system: go to 8-byte xact IDs.  Having to shut down the
> postmaster when you approach the 4Gb transaction mark isn't going to
> impress people who want a 24x7 commitment, anyway.

+8 bytes in tuple header is not so tiny thing.

> 4. Recycling pg_log would be nice too, but we've already discussed other
> hacks that might allow pg_log to be kept finite without depending on
> UNDO (or requiring postmaster restarts, IIRC).

We did... and didn't get agreement.

> I'm sort of thinking that undoing back to a savepoint is the only real
> usefulness of WAL-based UNDO. Is it practical to preserve the WAL log
> just back to the last savepoint in each xact, not the whole xact?

No, it's not. It's not possible in overwriting systems at all - all
transaction records are required.

> Another thought: do we need WAL UNDO at all to implement savepoints?
> Is there some way we could do them like nested transactions, wherein
> each savepoint-to-savepoint segment is given its own transaction number?
> Committing multiple xact IDs at once might be a little tricky, but it
> seems like a narrow, soluble problem.

Implicit savepoints wouldn't be possible - this is very convenient
feature I've found in Oracle.
And additional code in tqual.c wouldn't be good addition.

> Implementing UNDO without creating lots of performance issues looks
> a lot harder.

What *performance* issues?!
The only issue is additional disk requirements.

Vadim


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9088@postgresql.org Sun May 20 13:17:50 2001
Return-path: <pgsql-hackers-owner+M9088@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4KHHoN20556
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 13:17:50 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4KHHJA01746;
	Sun, 20 May 2001 13:17:19 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9088@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4KH9vA98828
	for <pgsql-hackers@postgresql.org>; Sun, 20 May 2001 13:09:57 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4KH9jR12006;
	Sun, 20 May 2001 13:09:46 -0400 (EDT)
To: "Vadim Mikheev" <vmikheev@sectorbase.com>
cc: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <002d01c0e0f7$376b59a0$4c79583f@sectorbase.com>
References: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com> <27745.990236257@sss.pgh.pa.us> <002d01c0e0f7$376b59a0$4c79583f@sectorbase.com>
Comments: In-reply-to "Vadim Mikheev" <vmikheev@sectorbase.com>
	message dated "Sat, 19 May 2001 23:36:34 -0700"
Date: Sun, 20 May 2001 13:09:45 -0400
Message-ID: <12003.990378585@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
>> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
>> make lightweight VACUUM work well.

> Sorry, but I'm going to consider background vacuum as temporary solution
> only. As I've already pointed, original PG authors finally became
> disillusioned with the same approach.

How could they become disillusioned with it, when they never tried it?
I know of no evidence that any version of PG has had backgroundable
(non-blocking-to-other-transactions) VACUUM, still less within-relation
space recycling.  They may have become disillusioned with the form of
VACUUM that they actually had (ie, the same one we've inherited) --- but
please don't call that "the same approach" I'm proposing.

Certainly, doing VACUUM this way is an experiment that may fail, or may
require further work before it really works well.  But I'd appreciate it
if you wouldn't prejudge the results of the experiment.

>> Would much rather have the space reclamation happen in
>> background.)

> Understandable, but why other transactions should read dirty data again
> and again waiting for background vacuum? I think aborted transaction
> should take some responsibility for mess made by them -:)

They might read it again and again before the failed xact gets around to
removing the data, too.  You cannot rely on UNDO for correctness; at
most it can be a speed/space optimization.  I see no reason to assume
that it's a more effective optimization than a background vacuum
process.

>> 3. Reusing xact IDs would be nice, but there's an answer with a lot less
>> impact on the system: go to 8-byte xact IDs.

> +8 bytes in tuple header is not so tiny thing.

Agreed, but the people who need 8-byte IDs are not running small
installations.  I think they'd sooner pay a little more in disk space
than risk costs in performance or reliability.

>> Another thought: do we need WAL UNDO at all to implement savepoints?
>> Is there some way we could do them like nested transactions, wherein
>> each savepoint-to-savepoint segment is given its own transaction number?

> Implicit savepoints wouldn't be possible - this is very convenient
> feature I've found in Oracle.

Why not?  Seems to me that establishing implicit savepoints is just a
user-interface issue; you can do it, or not do it, regardless of the
underlying mechanism.

>> Implementing UNDO without creating lots of performance issues looks
>> a lot harder.

> What *performance* issues?!
> The only issue is additional disk requirements.

Not so.  UNDO does failed-transaction cleanup work in the interactive
backends, where it necessarily delays clients who might otherwise be
issuing their next command.  A VACUUM-based approach does the cleanup
work in the background.  Same work, more or less, but it's not in the
clients' critical path.

BTW, UNDO for failed transactions alone will not eliminate the need for
VACUUM.  Will you also make successful transactions go back and
physically remove the tuples they deleted?

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From pgsql-hackers-owner+M9099@postgresql.org Sun May 20 17:09:01 2001
Return-path: <pgsql-hackers-owner+M9099@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4KL91N01322
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 17:09:01 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4KL7tA69488;
	Sun, 20 May 2001 17:07:55 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9099@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4KL0qA67639
	for <pgsql-hackers@postgresql.org>; Sun, 20 May 2001 17:00:52 -0400 (EDT)
	(envelope-from vmikheev@sectorbase.com)
Received: (qmail 28135 invoked by uid 503); 20 May 2001 21:00:49 -0000
Received: from din3.sectorbase.com (HELO dune) (63.88.121.73)
  by gate1.sectorbase.com with SMTP; 20 May 2001 21:00:49 -0000
Message-ID: <003701c0e16f$f3561ba0$4979583f@sectorbase.com>
From: "Vadim Mikheev" <vmikheev@sectorbase.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
cc: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, <pgsql-hackers@postgresql.org>
References: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com> <27745.990236257@sss.pgh.pa.us> <002d01c0e0f7$376b59a0$4c79583f@sectorbase.com> <12003.990378585@sss.pgh.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
Date: Sun, 20 May 2001 14:00:48 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="windows-1251"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> >> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
> >> make lightweight VACUUM work well.
>
> > Sorry, but I'm going to consider background vacuum as temporary solution
> > only. As I've already pointed, original PG authors finally became
> > disillusioned with the same approach.
>
> How could they become disillusioned with it, when they never tried it?
> I know of no evidence that any version of PG has had backgroundable
> (non-blocking-to-other-transactions) VACUUM, still less within-relation
> space recycling.  They may have become disillusioned with the form of
> VACUUM that they actually had (ie, the same one we've inherited) --- but
> please don't call that "the same approach" I'm proposing.

Pre-Postgres'95 (original) versions had vacuum daemon running in
background. I don't know if that vacuum shrinked relations or not
(there was no shrinking in '95 version), I know that daemon had to
do some extra work in moving old tuples to archival storage, but
anyway as you can read in old papers in the case of consistent heavy
load daemon was not able to cleanup storage fast enough. And the
reason is obvious - no matter how optimized your daemon will be
(in regard to blocking other transactions etc), it will have to
perform huge amount of IO just to find space available for reclaiming.

> Certainly, doing VACUUM this way is an experiment that may fail, or may
> require further work before it really works well. But I'd appreciate it
> if you wouldn't prejudge the results of the experiment.

Why not, Tom? Why shouldn't I say my opinion?
Last summer your comment about WAL, may experiment that time, was that
it will save just a few fsyncs. It was your right to make prejudment,
what's wrong with my rights? And you appealed to old papers as well, BTW.

> > Understandable, but why other transactions should read dirty data again
> > and again waiting for background vacuum? I think aborted transaction
> > should take some responsibility for mess made by them -:)
>
> They might read it again and again before the failed xact gets around to
> removing the data, too.  You cannot rely on UNDO for correctness; at
> most it can be a speed/space optimization. I see no reason to assume
> that it's a more effective optimization than a background vacuum
> process.

Really?! Once again: WAL records give you *physical* address of tuples
(both heap and index ones!) to be removed and size of log to read
records from is not comparable with size of data files.

> >> Another thought: do we need WAL UNDO at all to implement savepoints?
> >> Is there some way we could do them like nested transactions, wherein
> >> each savepoint-to-savepoint segment is given its own transaction number?
>
> > Implicit savepoints wouldn't be possible - this is very convenient
> > feature I've found in Oracle.
>
> Why not?  Seems to me that establishing implicit savepoints is just a
> user-interface issue; you can do it, or not do it, regardless of the
> underlying mechanism.

Implicit savepoints are setted by server automatically before each
query execution - you wouldn't use transaction IDs for this.

> >> Implementing UNDO without creating lots of performance issues looks
> >> a lot harder.
>
> > What *performance* issues?!
> > The only issue is additional disk requirements.
>
> Not so. UNDO does failed-transaction cleanup work in the interactive
> backends, where it necessarily delays clients who might otherwise be
> issuing their next command.  A VACUUM-based approach does the cleanup
> work in the background. Same work, more or less, but it's not in the
> clients' critical path.

Not same work but much more and in the critical pathes of all clients.
And - is overall performance of Oracle or Informix worse then in PG?
Seems delays in clients for rollback doesn't affect performance so much.
But dirty storage does it.

> BTW, UNDO for failed transactions alone will not eliminate the need for
> VACUUM.  Will you also make successful transactions go back and
> physically remove the tuples they deleted?

They can't do this, as you know pretty well. But using WAL to get TIDs to
be deleted is considerable, no?

Vadim


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

From vmikheev@sectorbase.com Sun May 20 17:13:42 2001
Return-path: <vmikheev@sectorbase.com>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4KLDfN01641
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 17:13:41 -0400 (EDT)
Received: (qmail 30876 invoked by uid 503); 20 May 2001 21:13:38 -0000
Received: from din3.sectorbase.com (HELO dune) (63.88.121.73)
  by gate1.sectorbase.com with SMTP; 20 May 2001 21:13:38 -0000
Message-ID: <003f01c0e171$bd1e2f80$4979583f@sectorbase.com>
From: "Vadim Mikheev" <vmikheev@sectorbase.com>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>, "Tom Lane" <tgl@sss.pgh.pa.us>
cc: <pgsql-hackers@postgresql.org>
References: <200105190357.f4J3v1h17419@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
Date: Sun, 20 May 2001 14:13:37 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Status: OR

> Were you going to use WAL to get free space from old copies too?

Considerable approach.

> Vadim, I think I am missing something.  You mentioned UNDO would be used
> for these cases and I don't understand the purpose of adding what would
> seem to be a pretty complex capability:

Yeh, we already won title of most advanced among simple databases, -:)
Yes, looking in list of IDs assigned to single transaction in tqual.c is much
easy to do than UNDO. As well as couple of fsyncs is easy than WAL.

> > 1. Reclaim space allocated by aborted transactions.
>
> Is there really a lot to be saved here vs. old tuples of committed
> transactions?

Are you able to protect COPY FROM from abort/crash?

Vadim


From pgsql-hackers-owner+M9103@postgresql.org Sun May 20 17:33:30 2001
Return-path: <pgsql-hackers-owner+M9103@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4KLXTN02284
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 17:33:29 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4KLX3A76360;
	Sun, 20 May 2001 17:33:03 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9103@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4KLPuA74582
	for <pgsql-hackers@postgresql.org>; Sun, 20 May 2001 17:25:56 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4KLPmR19773;
	Sun, 20 May 2001 17:25:48 -0400 (EDT)
To: "Vadim Mikheev" <vmikheev@sectorbase.com>
cc: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <003701c0e16f$f3561ba0$4979583f@sectorbase.com>
References: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com> <27745.990236257@sss.pgh.pa.us> <002d01c0e0f7$376b59a0$4c79583f@sectorbase.com> <12003.990378585@sss.pgh.pa.us> <003701c0e16f$f3561ba0$4979583f@sectorbase.com>
Comments: In-reply-to "Vadim Mikheev" <vmikheev@sectorbase.com>
	message dated "Sun, 20 May 2001 14:00:48 -0700"
Date: Sun, 20 May 2001 17:25:47 -0400
Message-ID: <19770.990393947@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> Really?! Once again: WAL records give you *physical* address of tuples
> (both heap and index ones!) to be removed and size of log to read
> records from is not comparable with size of data files.

You sure?  With our current approach of dumping data pages into the WAL
on first change since checkpoint (and doing so again after each
checkpoint) it's not too difficult to devise scenarios where the WAL log
is *larger* than the affected datafiles ... and can't be truncated until
someone commits.

The copied-data-page traffic is the worst problem with our current
WAL implementation.  I did some measurements last week on VACUUM of a
test table (the accounts table from a "pg_bench -s 10" setup, which
contains 1000000 rows; I updated 20000 rows and then vacuumed).  This
generated about 34400 8k blocks of WAL traffic, of which about 33300
represented copied pages and the other 1100 blocks were actual WAL
entries.  That's a pretty massive I/O overhead, considering the table
itself was under 20000 8k blocks.  It was also interesting to note that
a large fraction of the CPU time was spent calculating CRCs on the WAL
data.

Would it be possible to split the WAL traffic into two sets of files,
one for WAL log records proper and one for copied pages?  Seems like
we could recycle the pages after each checkpoint rather than hanging
onto them until the associated transactions commit.

>> Why not?  Seems to me that establishing implicit savepoints is just a
>> user-interface issue; you can do it, or not do it, regardless of the
>> underlying mechanism.

> Implicit savepoints are setted by server automatically before each
> query execution - you wouldn't use transaction IDs for this.

If the user asked you to, I don't see why not.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M9104@postgresql.org Sun May 20 17:37:15 2001
Return-path: <pgsql-hackers-owner+M9104@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4KLbFN02418
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 17:37:15 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4KLaZA77465;
	Sun, 20 May 2001 17:36:35 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9104@postgresql.org)
Received: from mobile.hub.org (SHW39-29.accesscable.net [24.138.39.29])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4KLTRA75314
	for <pgsql-hackers@postgresql.org>; Sun, 20 May 2001 17:29:27 -0400 (EDT)
	(envelope-from scrappy@hub.org)
Received: from localhost (scrappy@localhost)
	by mobile.hub.org (8.11.3/8.11.1) with ESMTP id f4KLT3640280;
	Sun, 20 May 2001 18:29:03 -0300 (ADT)
	(envelope-from scrappy@hub.org)
X-Authentication-Warning: mobile.hub.org: scrappy owned process doing -bs
Date: Sun, 20 May 2001 18:29:03 -0300 (ADT)
From: The Hermit Hacker <scrappy@hub.org>
To: Vadim Mikheev <vmikheev@sectorbase.com>
cc: Tom Lane <tgl@sss.pgh.pa.us>, "'Bruce Momjian'" <pgman@candle.pha.pa.us>,
   <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <003701c0e16f$f3561ba0$4979583f@sectorbase.com>
Message-ID: <Pine.BSF.4.33.0105201826150.3057-100000@mobile.hub.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

On Sun, 20 May 2001, Vadim Mikheev wrote:

> > >> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
> > >> make lightweight VACUUM work well.
> >
> > > Sorry, but I'm going to consider background vacuum as temporary solution
> > > only. As I've already pointed, original PG authors finally became
> > > disillusioned with the same approach.
> >
> > How could they become disillusioned with it, when they never tried it?
> > I know of no evidence that any version of PG has had backgroundable
> > (non-blocking-to-other-transactions) VACUUM, still less within-relation
> > space recycling.  They may have become disillusioned with the form of
> > VACUUM that they actually had (ie, the same one we've inherited) --- but
> > please don't call that "the same approach" I'm proposing.
>
> Pre-Postgres'95 (original) versions had vacuum daemon running in
> background. I don't know if that vacuum shrinked relations or not
> (there was no shrinking in '95 version), I know that daemon had to
> do some extra work in moving old tuples to archival storage, but
> anyway as you can read in old papers in the case of consistent heavy
> load daemon was not able to cleanup storage fast enough. And the
> reason is obvious - no matter how optimized your daemon will be
> (in regard to blocking other transactions etc), it will have to
> perform huge amount of IO just to find space available for reclaiming.
>
> > Certainly, doing VACUUM this way is an experiment that may fail, or may
> > require further work before it really works well. But I'd appreciate it
> > if you wouldn't prejudge the results of the experiment.
>
> Why not, Tom? Why shouldn't I say my opinion?
> Last summer your comment about WAL, may experiment that time, was that
> it will save just a few fsyncs. It was your right to make prejudment,
> what's wrong with my rights? And you appealed to old papers as well, BTW.

If its an "experiment", shouldn't it be done outside of the main source
tree, with adequate testing in a high load situation, with a patch
released to the community for further testing/comments, before it is added
to the source tree?  From reading Vadim's comment above (re:
pre-Postgres95), this daemonized approach would cause a high I/O load on
the server in a situation where there are *alot* of UPDATE/DELETEs
happening to the database, which should be easily recreatable, no?  Or,
Vadim, am I misundertanding?


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9105@postgresql.org Sun May 20 18:05:07 2001
Return-path: <pgsql-hackers-owner+M9105@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4KM57N03461
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 18:05:07 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4KM4cA84495;
	Sun, 20 May 2001 18:04:38 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9105@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4KLvQA82414
	for <pgsql-hackers@postgresql.org>; Sun, 20 May 2001 17:57:26 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4KLvFR19989;
	Sun, 20 May 2001 17:57:15 -0400 (EDT)
To: The Hermit Hacker <scrappy@hub.org>
cc: Vadim Mikheev <vmikheev@sectorbase.com>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <Pine.BSF.4.33.0105201826150.3057-100000@mobile.hub.org>
References: <Pine.BSF.4.33.0105201826150.3057-100000@mobile.hub.org>
Comments: In-reply-to The Hermit Hacker <scrappy@hub.org>
	message dated "Sun, 20 May 2001 18:29:03 -0300"
Date: Sun, 20 May 2001 17:57:15 -0400
Message-ID: <19986.990395835@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

The Hermit Hacker <scrappy@hub.org> writes:
> If its an "experiment", shouldn't it be done outside of the main source
> tree, with adequate testing in a high load situation, with a patch
> released to the community for further testing/comments, before it is added
> to the source tree?

Mebbe we should've handled WAL that way too ;-)

Seriously, I don't think that my proposed changes need be treated with
quite that much suspicion.  The only part that is really intrusive is
the shared-memory free-heap-space-management change.  But AFAICT that
will be a necessary component of *any* approach to getting rid of
VACUUM.  We've been arguing here, in essence, about whether a background
or on-line approach to finding free space will be more useful; but that
still leaves you with the question of what you do with the free space
after you've found it.  Without some kind of shared free space map,
there's not anything you can do except have the process that found the
space do tuple moving and file truncation --- ie, VACUUM.  So even if
I'm quite wrong about the effectiveness of a background VACUUM, the FSM
code will still be needed: an UNDO-style approach is also going to need
an FSM to do anything with the free space it finds.  It's equally clear
that the index AMs have to support index tuple deletion without
exclusive lock, or we'll still have blocking problems during free-space
cleanup, no matter what drives that cleanup.  The only part of what
I've proposed that might end up getting relegated to the scrap heap is
the "lazy vacuum" command itself, which will be a self-contained and
relatively small module (smaller than the present commands/vacuum.c,
for sure).

Besides which, Vadim has already said that he won't have time to do
anything about space reclamation before 7.2.  So even if background
vacuum does end up getting superseded by something better, we're going
to need it for a release or two ...

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From pgsql-hackers-owner+M9110@postgresql.org Sun May 20 22:41:52 2001
Return-path: <pgsql-hackers-owner+M9110@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4L2fqN14617
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 22:41:52 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4L2YXA49590;
	Sun, 20 May 2001 22:34:33 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9110@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4L2RAA47750
	for <pgsql-hackers@postgresql.org>; Sun, 20 May 2001 22:27:10 -0400 (EDT)
	(envelope-from vmikheev@sectorbase.com)
Received: (qmail 96761 invoked by uid 503); 21 May 2001 02:27:06 -0000
Received: from din2.sectorbase.com (HELO dune) (63.88.121.72)
  by gate1.sectorbase.com with SMTP; 21 May 2001 02:27:06 -0000
Message-ID: <002b01c0e19d$88af9fa0$4879583f@sectorbase.com>
From: "Vadim Mikheev" <vmikheev@sectorbase.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
cc: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, <pgsql-hackers@postgresql.org>
References: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com> <27745.990236257@sss.pgh.pa.us> <002d01c0e0f7$376b59a0$4c79583f@sectorbase.com> <12003.990378585@sss.pgh.pa.us> <003701c0e16f$f3561ba0$4979583f@sectorbase.com> <19770.990393947@sss.pgh.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
Date: Sun, 20 May 2001 19:27:07 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="windows-1251"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > Really?! Once again: WAL records give you *physical* address of tuples
> > (both heap and index ones!) to be removed and size of log to read
> > records from is not comparable with size of data files.
>
> You sure?  With our current approach of dumping data pages into the WAL
> on first change since checkpoint (and doing so again after each
> checkpoint) it's not too difficult to devise scenarios where the WAL log
> is *larger* than the affected datafiles ... and can't be truncated until
> someone commits.

Yes, but note mine "size of log to read records from" - each log record
has pointer to previous record made by same transaction: rollback must
not read entire log file to get all records of specific transaction.

> >> Why not?  Seems to me that establishing implicit savepoints is just a
> >> user-interface issue; you can do it, or not do it, regardless of the
> >> underlying mechanism.
>
> > Implicit savepoints are setted by server automatically before each
> > query execution - you wouldn't use transaction IDs for this.
>
> If the user asked you to, I don't see why not.

Example of one of implicit savepoint usage: skipping duplicate key insertion.
Using transaction IDs when someone want to insert a few thousand records?

Vadim


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From vmikheev@sectorbase.com Sun May 20 22:57:50 2001
Return-path: <vmikheev@sectorbase.com>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4L2voN15014
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 22:57:50 -0400 (EDT)
Received: (qmail 3400 invoked by uid 503); 21 May 2001 02:57:47 -0000
Received: from din2.sectorbase.com (HELO dune) (63.88.121.72)
  by gate1.sectorbase.com with SMTP; 21 May 2001 02:57:47 -0000
Message-ID: <004301c0e1a1$d1e7fd80$4879583f@sectorbase.com>
From: "Vadim Mikheev" <vmikheev@sectorbase.com>
To: "The Hermit Hacker" <scrappy@hub.org>
cc: "Tom Lane" <tgl@sss.pgh.pa.us>, "'Bruce Momjian'" <pgman@candle.pha.pa.us>,
   <pgsql-hackers@postgresql.org>
References: <Pine.BSF.4.33.0105201826150.3057-100000@mobile.hub.org>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
Date: Sun, 20 May 2001 19:57:48 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Status: OR

> If its an "experiment", shouldn't it be done outside of the main source
> tree, with adequate testing in a high load situation, with a patch
> released to the community for further testing/comments, before it is added
> to the source tree?  From reading Vadim's comment above (re:
> pre-Postgres95), this daemonized approach would cause a high I/O load on
> the server in a situation where there are *alot* of UPDATE/DELETEs
> happening to the database, which should be easily recreatable, no?  Or,
> Vadim, am I misundertanding?

It probably will not cause more IO than vacuum does right now.
But unfortunately it will not reduce that IO. Cleanup work will be spreaded
in time and users will not experience long lockouts but average impact
on overall system throughput will be same (or maybe higher).
My point is that we'll need in dynamic cleanup anyway and UNDO is
what should be implemented for dynamic cleanup of aborted changes.
Plus UNDO gives us natural implementation of savepoints and some
abilities in transaction IDs management, which we may use or not
(though, 4. - pg_log size management - is really good thing).

Vadim


From pgsql-hackers-owner+M9112@postgresql.org Sun May 20 23:14:28 2001
Return-path: <pgsql-hackers-owner+M9112@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4L3ESN15529
	for <pgman@candle.pha.pa.us>; Sun, 20 May 2001 23:14:28 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4L3DcA60509;
	Sun, 20 May 2001 23:13:38 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9112@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4L36IA58218
	for <pgsql-hackers@postgresql.org>; Sun, 20 May 2001 23:06:19 -0400 (EDT)
	(envelope-from vmikheev@sectorbase.com)
Received: (qmail 5139 invoked by uid 503); 21 May 2001 03:06:14 -0000
Received: from din2.sectorbase.com (HELO dune) (63.88.121.72)
  by gate1.sectorbase.com with SMTP; 21 May 2001 03:06:14 -0000
Message-ID: <004f01c0e1a3$0024bb60$4879583f@sectorbase.com>
From: "Vadim Mikheev" <vmikheev@sectorbase.com>
To: "The Hermit Hacker" <scrappy@hub.org>, "Tom Lane" <tgl@sss.pgh.pa.us>
cc: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, <pgsql-hackers@postgresql.org>
References: <Pine.BSF.4.33.0105201826150.3057-100000@mobile.hub.org> <19986.990395835@sss.pgh.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
Date: Sun, 20 May 2001 20:06:15 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="windows-1251"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Seriously, I don't think that my proposed changes need be treated with
> quite that much suspicion.  The only part that is really intrusive is

Agreed. I fight for UNDO, not against background vacuum -:)

> the shared-memory free-heap-space-management change.  But AFAICT that
> will be a necessary component of *any* approach to getting rid of
> VACUUM.  We've been arguing here, in essence, about whether a background
> or on-line approach to finding free space will be more useful; but that
> still leaves you with the question of what you do with the free space
> after you've found it.  Without some kind of shared free space map,
> there's not anything you can do except have the process that found the
> space do tuple moving and file truncation --- ie, VACUUM.  So even if
> I'm quite wrong about the effectiveness of a background VACUUM, the FSM
> code will still be needed: an UNDO-style approach is also going to need
> an FSM to do anything with the free space it finds.  It's equally clear

Unfortunately, I think that we'll need in on-disk FSM and that FSM is
actually the most complex thing to do in "space reclamation" project.

> Besides which, Vadim has already said that he won't have time to do
> anything about space reclamation before 7.2.  So even if background
> vacuum does end up getting superseded by something better, we're going
> to need it for a release or two ...

Yes.

Vadim


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M9113@postgresql.org Mon May 21 00:43:11 2001
Return-path: <pgsql-hackers-owner+M9113@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4L4hBN17985
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 00:43:11 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4L4gcA87748;
	Mon, 21 May 2001 00:42:38 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9113@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4L4XCA84569
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 00:33:12 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4L4WgR20824;
	Mon, 21 May 2001 00:32:42 -0400 (EDT)
To: "Vadim Mikheev" <vmikheev@sectorbase.com>
cc: "The Hermit Hacker" <scrappy@hub.org>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <004f01c0e1a3$0024bb60$4879583f@sectorbase.com>
References: <Pine.BSF.4.33.0105201826150.3057-100000@mobile.hub.org> <19986.990395835@sss.pgh.pa.us> <004f01c0e1a3$0024bb60$4879583f@sectorbase.com>
Comments: In-reply-to "Vadim Mikheev" <vmikheev@sectorbase.com>
	message dated "Sun, 20 May 2001 20:06:15 -0700"
Date: Mon, 21 May 2001 00:32:42 -0400
Message-ID: <20821.990419562@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> Unfortunately, I think that we'll need in on-disk FSM and that FSM is
> actually the most complex thing to do in "space reclamation" project.

I hope we can avoid on-disk FSM.  Seems to me that that would create
problems both for performance (lots of extra disk I/O) and reliability
(what happens if FSM is corrupted?  A restart won't fix it).

But, if we do need it, most of the work needed to install FSM APIs
should carry over.  So I still don't see an objection to doing
in-memory FSM as a first step.


BTW, I was digging through the old Postgres papers this afternoon,
to refresh my memory about what they actually said about VACUUM.
I was interested to discover that at one time the tuple-insertion
algorithm went as follows:
  1. Pick a page at random in the relation, read it in, and see if it
     has enough free space.  Repeat up to three times.
  2. If #1 fails to find space, append tuple at end.
When they got around to doing some performance measurement, they
discovered that step #1 was a serious loser, and dropped it in favor
of pure #2 (which is what we still have today).  Food for thought.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From tgl@sss.pgh.pa.us Mon May 21 13:38:41 2001
Return-path: <tgl@sss.pgh.pa.us>
Received: from west.navpoint.com (root@west.navpoint.com [207.106.42.13])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LHceQ02927
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 13:38:40 -0400 (EDT)
Received: from sss.pgh.pa.us (tgl@sss.pgh.pa.us [216.151.103.158])
	by west.navpoint.com (8.11.3/8.10.1) with ESMTP id f4LE7uv21524
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 10:07:56 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4LE6aR24899;
	Mon, 21 May 2001 10:06:36 -0400 (EDT)
To: "Vadim Mikheev" <vmikheev@SECTORBASE.COM>
cc: "The Hermit Hacker" <scrappy@hub.org>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <004301c0e1a1$d1e7fd80$4879583f@sectorbase.com>
References: <Pine.BSF.4.33.0105201826150.3057-100000@mobile.hub.org> <004301c0e1a1$d1e7fd80$4879583f@sectorbase.com>
Comments: In-reply-to "Vadim Mikheev" <vmikheev@sectorbase.com>
	message dated "Sun, 20 May 2001 19:57:48 -0700"
Date: Mon, 21 May 2001 10:06:35 -0400
Message-ID: <24896.990453995@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR

"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> It probably will not cause more IO than vacuum does right now.
> But unfortunately it will not reduce that IO.

Uh ... what?  Certainly it will reduce the total cost of vacuum,
because it won't bother to try to move tuples to fill holes.
The index cleanup method I've proposed should be substantially
more efficient than the existing code, as well.

> My point is that we'll need in dynamic cleanup anyway and UNDO is
> what should be implemented for dynamic cleanup of aborted changes.

UNDO might offer some other benefits, but I doubt that it will allow
us to eliminate VACUUM completely.  To do that, you would need to
keep track of free space using exact, persistent (on-disk) bookkeeping
data structures.  The overhead of that will be very substantial: more,
I predict, than the approximate approach I proposed.

			regards, tom lane

From pgsql-hackers-owner+M9138@postgresql.org Mon May 21 14:27:34 2001
Return-path: <pgsql-hackers-owner+M9138@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LIRXQ09276
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 14:27:33 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LIL7A94773;
	Mon, 21 May 2001 14:21:07 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9138@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LGWGA38768
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 12:32:16 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 88093 invoked by uid 503); 21 May 2001 16:32:15 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 21 May 2001 16:32:15 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC5Q2>; Mon, 21 May 2001 09:31:16 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016630@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   Tom Lane
  <tgl@sss.pgh.pa.us>
cc: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 09:31:15 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="windows-1251"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > Really?! Once again: WAL records give you *physical*
> > address of tuples (both heap and index ones!) to be
> > removed and size of log to read records from is not
> > comparable with size of data files.
>
> So how about a background "vacuum like" process, that reads
> the WAL and does the cleanup ? Seems that would be great,
> since it then does not need to scan, and does not make
> forground cleanup necessary.
>
> Problem is when cleanup can not keep up with cleaning WAL
> files, that already want to be removed. I would envision a
> config, that sais how many Mb of WAL are allowed to queue
> up before clients are blocked.

Yes, some daemon could read logs and gather cleanup info.
We could activate it when switching to new log file segment
and synchronization with checkpointer is not big deal. That
daemon would also archive log files for WAL-based BAR,
if archiving is ON.

But this will be useful only with on-disk FSM.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From vmikheev@SECTORBASE.COM Mon May 21 13:36:13 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from west.navpoint.com (root@west.navpoint.com [207.106.42.13])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LHaDQ01995
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 13:36:13 -0400 (EDT)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by west.navpoint.com (8.11.3/8.10.1) with SMTP id f4LGtfv12633
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 12:55:41 -0400 (EDT)
Received: (qmail 92843 invoked by uid 503); 21 May 2001 16:54:35 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 21 May 2001 16:54:35 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC5T2>; Mon, 21 May 2001 09:53:36 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016631@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>
cc: The Hermit Hacker <scrappy@hub.org>,
   "'Bruce Momjian'"
  <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 09:53:35 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR

> > It probably will not cause more IO than vacuum does right now.
> > But unfortunately it will not reduce that IO.
>
> Uh ... what?  Certainly it will reduce the total cost of vacuum,
> because it won't bother to try to move tuples to fill holes.

Oh, you're right here, but daemon will most likely read data files
again and again with in-memory FSM. Also, if we'll do partial table
scans then we'll probably re-read indices > 1 time.

> The index cleanup method I've proposed should be substantially
> more efficient than the existing code, as well.

Not in IO area.

> > My point is that we'll need in dynamic cleanup anyway and UNDO is
> > what should be implemented for dynamic cleanup of aborted changes.
>
> UNDO might offer some other benefits, but I doubt that it will allow
> us to eliminate VACUUM completely.  To do that, you would need to

I never told this -:)

> keep track of free space using exact, persistent (on-disk) bookkeeping
> data structures.  The overhead of that will be very substantial: more,
> I predict, than the approximate approach I proposed.

I doubt that "big guys" use in-memory FSM. If they were able to do this...

Vadim

From pgsql-hackers-owner+M9136@postgresql.org Mon May 21 14:25:59 2001
Return-path: <pgsql-hackers-owner+M9136@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LIPxQ09204
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 14:25:59 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LIFbA91850;
	Mon, 21 May 2001 14:15:37 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9136@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LGueA52482
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 12:56:40 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 93265 invoked by uid 503); 21 May 2001 16:56:39 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 21 May 2001 16:56:39 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC5TT>; Mon, 21 May 2001 09:55:40 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016632@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>
cc: The Hermit Hacker <scrappy@hub.org>,
   "'Bruce Momjian'"
  <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 09:55:40 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> I hope we can avoid on-disk FSM.  Seems to me that that would create
> problems both for performance (lots of extra disk I/O) and reliability
> (what happens if FSM is corrupted?  A restart won't fix it).

We can use WAL for FSM.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From janwieck@Yahoo.com Mon May 21 13:36:06 2001
Return-path: <janwieck@Yahoo.com>
Received: from west.navpoint.com (root@west.navpoint.com [207.106.42.13])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LHa6Q01945
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 13:36:06 -0400 (EDT)
Received: from smtp014.mail.yahoo.com (smtp014.mail.yahoo.com [216.136.173.58])
	by west.navpoint.com (8.11.3/8.10.1) with SMTP id f4LHBuv17383
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 13:11:56 -0400 (EDT)
Received: from jupiter.us.greatbridge.com (HELO jupiter.jw.home) (65.196.69.55)
  by smtp.mail.vip.sc5.yahoo.com with SMTP; 21 May 2001 17:10:55 -0000
X-Apparently-From: <janwieck@yahoo.com>
Received: (from janwieck@localhost)
	by jupiter.jw.home (8.9.3/8.9.3) id NAA14283;
	Mon, 21 May 2001 13:13:55 -0400
From: Jan Wieck <JanWieck@Yahoo.com>
Message-ID: <200105211713.NAA14283@jupiter.jw.home>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <20821.990419562@sss.pgh.pa.us> from Tom Lane at "May 21, 2001 00:32:42
	am"
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 21 May 2001 13:13:55 -0400 (EDT)
cc: Vadim Mikheev <vmikheev@SECTORBASE.COM>,
   The Hermit Hacker <scrappy@hub.org>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>,
   pgsql-hackers@postgresql.orgrg.us.greatbridge.com
X-Mailer: ELM [version 2.4ME+ PL68 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: OR

Tom Lane wrote:
> "Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> > Unfortunately, I think that we'll need in on-disk FSM and that FSM is
> > actually the most complex thing to do in "space reclamation" project.
>
> I hope we can avoid on-disk FSM.  Seems to me that that would create
> problems both for performance (lots of extra disk I/O) and reliability
> (what happens if FSM is corrupted?  A restart won't fix it).
>
> But, if we do need it, most of the work needed to install FSM APIs
> should carry over.  So I still don't see an objection to doing
> in-memory FSM as a first step.
>
>
> BTW, I was digging through the old Postgres papers this afternoon,
> to refresh my memory about what they actually said about VACUUM.
> I was interested to discover that at one time the tuple-insertion
> algorithm went as follows:
>   1. Pick a page at random in the relation, read it in, and see if it
>      has enough free space.  Repeat up to three times.
>   2. If #1 fails to find space, append tuple at end.
> When they got around to doing some performance measurement, they
> discovered that step #1 was a serious loser, and dropped it in favor
> of pure #2 (which is what we still have today).  Food for thought.

    No surprise to me, because without removing dead tuples (plus
    their index  entries)  and  compacting  pages,  there's  VERY
    unlikely  freespace  on  a  randomly selected page. And AFAIR
    these steps haven't been done by those versions.

    I think the in-shared-mem FSM could have  some  max-per-table
    limit  and  the background VACUUM just skips the entire table
    as long as nobody  reused  any  space.  Also  it  might  only
    compact pages that lead to 25 or more percent of freespace in
    the first place. That makes it more likely  that  if  someone
    looks  for  a place to store a tuple that it'll fit into that
    block (remember that the toaster tries to  keep  main  tuples
    below BLKSZ/4).


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


From vmikheev@SECTORBASE.COM Mon May 21 13:36:05 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from west.navpoint.com (root@west.navpoint.com [207.106.42.13])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LHa5Q01929
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 13:36:05 -0400 (EDT)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by west.navpoint.com (8.11.3/8.10.1) with SMTP id f4LHNtv21125
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 13:23:55 -0400 (EDT)
Received: (qmail 3362 invoked by uid 503); 21 May 2001 17:23:50 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 21 May 2001 17:23:50 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC5WP>; Mon, 21 May 2001 10:22:50 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016634@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Jan Wieck'" <JanWieck@Yahoo.com>, Tom Lane <tgl@sss.pgh.pa.us>
cc: The Hermit Hacker <scrappy@hub.org>,
   "'Bruce Momjian'"
  <pgman@candle.pha.pa.us>,
   pgsql-hackers@postgresql.orgrg.us.greatbridge.com
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 10:22:49 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR

>     I think the in-shared-mem FSM could have  some  max-per-table
>     limit  and  the background VACUUM just skips the entire table
>     as long as nobody  reused  any  space.  Also  it  might  only
>     compact pages that lead to 25 or more percent of freespace in
>     the first place. That makes it more likely  that  if  someone
>     looks  for  a place to store a tuple that it'll fit into that
>     block (remember that the toaster tries to  keep  main  tuples
>     below BLKSZ/4).

This should be configurable parameter like PCFREE (or something
like that) in Oracle: consider page for insertion only if it's
PCFREE % empty.

Vadim

From pgsql-hackers-owner+M9142@postgresql.org Mon May 21 16:02:27 2001
Return-path: <pgsql-hackers-owner+M9142@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LK2QQ21180
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 16:02:26 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LItoA12128;
	Mon, 21 May 2001 14:55:50 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9142@postgresql.org)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [216.151.103.158])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4LHN2A66016
	for <pgsql-hackers@postgreSQL.org>; Mon, 21 May 2001 13:23:02 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4LHMrR29288;
	Mon, 21 May 2001 13:22:53 -0400 (EDT)
To: Jan Wieck <JanWieck@yahoo.com>
cc: Vadim Mikheev <vmikheev@sectorbase.com>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <200105211713.NAA14283@jupiter.jw.home>
References: <200105211713.NAA14283@jupiter.jw.home>
Comments: In-reply-to Jan Wieck <JanWieck@yahoo.com>
	message dated "Mon, 21 May 2001 13:13:55 -0400"
Date: Mon, 21 May 2001 13:22:53 -0400
Message-ID: <29285.990465773@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

Jan Wieck <JanWieck@yahoo.com> writes:
>     I think the in-shared-mem FSM could have  some  max-per-table
>     limit  and  the background VACUUM just skips the entire table
>     as long as nobody  reused  any  space.

I was toying with the notion of trying to use Vadim's "MNMB" idea
(see his description of the work he did for Perlstein last year);
that is, keep track of the lowest block number of any modified block
within each relation since the last VACUUM.  Then VACUUM would only
have to scan from there to the end.  This covers the totally-untouched-
relation case nicely, and also helps a lot for large rels that you're
mostly just adding to or perhaps updating recent additions.

The FSM could probably keep track of such info fairly easily, since
it will already be aware of which blocks it's told backends to try
to insert into.  But it would have to be told about deletes too,
which would mean more FSM access traffic and more lock contention.
Another problem (given my current view of how FSM should work) is that
rels not being used at all would not be in FSM, or would age out of it,
and so you wouldn't know that you didn't need to vacuum them.
So I'm not sure yet if it's a good idea.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M9141@postgresql.org Mon May 21 14:55:40 2001
Return-path: <pgsql-hackers-owner+M9141@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LItdQ12873
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 14:55:39 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LIrgA10844;
	Mon, 21 May 2001 14:53:42 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9141@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LHrUA80109
	for <pgsql-hackers@postgreSQL.org>; Mon, 21 May 2001 13:53:30 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 9854 invoked by uid 503); 21 May 2001 17:53:29 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 21 May 2001 17:53:29 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC5Z0>; Mon, 21 May 2001 10:52:29 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016636@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>
cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 10:52:28 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > We could keep share buffer lock (or add some other kind of lock)
> > untill tuple projected - after projection we need not to read data
> > for fetched tuple from shared buffer and time between fetching
> > tuple and projection is very short, so keeping lock on buffer will
> > not impact concurrency significantly.
>
> Or drop the pin on the buffer to show we no longer have a pointer to it.

This is not good for seqscans which will return to that buffer anyway.

> > Or we could register callback cleanup function with buffer so bufmgr
> > would call it when refcnt drops to 0.
>
> Hmm ... might work.  There's no guarantee that the refcnt
> would drop to zero before the current backend exits, however.
> Perhaps set a flag in the shared buffer header, and the last guy
> to drop his pin is supposed to do the cleanup?

This is what I've meant - set (register) some pointer in buffer header
to cleanup function.

> But then you'd be pushing VACUUM's work into productive transactions,
> which is probably not the way to go.

Not big work - I wouldn't worry about it.

> > Two ways: hold index page lock untill heap tuple is checked
> > or (rough schema) store info in shmem (just IndexTupleData.t_tid
> > and flag) that an index tuple is used by some scan so cleaner could
> > change stored TID (get one from prev index tuple) and set flag to
> > help scan restore its current position on return.
>
> Another way is to mark the index tuple "gone but not forgotten", so to
> speak --- mark it dead without removing it. (We could know that we need
> to do that if we see someone else has a buffer pin on the index page.)

Register cleanup function just like with heap above.

> None of these seem real clean though.  Needs more thought.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

From vmikheev@SECTORBASE.COM Mon May 21 14:02:52 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from west.navpoint.com (root@west.navpoint.com [207.106.42.13])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LI2pQ07495
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 14:02:51 -0400 (EDT)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by west.navpoint.com (8.11.3/8.10.1) with SMTP id f4LI2ov03000
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 14:02:50 -0400 (EDT)
Received: (qmail 11793 invoked by uid 503); 21 May 2001 18:02:45 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 21 May 2001 18:02:45 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC555>; Mon, 21 May 2001 11:01:45 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016637@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>
cc: Tom Lane <tgl@sss.pgh.pa.us>, "'Bruce Momjian'" <pgman@candle.pha.pa.us>,
   pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 11:01:45 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: ORr

> > My point is that we'll need in dynamic cleanup anyway and UNDO is
> > what should be implemented for dynamic cleanup of aborted changes.
>
> I do not yet understand why you want to handle aborts different than
> outdated tuples.

Maybe because of aborted tuples have shorter Time-To-Live.
And probability to find pages for them in buffer pool is higher.

> The ratio in a well tuned system should well favor outdated tuples.
> If someone ever adds "dirty read" it is also not the case that it
> is guaranteed, that nobody accesses the tuple you currently want
> to undo. So I really miss to see the big difference.

It will not be guaranteed anyway as soon as we start removing
tuples without exclusive access to relation.

And, I cannot say that I would implement UNDO because of
1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
but because of ALL of 1., 2., 4.

Vadim

From pgsql-hackers-owner+M9152@postgresql.org Mon May 21 16:18:57 2001
Return-path: <pgsql-hackers-owner+M9152@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LKIvQ21799
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 16:18:57 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LKBqA45520;
	Mon, 21 May 2001 16:11:52 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9152@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LK2VA42041
	for <pgsql-hackers@postgreSQL.org>; Mon, 21 May 2001 16:02:31 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 59448 invoked by uid 503); 21 May 2001 20:02:29 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 21 May 2001 20:02:29 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC6HP>; Mon, 21 May 2001 13:01:29 -0700
Message-ID: <3705826352029646A3E91C53F7189E3201663A@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>
cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 13:01:29 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > We could keep share buffer lock (or add some other kind of lock)
> > untill tuple projected - after projection we need not to read data
> > for fetched tuple from shared buffer and time between fetching
> > tuple and projection is very short, so keeping lock on buffer will
> > not impact concurrency significantly.
>
> Or drop the pin on the buffer to show we no longer have a pointer
> to it. I'm not sure that the time to do projection is short though
> --- what if there are arbitrary user-defined functions in the quals
> or the projection targetlist?

Well, while we are on this subject I finally should say about issue
bothered me for long time: only "simple" functions should be allowed
to deal with data in shared buffers directly. "Simple" means: no SQL
queries there. Why? One reason: we hold shlock on buffer while doing
seqscan qual - what if qual' SQL queries will try to acquire exclock
on the same buffer? Another reason - concurrency. I think that such
"heavy" functions should be provided with copy of data.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9160@postgresql.org Mon May 21 20:38:44 2001
Return-path: <pgsql-hackers-owner+M9160@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4M0chQ18986
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 20:38:43 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4M0OoA26618;
	Mon, 21 May 2001 20:24:50 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9160@postgresql.org)
Received: from ns.sharemation.com (h-64-105-36-191.snvacaid.covad.net [64.105.36.191])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4M0JcA24979
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 20:19:38 -0400 (EDT)
	(envelope-from barry@xythos.com)
Received: from xythos.com ([192.168.254.19])
	by ns.sharemation.com (8.9.3/8.8.7) with ESMTP id QAA03121
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 16:04:21 -0700
Message-ID: <3B09B04A.5060806@xythos.com>
Date: Mon, 21 May 2001 17:18:18 -0700
From: Barry Lind <barry@xythos.com>
User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.16-22 i686; en-US; m18) Gecko/20010131 Netscape6/6.01
X-Accept-Language: en
MIME-Version: 1.0
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <3705826352029646A3E91C53F7189E3201662E@sectorbase2.sectorbase.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


Mikheev, Vadim wrote:

>
> Ok, last reminder -:))
>
> On transaction abort, read WAL records and undo (rollback)
> changes made in storage. Would allow:
>
> 1. Reclaim space allocated by aborted transactions.
> 2. Implement SAVEPOINTs.
>    Just to remind -:) - in the event of error discovered by server
>    - duplicate key, deadlock, command mistyping, etc, - transaction
>    will be rolled back to the nearest implicit savepoint setted
>    just before query execution; - or transaction can be aborted by
>    ROLLBACK TO <savepoint_name> command to some explicit savepoint
>    setted by user. Transaction rolled back to savepoint may be continued.
> 3. Reuse transaction IDs on postmaster restart.
> 4. Split pg_log into small files with ability to remove old ones (which
>    do not hold statuses for any running transactions).
>
> Vadim

This is probably not a good thread to add my two cents worth, but here
goes anyway.

The biggest issue I see with the proposed UNDO using WAL is the issue of
large/long lasting transactions.  While it might be possible to solve
this problem with some extra work.  However keep in mind that different
types of transactions (i.e. normal vs bulk loads) require different
amounts of time and/or UNDO.  To solve this problem, you really need per
transaction limits which seems difficult to implement.

I have no doubt that UNDO with WAL can be done.  But is there some other
way of doing UNDO that might be just as good or better?

Part of what I see in this thread reading between the lines is that some
believe the solution to many problems in the long term is to implement
an overwriting storage manager.  Implementing UNDO via WAL is a
necessary step in that direction.  While others seem to believe that the
non-overwriting storage manager has some life in it yet, and may even be
the storage manager for many releases to come.  I don't know enough
about the internals to have any say in that discussion, however the
grass isn't always greener on the other side of the fence (i.e. an
overwriting storage manager will come with its own set of problems/issues).

So let me throw out an idea for UNDO using the current storage manager.
First let me state that UNDO is a bit of a misnomer, since undo for
transactions is already implemented.  That is what pg_log is all about.
The part of UNDO that is missing is savepoints (either explicit or
implicit), because pg_log doesn't capture the information for each
command in a transaction.  So the question really becomes, how to
implement savepoints with current storage manager?

I am going to lay out one assumption that I am making:
1) Most transactions are either completely successful or completely
rolled back
  (If this weren't true, i.e. you really needed savepoints to partially
rollback changes, you couldn't be using the existing version of
postgresql at all)

My proposal is:
  1) create a new relation to store 'failed commands' for transactions.
   This is similar to pg_log for transactions, but takes it to the
command level.  And since it records only failed commands (or ranges of
failed commands), thus most transactions will not have any information
in this relation per the assumption above.
   2) Use the unused pg_log status (3 = unused, 2 = commit, 1 = abort, 0
= inprocess) to mean that the transaction was commited but some commands
were rolled back (i.e. partial commit)
   Again for the majority of transactions nothing will need to change,
since they will still be marked as committed or aborted.
   3) Code that determines whether or not a tuple is committed or not
needs to be aware of this new pg_log status, and look in the new
relation to see if the particular command was rolled back or not to
determine the commited status of the tuple.  This subtly changes the
meaning of HEAP_XMIN_COMMITTED and related flags to reflect the
transaction and command status instead of just the transaction status.

The runtime cost of this shouldn't be too high, since the committed
state is cached in HEAP_XMIN_COMMITTED et al, it is only the added cost
for the pass that needs to set these flags, and then there is only added
cost in the case that the transaction wasn't completely sucessful (again
my assumption above).

Now I have know idea if what I am proposing is really doable or not.  I
am just throwing this out as an alternative to WAL based
UNDO/savepoints.  The reason I am doing this is that to me it seems to
leverage much of the existing infrastructure already in place that
performs undo for rolledback transactions (all the tmin, tmax, cmin,
cmax stuff as well as vacuum).  Also it doesn't come with the large WAL
log file problem for large transactions.

Now having said all of this I realize that this doesn't solve the 4
billion transaction id limit problem, or the large size of the pg_log
file with large numbers of transactions.

thanks,
--Barry


>


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M9208@postgresql.org Tue May 22 14:02:04 2001
Return-path: <pgsql-hackers-owner+M9208@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4MI23Q13398
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 14:02:03 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4MI1OA96629;
	Tue, 22 May 2001 14:01:24 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9208@postgresql.org)
Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4MHY9A84049
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 13:34:09 -0400 (EDT)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4MHWef08905;
	Tue, 22 May 2001 13:32:40 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200105221732.f4MHWef08905@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <3705826352029646A3E91C53F7189E32016637@sectorbase2.sectorbase.com>
	"from Mikheev, Vadim at May 21, 2001 11:01:45 am"
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
Date: Tue, 22 May 2001 13:32:40 -0400 (EDT)
cc: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

[ Charset ISO-8859-1 unsupported, converting... ]
> > > My point is that we'll need in dynamic cleanup anyway and UNDO is
> > > what should be implemented for dynamic cleanup of aborted changes.
> >
> > I do not yet understand why you want to handle aborts different than
> > outdated tuples.
>
> Maybe because of aborted tuples have shorter Time-To-Live.
> And probability to find pages for them in buffer pool is higher.

This brings up an idea I had about auto-vacuum.  I wonder if autovacuum
could do most of its work by looking at the buffer cache pages and
commit xids.  Seems it would be quite easy record freespace in pages
already in the buffer and collect that information for other backends to
use.  It could also move tuples between cache pages with little
overhead.

There wouldn't be an I/O overhead, and frequently used tables are
already in the cache.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From pgsql-hackers-owner+M9209@postgresql.org Tue May 22 14:42:27 2001
Return-path: <pgsql-hackers-owner+M9209@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4MIgRQ16770
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 14:42:27 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4MIfcA15426;
	Tue, 22 May 2001 14:41:38 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9209@postgresql.org)
Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4MHmdA90489
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 13:48:39 -0400 (EDT)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4MHlF409586;
	Tue, 22 May 2001 13:47:15 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200105221747.f4MHlF409586@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <3705826352029646A3E91C53F7189E32016637@sectorbase2.sectorbase.com>
	"from Mikheev, Vadim at May 21, 2001 11:01:45 am"
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
Date: Tue, 22 May 2001 13:47:15 -0400 (EDT)
cc: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > The ratio in a well tuned system should well favor outdated tuples.
> > If someone ever adds "dirty read" it is also not the case that it
> > is guaranteed, that nobody accesses the tuple you currently want
> > to undo. So I really miss to see the big difference.
>
> It will not be guaranteed anyway as soon as we start removing
> tuples without exclusive access to relation.
>
> And, I cannot say that I would implement UNDO because of
> 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> but because of ALL of 1., 2., 4.

OK, I understand your reasoning here, but I want to make a comment.

Looking at the previous features you added, like subqueries, MVCC, or
WAL, these were major features that greatly enhanced the system's
capabilities.

Now, looking at UNDO, I just don't see it in the same league as those
other additions.  Of course, you can work on whatever you want, but I
was hoping to see another major feature addition for 7.2.  We know we
badly need auto-vacuum, improved replication, and point-in-time recover.

I can see UNDO improving row reuse, and making subtransactions and
pg_log compression easier, but these items do not require UNDO.

In fact, I am unsure why we would want an UNDO way of reusing rows of
aborted transactions and an autovacuum way of reusing rows from
committed transactions, expecially because aborted transactions account
for <5% of all transactions.  It would be better to put work into one
mechanism that would reuse all tuples.

If UNDO came with no limitations, it may be a good option, but the need
to carry tuples until transaction commit does add an extra burden on
programmers and administrators, and I just don't see what we are getting
for it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9227@postgresql.org Tue May 22 18:03:59 2001
Return-path: <pgsql-hackers-owner+M9227@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4MM3xQ20269
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 18:03:59 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4MM3PA02618;
	Tue, 22 May 2001 18:03:25 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9227@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4MLZAA93186
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 17:35:10 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 96775 invoked by uid 503); 22 May 2001 21:34:46 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 22 May 2001 21:34:46 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC0J8>; Tue, 22 May 2001 14:33:43 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016648@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Bruce Momjian'" <pgman@candle.pha.pa.us>
cc: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 22 May 2001 14:33:38 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > And, I cannot say that I would implement UNDO because of
> > 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> > but because of ALL of 1., 2., 4.
>
> OK, I understand your reasoning here, but I want to make a comment.
>
> Looking at the previous features you added, like subqueries, MVCC, or
> WAL, these were major features that greatly enhanced the system's
> capabilities.
>
> Now, looking at UNDO, I just don't see it in the same league as those
> other additions.  Of course, you can work on whatever you want, but I
> was hoping to see another major feature addition for 7.2.  We know we
> badly need auto-vacuum, improved replication, and point-in-time recover.

I don't like auto-vacuum approach in long term, WAL-based BAR is too easy
to do -:) (and you know that there is man who will do it, probably),
bidirectional sync replication is good to work on, but I'm more
interested in storage/transaction management now. And I'm not sure
if I'll have enough time for "another major feature in 7.2" anyway.

> It would be better to put work into one mechanism that would
> reuse all tuples.

This is what we're discussing now -:)
If community will not like UNDO then I'll probably try to implement
dead space collector which will read log files and so on. Easy to
#ifdef it in 7.2 to use in 7.3 (or so) with on-disk FSM. Also, I have
to implement logging for non-btree indices (anyway required for UNDO,
WAL-based BAR, WAL-based space reusing).

Vadim

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From vmikheev@SECTORBASE.COM Tue May 22 17:34:52 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4MLYpQ17853
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 17:34:51 -0400 (EDT)
Received: (qmail 96775 invoked by uid 503); 22 May 2001 21:34:46 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 22 May 2001 21:34:46 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC0J8>; Tue, 22 May 2001 14:33:43 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016648@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Bruce Momjian'" <pgman@candle.pha.pa.us>
cc: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 22 May 2001 14:33:38 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR

> > And, I cannot say that I would implement UNDO because of
> > 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> > but because of ALL of 1., 2., 4.
>
> OK, I understand your reasoning here, but I want to make a comment.
>
> Looking at the previous features you added, like subqueries, MVCC, or
> WAL, these were major features that greatly enhanced the system's
> capabilities.
>
> Now, looking at UNDO, I just don't see it in the same league as those
> other additions.  Of course, you can work on whatever you want, but I
> was hoping to see another major feature addition for 7.2.  We know we
> badly need auto-vacuum, improved replication, and point-in-time recover.

I don't like auto-vacuum approach in long term, WAL-based BAR is too easy
to do -:) (and you know that there is man who will do it, probably),
bidirectional sync replication is good to work on, but I'm more
interested in storage/transaction management now. And I'm not sure
if I'll have enough time for "another major feature in 7.2" anyway.

> It would be better to put work into one mechanism that would
> reuse all tuples.

This is what we're discussing now -:)
If community will not like UNDO then I'll probably try to implement
dead space collector which will read log files and so on. Easy to
#ifdef it in 7.2 to use in 7.3 (or so) with on-disk FSM. Also, I have
to implement logging for non-btree indices (anyway required for UNDO,
WAL-based BAR, WAL-based space reusing).

Vadim

From Inoue@tpf.co.jp Tue May 22 20:49:08 2001
Return-path: <Inoue@tpf.co.jp>
Received: from sd.tpf.co.jp (IDENT:qmailr@sd.tpf.co.jp [210.161.239.34])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4N0n6Q16869
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 20:49:06 -0400 (EDT)
Received: (qmail 1210 invoked from network); 23 May 2001 00:49:05 -0000
Received: from unknown (HELO viscomail.tpf.co.jp) (100.0.0.108)
  by sd2.10.0.100.in-addr.arpa with SMTP; 23 May 2001 00:49:05 -0000
Received: from tpf.co.jp (3d_note1 [126.0.1.61])
	by viscomail.tpf.co.jp (8.8.8+Sun/8.8.8) with ESMTP id JAA10910;
	Wed, 23 May 2001 09:48:57 +0900 (JST)
Message-ID: <3B0B091D.A5AF412E@tpf.co.jp>
Date: Wed, 23 May 2001 09:49:33 +0900
From: Hiroshi Inoue <Inoue@tpf.co.jp>
X-Mailer: Mozilla 4.73 [ja] (Windows NT 5.0; U)
X-Accept-Language: ja
MIME-Version: 1.0
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <200105221747.f4MHlF409586@candle.pha.pa.us>
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
Status: ORr

Bruce Momjian wrote:
>
> > > The ratio in a well tuned system should well favor outdated tuples.
> > > If someone ever adds "dirty read" it is also not the case that it
> > > is guaranteed, that nobody accesses the tuple you currently want
> > > to undo. So I really miss to see the big difference.
> >
> > It will not be guaranteed anyway as soon as we start removing
> > tuples without exclusive access to relation.
> >
> > And, I cannot say that I would implement UNDO because of
> > 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> > but because of ALL of 1., 2., 4.
>
> OK, I understand your reasoning here, but I want to make a comment.
>
> Looking at the previous features you added, like subqueries, MVCC, or
> WAL, these were major features that greatly enhanced the system's
> capabilities.
>
> Now, looking at UNDO, I just don't see it in the same league as those
> other additions.

Hmm hasn't it been an agreement ? I know UNDO was planned
for 7.0 and I've never heard objections about it until
recently. People also have referred to an overwriting smgr
easily. Please tell me how to introduce an overwriting smgr
without UNDO.

regards,
Hiroshi Inoue

From pgsql-hackers-owner+M9233@postgresql.org Tue May 22 21:11:29 2001
Return-path: <pgsql-hackers-owner+M9233@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4N1BSQ24335
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 21:11:28 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4N1AsA58638;
	Tue, 22 May 2001 21:10:54 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9233@postgresql.org)
Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4N0rlA52759
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 20:53:47 -0400 (EDT)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4N0rNY17041;
	Tue, 22 May 2001 20:53:23 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200105230053.f4N0rNY17041@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <3B0B091D.A5AF412E@tpf.co.jp> "from Hiroshi Inoue at May 23, 2001
	09:49:33 am"
To: Hiroshi Inoue <Inoue@tpf.co.jp>
Date: Tue, 22 May 2001 20:53:23 -0400 (EDT)
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > Looking at the previous features you added, like subqueries, MVCC, or
> > WAL, these were major features that greatly enhanced the system's
> > capabilities.
> >
> > Now, looking at UNDO, I just don't see it in the same league as those
> > other additions.
>
> Hmm hasn't it been an agreement ? I know UNDO was planned
> for 7.0 and I've never heard objections about it until
> recently. People also have referred to an overwriting smgr
> easily. Please tell me how to introduce an overwriting smgr
> without UNDO.

I guess that is the question.  Are we heading for an overwriting storage
manager?  I didn't see that in Vadim's list of UNDO advantages, but
maybe that is his final goal.  If so UNDO may make sense, but then the
question is how do we keep MVCC with an overwriting storage manager?

The only way I can see doing it is to throw the old tuples into the WAL
and have backends read through that for MVCC info.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From Inoue@tpf.co.jp Tue May 22 23:04:52 2001
Return-path: <Inoue@tpf.co.jp>
Received: from sd.tpf.co.jp (IDENT:qmailr@sd.tpf.co.jp [210.161.239.34])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4N34nQ29601
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 23:04:50 -0400 (EDT)
Received: (qmail 11698 invoked from network); 23 May 2001 03:04:48 -0000
Received: from unknown (HELO viscomail.tpf.co.jp) (100.0.0.108)
  by sd2.10.0.100.in-addr.arpa with SMTP; 23 May 2001 03:04:48 -0000
Received: from tpf.co.jp (3d_note1 [126.0.1.61])
	by viscomail.tpf.co.jp (8.8.8+Sun/8.8.8) with ESMTP id MAA10968;
	Wed, 23 May 2001 12:04:47 +0900 (JST)
Message-ID: <3B0B28F3.47F70E0F@tpf.co.jp>
Date: Wed, 23 May 2001 12:05:23 +0900
From: Hiroshi Inoue <Inoue@tpf.co.jp>
X-Mailer: Mozilla 4.73 [ja] (Windows NT 5.0; U)
X-Accept-Language: ja
MIME-Version: 1.0
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <200105230053.f4N0rNY17041@candle.pha.pa.us>
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
Status: OR


Bruce Momjian wrote:
>
> > > Looking at the previous features you added, like subqueries, MVCC, or
> > > WAL, these were major features that greatly enhanced the system's
> > > capabilities.
> > >
> > > Now, looking at UNDO, I just don't see it in the same league as those
> > > other additions.
> >
> > Hmm hasn't it been an agreement ? I know UNDO was planned
> > for 7.0 and I've never heard objections about it until
> > recently. People also have referred to an overwriting smgr
> > easily. Please tell me how to introduce an overwriting smgr
> > without UNDO.
>
> I guess that is the question.  Are we heading for an overwriting storage
> manager?

I've never heard that it was given up. So there seems to be
at least a possibility to introduce it in the future.
PostgreSQL could have lived without UNDO due to its no
overwrite smgr. I don't know if avoiding UNDO is possible
to implement partial rollback(I don't think it's easy
anyway). However it seems harmful for the future
implementation of an overwriting smgr if we would
introduce it.

> I didn't see that in Vadim's list of UNDO advantages, but
> maybe that is his final goal.
> If so UNDO may make sense, but then the
> question is how do we keep MVCC with an overwriting storage manager?
>

It doesn't seem easy. ISTM it's one of the main reason we
couldn't introduce an overwriting smgr in 7.2.

regards,
Hiroshi Inoue

From pgsql-hackers-owner+M9241@postgresql.org Tue May 22 23:20:08 2001
Return-path: <pgsql-hackers-owner+M9241@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4N3K7Q00337
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 23:20:07 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4N3JgA97799;
	Tue, 22 May 2001 23:19:42 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9241@postgresql.org)
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4N34tA93250
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 23:04:56 -0400 (EDT)
	(envelope-from Inoue@tpf.co.jp)
Received: (qmail 11698 invoked from network); 23 May 2001 03:04:48 -0000
Received: from unknown (HELO viscomail.tpf.co.jp) (100.0.0.108)
  by sd2.10.0.100.in-addr.arpa with SMTP; 23 May 2001 03:04:48 -0000
Received: from tpf.co.jp (3d_note1 [126.0.1.61])
	by viscomail.tpf.co.jp (8.8.8+Sun/8.8.8) with ESMTP id MAA10968;
	Wed, 23 May 2001 12:04:47 +0900 (JST)
Message-ID: <3B0B28F3.47F70E0F@tpf.co.jp>
Date: Wed, 23 May 2001 12:05:23 +0900
From: Hiroshi Inoue <Inoue@tpf.co.jp>
X-Mailer: Mozilla 4.73 [ja] (Windows NT 5.0; U)
X-Accept-Language: ja
MIME-Version: 1.0
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <200105230053.f4N0rNY17041@candle.pha.pa.us>
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


Bruce Momjian wrote:
>
> > > Looking at the previous features you added, like subqueries, MVCC, or
> > > WAL, these were major features that greatly enhanced the system's
> > > capabilities.
> > >
> > > Now, looking at UNDO, I just don't see it in the same league as those
> > > other additions.
> >
> > Hmm hasn't it been an agreement ? I know UNDO was planned
> > for 7.0 and I've never heard objections about it until
> > recently. People also have referred to an overwriting smgr
> > easily. Please tell me how to introduce an overwriting smgr
> > without UNDO.
>
> I guess that is the question.  Are we heading for an overwriting storage
> manager?

I've never heard that it was given up. So there seems to be
at least a possibility to introduce it in the future.
PostgreSQL could have lived without UNDO due to its no
overwrite smgr. I don't know if avoiding UNDO is possible
to implement partial rollback(I don't think it's easy
anyway). However it seems harmful for the future
implementation of an overwriting smgr if we would
introduce it.

> I didn't see that in Vadim's list of UNDO advantages, but
> maybe that is his final goal.
> If so UNDO may make sense, but then the
> question is how do we keep MVCC with an overwriting storage manager?
>

It doesn't seem easy. ISTM it's one of the main reason we
couldn't introduce an overwriting smgr in 7.2.

regards,
Hiroshi Inoue

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pjw@rhyme.com.au Wed May 23 05:01:44 2001
Return-path: <pjw@rhyme.com.au>
Received: from acheron.rime.com.au (albatr.lnk.telstra.net [139.130.54.222])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4N91YQ10467
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 05:01:42 -0400 (EDT)
Received: from oberon ([203.8.195.100])
	by acheron.rime.com.au (8.11.2/8.11.2/SuSE Linux 8.11.1-0.5) with SMTP id f4N8x1K02874;
	Wed, 23 May 2001 18:59:17 +1000
Message-ID: <3.0.5.32.20010523185858.00c24290@mail.rhyme.com.au>
X-Sender: pjw@mail.rhyme.com.au
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Wed, 23 May 2001 18:58:58 +1000
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>
From: Philip Warner <pjw@rhyme.com.au>
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
cc: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
In-Reply-To: <3705826352029646A3E91C53F7189E32016648@sectorbase2.sectorb
	ase.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: OR

At 14:33 22/05/01 -0700, Mikheev, Vadim wrote:
>
>If community will not like UNDO then I'll probably try to implement
>dead space collector which will read log files and so on.

I'd vote for UNDO; in terms of usability & friendliness it's a big win.
Tom's plans for FSM etc are, at least, going to get us some useful data,
and at best will mean we can hang of WAL based FSM for a few versions.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

From pgsql-hackers-owner+M9249@postgresql.org Wed May 23 06:18:40 2001
Return-path: <pgsql-hackers-owner+M9249@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4NAIeQ14309
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 06:18:40 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4NAHnA87259;
	Wed, 23 May 2001 06:17:49 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9249@postgresql.org)
Received: from taru.tm.ee (taru.tm.ee [194.204.62.23])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4NA95A83396
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 06:09:07 -0400 (EDT)
	(envelope-from hannu@tm.ee)
Received: from tm.ee (localhost.localdomain [127.0.0.1])
	by taru.tm.ee (8.11.2/8.11.2) with ESMTP id f4NABnI19798;
	Wed, 23 May 2001 12:11:49 +0200
Message-ID: <3B0B8CE5.1D85B52D@tm.ee>
Date: Wed, 23 May 2001 12:11:49 +0200
From: Hannu Krosing <hannu@tm.ee>
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2-2 i686)
X-Accept-Language: et, en, ru
MIME-Version: 1.0
To: Hiroshi Inoue <Inoue@tpf.co.jp>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <200105221747.f4MHlF409586@candle.pha.pa.us> <3B0B091D.A5AF412E@tpf.co.jp>
Content-Type: text/plain; charset=iso-8859-15
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

Hiroshi Inoue wrote:
>
> People also have referred to an overwriting smgr easily.

I am all for an overwriting smgr, but as a feature that can be selected
on a table-by table basis (or at least in compile time), not as an
overall change

> Please tell me how to introduce an overwriting smgr
> without UNDO.

I would much more like a dead-space-reusing smgr on top of MVCC which
does
not touch live transactions.

------------------
Hannu

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From tgl@sss.pgh.pa.us Wed May 23 15:31:29 2001
Return-path: <tgl@sss.pgh.pa.us>
Received: from sss.pgh.pa.us (tgl@sss.pgh.pa.us [216.151.103.158])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4NJVSQ00678
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 15:31:28 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f4NJVAR16596;
	Wed, 23 May 2001 15:31:12 -0400 (EDT)
To: Hiroshi Inoue <Inoue@tpf.co.jp>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   "Mikheev,
    Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <3B0B28F3.47F70E0F@tpf.co.jp>
References: <200105230053.f4N0rNY17041@candle.pha.pa.us> <3B0B28F3.47F70E0F@tpf.co.jp>
Comments: In-reply-to Hiroshi Inoue <Inoue@tpf.co.jp>
	message dated "Wed, 23 May 2001 12:05:23 +0900"
Date: Wed, 23 May 2001 15:31:10 -0400
Message-ID: <16593.990646270@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR

Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>> I guess that is the question.  Are we heading for an overwriting storage
>> manager?

> I've never heard that it was given up. So there seems to be
> at least a possibility to introduce it in the future.

Unless we want to abandon MVCC (which I don't), I think an overwriting
smgr is impractical.  We need a more complex space-reuse scheme than
that.

			regards, tom lane

From Inoue@tpf.co.jp Wed May 23 19:14:31 2001
Return-path: <Inoue@tpf.co.jp>
Received: from sd.tpf.co.jp (IDENT:qmailr@sd.tpf.co.jp [210.161.239.34])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4NNETQ22521
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 19:14:30 -0400 (EDT)
Received: (qmail 15859 invoked from network); 23 May 2001 23:14:29 -0000
Received: from unknown (HELO viscomail.tpf.co.jp) (100.0.0.108)
  by sd2.10.0.100.in-addr.arpa with SMTP; 23 May 2001 23:14:29 -0000
Received: from tpf.co.jp (3d_note1 [126.0.1.61])
	by viscomail.tpf.co.jp (8.8.8+Sun/8.8.8) with ESMTP id IAA11547;
	Thu, 24 May 2001 08:14:28 +0900 (JST)
Message-ID: <3B0C447A.E6EF4AF3@tpf.co.jp>
Date: Thu, 24 May 2001 08:15:06 +0900
From: Hiroshi Inoue <Inoue@tpf.co.jp>
X-Mailer: Mozilla 4.73 [ja] (Windows NT 5.0; U)
X-Accept-Language: ja
MIME-Version: 1.0
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <200105230053.f4N0rNY17041@candle.pha.pa.us> <3B0B28F3.47F70E0F@tpf.co.jp> <16593.990646270@sss.pgh.pa.us>
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
Status: OR

Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> >> I guess that is the question.  Are we heading for an overwriting storage
> >> manager?
>
> > I've never heard that it was given up. So there seems to be
> > at least a possibility to introduce it in the future.
>
> Unless we want to abandon MVCC (which I don't), I think an overwriting
> smgr is impractical.

Impractical ? Oracle does it.

> We need a more complex space-reuse scheme than
> that.
>

IMHO we have to decide which to go now.
As I already mentioned, changing current handling
of transactionId/CommandId to avoid UNDO is not
only useless but also harmful for an overwriting
smgr.

regards,
Hiroshi Inoue

From dhogaza@pacifier.com Wed May 23 19:25:51 2001
Return-path: <dhogaza@pacifier.com>
Received: from asteroid.pacifier.com (asteroid.pacifier.com [199.2.117.154])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4NNPoQ22976
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 19:25:50 -0400 (EDT)
Received: from desktop (dsl-dhogaza.pacifier.net [207.202.226.68])
	by asteroid.pacifier.com (8.11.2/8.11.1) with SMTP id f4NNOuM04072;
	Wed, 23 May 2001 16:24:57 -0700 (PDT)
Message-ID: <3.0.1.32.20010523162448.01797330@mail.pacifier.com>
X-Sender: dhogaza@mail.pacifier.com
X-Mailer: Windows Eudora Pro Version 3.0.1 (32)
Date: Wed, 23 May 2001 16:24:48 -0700
To: Hiroshi Inoue <Inoue@tpf.co.jp>, Tom Lane <tgl@sss.pgh.pa.us>
From: Don Baccus <dhogaza@pacifier.com>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
In-Reply-To: <3B0C447A.E6EF4AF3@tpf.co.jp>
References: <200105230053.f4N0rNY17041@candle.pha.pa.us>
	<3B0B28F3.47F70E0F@tpf.co.jp>
	<16593.990646270@sss.pgh.pa.us>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: OR

At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:

>> Unless we want to abandon MVCC (which I don't), I think an overwriting
>> smgr is impractical.
>
>Impractical ? Oracle does it.

It's not easy, though ... the current PG scheme has the advantage of being
relatively simple and probably more efficient than scanning logs like
Oracle has to do (assuming your datafiles aren't thoroughly clogged with
old dead tuples).

Has anyone looked at InterBase for hints for space-reusing strategies?

As I understand it, they have a tuple-versioning scheme similar to PG's.

If nothing else, something might be learned as to the efficiency and
effectiveness of one particular approach to solving the problem.


- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.

From pgsql-hackers-owner+M9291@postgresql.org Wed May 23 22:29:59 2001
Return-path: <pgsql-hackers-owner+M9291@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4O2TxQ08894
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 22:29:59 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4O2TWA68746;
	Wed, 23 May 2001 22:29:32 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9291@postgresql.org)
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4O2LVA66613
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 22:21:31 -0400 (EDT)
	(envelope-from Inoue@tpf.co.jp)
Received: (qmail 29618 invoked from network); 24 May 2001 02:21:25 -0000
Received: from unknown (HELO viscomail.tpf.co.jp) (100.0.0.108)
  by sd2.10.0.100.in-addr.arpa with SMTP; 24 May 2001 02:21:25 -0000
Received: from tpf.co.jp (3d_note1 [126.0.1.61])
	by viscomail.tpf.co.jp (8.8.8+Sun/8.8.8) with ESMTP id LAA11655;
	Thu, 24 May 2001 11:21:23 +0900 (JST)
Message-ID: <3B0C7048.902DD407@tpf.co.jp>
Date: Thu, 24 May 2001 11:22:00 +0900
From: Hiroshi Inoue <Inoue@tpf.co.jp>
X-Mailer: Mozilla 4.73 [ja] (Windows NT 5.0; U)
X-Accept-Language: ja
MIME-Version: 1.0
To: Don Baccus <dhogaza@pacifier.com>
cc: Tom Lane <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>,
   "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <200105230053.f4N0rNY17041@candle.pha.pa.us>
		 <3B0B28F3.47F70E0F@tpf.co.jp>
		 <16593.990646270@sss.pgh.pa.us> <3.0.1.32.20010523162448.01797330@mail.pacifier.com>
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

Don Baccus wrote:
>
> At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:
>
> >> Unless we want to abandon MVCC (which I don't), I think an overwriting
> >> smgr is impractical.
> >
> >Impractical ? Oracle does it.
>
> It's not easy, though ... the current PG scheme has the advantage of being
> relatively simple and probably more efficient than scanning logs like
> Oracle has to do (assuming your datafiles aren't thoroughly clogged with
> old dead tuples).
>

I think so too. I've never said that an overwriting smgr
is easy and I don't love it particularily.

What I'm objecting is to avoid UNDO without giving up
an overwriting smgr. We shouldn't be noncommittal now.

regards,
Hiroshi Inoue

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From Inoue@tpf.co.jp Wed May 23 22:21:26 2001
Return-path: <Inoue@tpf.co.jp>
Received: from sd.tpf.co.jp (IDENT:qmailr@sd.tpf.co.jp [210.161.239.34])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4O2LPQ08631
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 22:21:25 -0400 (EDT)
Received: (qmail 29618 invoked from network); 24 May 2001 02:21:25 -0000
Received: from unknown (HELO viscomail.tpf.co.jp) (100.0.0.108)
  by sd2.10.0.100.in-addr.arpa with SMTP; 24 May 2001 02:21:25 -0000
Received: from tpf.co.jp (3d_note1 [126.0.1.61])
	by viscomail.tpf.co.jp (8.8.8+Sun/8.8.8) with ESMTP id LAA11655;
	Thu, 24 May 2001 11:21:23 +0900 (JST)
Message-ID: <3B0C7048.902DD407@tpf.co.jp>
Date: Thu, 24 May 2001 11:22:00 +0900
From: Hiroshi Inoue <Inoue@tpf.co.jp>
X-Mailer: Mozilla 4.73 [ja] (Windows NT 5.0; U)
X-Accept-Language: ja
MIME-Version: 1.0
To: Don Baccus <dhogaza@pacifier.com>
cc: Tom Lane <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>,
   "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <200105230053.f4N0rNY17041@candle.pha.pa.us>
		 <3B0B28F3.47F70E0F@tpf.co.jp>
		 <16593.990646270@sss.pgh.pa.us> <3.0.1.32.20010523162448.01797330@mail.pacifier.com>
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
Status: OR

Don Baccus wrote:
>
> At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:
>
> >> Unless we want to abandon MVCC (which I don't), I think an overwriting
> >> smgr is impractical.
> >
> >Impractical ? Oracle does it.
>
> It's not easy, though ... the current PG scheme has the advantage of being
> relatively simple and probably more efficient than scanning logs like
> Oracle has to do (assuming your datafiles aren't thoroughly clogged with
> old dead tuples).
>

I think so too. I've never said that an overwriting smgr
is easy and I don't love it particularily.

What I'm objecting is to avoid UNDO without giving up
an overwriting smgr. We shouldn't be noncommittal now.

regards,
Hiroshi Inoue

From dhogaza@pacifier.com Thu May 24 08:55:51 2001
Return-path: <dhogaza@pacifier.com>
Received: from asteroid.pacifier.com (asteroid.pacifier.com [199.2.117.154])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4OCtoQ02711
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 08:55:51 -0400 (EDT)
Received: from desktop (dsl-dhogaza.pacifier.net [207.202.226.68])
	by asteroid.pacifier.com (8.11.2/8.11.1) with SMTP id f4OCt2M03955;
	Thu, 24 May 2001 05:55:02 -0700 (PDT)
Message-ID: <3.0.1.32.20010523214243.017aab70@mail.pacifier.com>
X-Sender: dhogaza@mail.pacifier.com
X-Mailer: Windows Eudora Pro Version 3.0.1 (32)
Date: Wed, 23 May 2001 21:42:43 -0700
To: Tom Lane <tgl@sss.pgh.pa.us>, Hiroshi Inoue <Inoue@tpf.co.jp>
From: Don Baccus <dhogaza@pacifier.com>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   "Mikheev,    Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
In-Reply-To: <26782.990673338@sss.pgh.pa.us>
References: <3B0C447A.E6EF4AF3@tpf.co.jp>
	<200105230053.f4N0rNY17041@candle.pha.pa.us>
	<3B0B28F3.47F70E0F@tpf.co.jp>
	<16593.990646270@sss.pgh.pa.us>
	<3B0C447A.E6EF4AF3@tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: OR

At 11:02 PM 5/23/01 -0400, Tom Lane wrote:
>Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>> Tom Lane wrote:
>>> Unless we want to abandon MVCC (which I don't), I think an overwriting
>>> smgr is impractical.
>
>> Impractical ? Oracle does it.
>
>Oracle has MVCC?

With restrictions, yes.  You didn't know that?  Vadim did ...


- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.

From pgsql-hackers-owner+M9319@postgresql.org Thu May 24 13:21:55 2001
Return-path: <pgsql-hackers-owner+M9319@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4OHLtt18473
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 13:21:55 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4OHLMA41708;
	Thu, 24 May 2001 13:21:22 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9319@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4OH1fA33215
	for <pgsql-hackers@postgresql.org>; Thu, 24 May 2001 13:01:42 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 84033 invoked by uid 503); 24 May 2001 17:01:41 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 24 May 2001 17:01:41 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDGHJ>; Thu, 24 May 2001 10:00:31 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016650@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   "'Zeugswetter Andreas SB'"
  <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Thu, 24 May 2001 10:00:31 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> >> Impractical ? Oracle does it.
> >
> >Oracle has MVCC?
>
> With restrictions, yes.

What restrictions? Rollback segments size?
Non-overwriting smgr can eat all disk space...

> You didn't know that?  Vadim did ...

Didn't I mention a few times that I was
inspired by Oracle? -:)

Vadim

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From pgsql-hackers-owner+M9322@postgresql.org Thu May 24 13:49:18 2001
Return-path: <pgsql-hackers-owner+M9322@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4OHnIt19501
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 13:49:18 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4OHmoA53643;
	Thu, 24 May 2001 13:48:50 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9322@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4OHVoA46193
	for <pgsql-hackers@postgresql.org>; Thu, 24 May 2001 13:31:51 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 90475 invoked by uid 503); 24 May 2001 17:31:50 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 24 May 2001 17:31:50 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDGKJ>; Thu, 24 May 2001 10:30:40 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016651@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>, Hiroshi Inoue <Inoue@tpf.co.jp>
cc: The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Thu, 24 May 2001 10:30:39 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> If PostgreSQL wants to stay MVCC, then we should imho forget
> "overwriting smgr" very fast.
>
> Let me try to list the pros and cons that I can think of:
> Pro:
> 	no index modification if key stays same
> 	no search for free space for update (if tuple still
>         fits into page)
> 	no pg_log
> Con:
> 	additional IO to write "before image" to rollback segment
> 		(every before image, not only first after checkpoint)
> 		(also before image of every index page that is updated !)

I don't think that Oracle writes entire page as before image - just
tuple data and some control info. As for additional IO - we'll do it
anyway to remove "before image" (deleted tuple data) from data files.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

From vmikheev@SECTORBASE.COM Thu May 24 13:31:55 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4OHVtt18902
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 13:31:55 -0400 (EDT)
Received: (qmail 90475 invoked by uid 503); 24 May 2001 17:31:50 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 24 May 2001 17:31:50 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDGKJ>; Thu, 24 May 2001 10:30:40 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016651@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>, Hiroshi Inoue <Inoue@tpf.co.jp>
cc: The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Thu, 24 May 2001 10:30:39 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR

> If PostgreSQL wants to stay MVCC, then we should imho forget
> "overwriting smgr" very fast.
>
> Let me try to list the pros and cons that I can think of:
> Pro:
> 	no index modification if key stays same
> 	no search for free space for update (if tuple still
>         fits into page)
> 	no pg_log
> Con:
> 	additional IO to write "before image" to rollback segment
> 		(every before image, not only first after checkpoint)
> 		(also before image of every index page that is updated !)

I don't think that Oracle writes entire page as before image - just
tuple data and some control info. As for additional IO - we'll do it
anyway to remove "before image" (deleted tuple data) from data files.

Vadim

From pgsql-hackers-owner+M9327@postgresql.org Thu May 24 14:23:44 2001
Return-path: <pgsql-hackers-owner+M9327@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4OINit21100
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 14:23:44 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4OINAA69817;
	Thu, 24 May 2001 14:23:10 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9327@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4OHwVA57438
	for <pgsql-hackers@postgresql.org>; Thu, 24 May 2001 13:58:31 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 96204 invoked by uid 503); 24 May 2001 17:58:30 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 24 May 2001 17:58:30 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDGMN>; Thu, 24 May 2001 10:57:20 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016652@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Hiroshi Inoue'" <Inoue@tpf.co.jp>, Don Baccus <dhogaza@pacifier.com>
cc: Tom Lane <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Thu, 24 May 2001 10:57:19 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-2022-jp"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> I think so too. I've never said that an overwriting smgr
> is easy and I don't love it particularily.
>
> What I'm objecting is to avoid UNDO without giving up
> an overwriting smgr. We shouldn't be noncommittal now.

Why not? We could decide to do overwriting smgr later
and implement UNDO then. For the moment we could just
change checkpointer to use checkpoint.redo instead of
checkpoint.undo when defining what log files should be
deleted - it's a few minutes deal, and so is changing it
back.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From dhogaza@pacifier.com Mon May 28 10:42:51 2001
Return-path: <dhogaza@pacifier.com>
Received: from comet.pacifier.com (comet.pacifier.com [199.2.117.155])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4SEgog06154
	for <pgman@candle.pha.pa.us>; Mon, 28 May 2001 10:42:50 -0400 (EDT)
Received: from desktop (dsl-dhogaza.pacifier.net [207.202.226.68])
	by comet.pacifier.com (8.11.2/8.11.1) with SMTP id f4SEg2i04695;
	Mon, 28 May 2001 07:42:03 -0700 (PDT)
Message-ID: <3.0.1.32.20010524111646.01776100@mail.pacifier.com>
X-Sender: dhogaza@mail.pacifier.com
X-Mailer: Windows Eudora Pro Version 3.0.1 (32)
Date: Thu, 24 May 2001 11:16:46 -0700
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>
From: Don Baccus <dhogaza@pacifier.com>
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
In-Reply-To: <3705826352029646A3E91C53F7189E32016650@sectorbase2.sectorb
	ase.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: OR

At 10:00 AM 5/24/01 -0700, Mikheev, Vadim wrote:
>> >> Impractical ? Oracle does it.
>> >
>> >Oracle has MVCC?
>>
>> With restrictions, yes.
>
>What restrictions? Rollback segments size?
>Non-overwriting smgr can eat all disk space...

Actually, the restriction I'm thinking about isn't MVCC related, per
se, but a within-transaction restriction.  The infamous "mutating table"
error.

>> You didn't know that?  Vadim did ...
>
>Didn't I mention a few times that I was
>inspired by Oracle? -:)

Yes, you most certainly have!


- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.

From pgsql-hackers-owner+M9344@postgresql.org Thu May 24 20:00:27 2001
Return-path: <pgsql-hackers-owner+M9344@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4P00Qt19276
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 20:00:27 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4ONxtA85777;
	Thu, 24 May 2001 19:59:55 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9344@postgresql.org)
Received: from rh72.home.ee (adsl895.estpak.ee [213.168.23.133])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4ONtVA84581
	for <pgsql-hackers@postgresql.org>; Thu, 24 May 2001 19:55:31 -0400 (EDT)
	(envelope-from hannu@tm.ee)
Received: from tm.ee (rh72.home.ee [127.0.0.1])
	by rh72.home.ee (8.11.2/8.11.2) with ESMTP id f4OKp6T01940;
	Fri, 25 May 2001 01:51:06 +0500
Message-ID: <3B0D743A.B57B76A0@tm.ee>
Date: Fri, 25 May 2001 01:51:06 +0500
From: Hannu Krosing <hannu@tm.ee>
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2-2 i686)
X-Accept-Language: en, ru, et
MIME-Version: 1.0
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
cc: "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <3705826352029646A3E91C53F7189E32016650@sectorbase2.sectorbase.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

"Mikheev, Vadim" wrote:
>
> > >> Impractical ? Oracle does it.
> > >
> > >Oracle has MVCC?
> >
> > With restrictions, yes.
>
> What restrictions? Rollback segments size?
> Non-overwriting smgr can eat all disk space...

Is'nt the same true for an overwriting smgr ? ;)

> > You didn't know that?  Vadim did ...
>
> Didn't I mention a few times that I was
> inspired by Oracle? -:)

How does it do MVCC with an overwriting storage manager ?

Could it possibly be a Postgres-inspired bolted-on hack
needed for better concurrency ?


BTW, are you aware how Interbase does its MVCC - is it more
like Oracle's way or like PostgreSQL's ?

----------------
Hannu

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9345@postgresql.org Thu May 24 20:14:28 2001
Return-path: <pgsql-hackers-owner+M9345@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4P0ERt20188
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 20:14:27 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4P0DBA89822;
	Thu, 24 May 2001 20:13:12 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9345@postgresql.org)
Received: from rh72.home.ee (adsl895.estpak.ee [213.168.23.133])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4P08qA88602
	for <pgsql-hackers@postgresql.org>; Thu, 24 May 2001 20:08:52 -0400 (EDT)
	(envelope-from hannu@tm.ee)
Received: from tm.ee (rh72.home.ee [127.0.0.1])
	by rh72.home.ee (8.11.2/8.11.2) with ESMTP id f4OL5JT01975;
	Fri, 25 May 2001 02:05:19 +0500
Message-ID: <3B0D778F.8DF75D48@tm.ee>
Date: Fri, 25 May 2001 02:05:19 +0500
From: Hannu Krosing <hannu@tm.ee>
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2-2 i686)
X-Accept-Language: en, ru, et
MIME-Version: 1.0
To: Don Baccus <dhogaza@pacifier.com>
cc: Hiroshi Inoue <Inoue@tpf.co.jp>, Tom Lane <tgl@sss.pgh.pa.us>,
   Bruce Momjian <pgman@candle.pha.pa.us>,
   "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <200105230053.f4N0rNY17041@candle.pha.pa.us>
  <3B0B28F3.47F70E0F@tpf.co.jp>
  <16593.990646270@sss.pgh.pa.us> <3.0.1.32.20010523162448.01797330@mail.pacifier.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

Don Baccus wrote:
>
> At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:
>
> >> Unless we want to abandon MVCC (which I don't), I think an overwriting
> >> smgr is impractical.
> >
> >Impractical ? Oracle does it.
>
> It's not easy, though ... the current PG scheme has the advantage of being
> relatively simple and probably more efficient than scanning logs like
> Oracle has to do (assuming your datafiles aren't thoroughly clogged with
> old dead tuples).
>
> Has anyone looked at InterBase for hints for space-reusing strategies?
>
> As I understand it, they have a tuple-versioning scheme similar to PG's.
>
> If nothing else, something might be learned as to the efficiency and
> effectiveness of one particular approach to solving the problem.

It may also be beneficial to study SapDB (which is IIRC a branch-off of
Adabas) although they claim at http://www.sapdb.org/ in features
section:

NOT supported features:

              Collations

              Result sets that are created within a stored procedure and
fetched outside. This feature is planned to be
              offered in one of the coming releases.
              Meanwhile, use temporary tables.
              see Reference Manual: SAP DB 7.2 and 7.3 -> Data
definition -> CREATE TABLE statement: Owner of a
              table

              Multi version concurrency for OLTP
              It is available with the object extension of SAPDB only.

              Hot stand by
              This feature is planned to be offered in one of the coming
releases.

So MVCC seems to be a bolt-on feature there.

---------------------
Hannu

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From pgsql-hackers-owner+M9343@postgresql.org Thu May 24 19:58:01 2001
Return-path: <pgsql-hackers-owner+M9343@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4ONw0t19025
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 19:58:01 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4ONvPA85057;
	Thu, 24 May 2001 19:57:25 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9343@postgresql.org)
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4ONpSA83441
	for <pgsql-hackers@postgresql.org>; Thu, 24 May 2001 19:51:30 -0400 (EDT)
	(envelope-from Inoue@tpf.co.jp)
Received: (qmail 7484 invoked from network); 24 May 2001 23:51:13 -0000
Received: from unknown (HELO viscomail.tpf.co.jp) (100.0.0.108)
  by sd2.10.0.100.in-addr.arpa with SMTP; 24 May 2001 23:51:13 -0000
Received: from tpf.co.jp (3d_note1 [126.0.1.61])
	by viscomail.tpf.co.jp (8.8.8+Sun/8.8.8) with ESMTP id IAA12184;
	Fri, 25 May 2001 08:51:06 +0900 (JST)
Message-ID: <3B0D9E90.8DB98EFC@tpf.co.jp>
Date: Fri, 25 May 2001 08:51:44 +0900
From: Hiroshi Inoue <Inoue@tpf.co.jp>
X-Mailer: Mozilla 4.73 [ja] (Windows NT 5.0; U)
X-Accept-Language: ja
MIME-Version: 1.0
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
cc: Don Baccus <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Bruce Momjian <pgman@candle.pha.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <3705826352029646A3E91C53F7189E32016652@sectorbase2.sectorbase.com>
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr

"Mikheev, Vadim" wrote:
>
> > I think so too. I've never said that an overwriting smgr
> > is easy and I don't love it particularily.
> >
> > What I'm objecting is to avoid UNDO without giving up
> > an overwriting smgr. We shouldn't be noncommittal now.
>
> Why not? We could decide to do overwriting smgr later
> and implement UNDO then.

What I'm refering to is the discussion about the handling
of subtransactions in order to introduce the savepoints
functionality. Or do we postpone *savepoints* again ?

I realize now few people have had the idea how to switch
to an overwriting smgr. I don't think an overwriting smgr
could be achived at once and we have to prepare one by one
for it.  AFAIK there's no idea how to introduce an overwriting
smgr without UNDO. If we avoid UNDO now when overwriting smgr
would appear ? I also think that the problems Andreas has
specified are pretty serious. I also have known the problems
and I've expected that people have the idea to solve it but
...  I'm inclined to give up an overwriting smgr now.

regards,
Hiroshi Inoue

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9347@postgresql.org Thu May 24 20:31:27 2001
Return-path: <pgsql-hackers-owner+M9347@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4P0VQt21167
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 20:31:26 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4P0V1A94671;
	Thu, 24 May 2001 20:31:01 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9347@postgresql.org)
Received: from acheron.rime.com.au (albatr.lnk.telstra.net [139.130.54.222])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4P0QkA93400
	for <pgsql-hackers@postgresql.org>; Thu, 24 May 2001 20:26:46 -0400 (EDT)
	(envelope-from pjw@rhyme.com.au)
Received: from oberon (Oberon.rime.com.au [203.8.195.100])
	by acheron.rime.com.au (8.11.2/8.11.2/SuSE Linux 8.11.1-0.5) with SMTP id f4P0LbK15911;
	Fri, 25 May 2001 10:21:37 +1000
Message-ID: <3.0.5.32.20010525102137.0395e100@mail.rhyme.com.au>
X-Sender: pjw@mail.rhyme.com.au
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Fri, 25 May 2001 10:21:37 +1000
To: Hannu Krosing <hannu@tm.ee>, "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
From: Philip Warner <pjw@rhyme.com.au>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
cc: "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
In-Reply-To: <3B0D743A.B57B76A0@tm.ee>
References: <3705826352029646A3E91C53F7189E32016650@sectorbase2.sectorbase.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

At 01:51 25/05/01 +0500, Hannu Krosing wrote:
>
>How does it do MVCC with an overwriting storage manager ?
>

I don't know about Oracle, but Dec/RDB also does overwriting and MVCC. It
does this by taking a snapshot of pages that are participating in both RW
and RO transactions (De/RDB has the options on SET TRANSACTION that specify
if the TX will do updates or not). It has the disadvantage that the
snapshot will grow quite large for bulk loads. Typically they are about
10-20% of DB size. Pages are freed from the snapshot as active TXs finish.

Note that the snapshots are separate from the journalling (WAL) and
rollback files.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From pgsql-hackers-owner+M9346@postgresql.org Thu May 24 20:30:01 2001
Return-path: <pgsql-hackers-owner+M9346@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4P0U0t21115
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 20:30:00 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4P0TXA94058;
	Thu, 24 May 2001 20:29:33 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9346@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4P0OdA92709
	for <pgsql-hackers@postgresql.org>; Thu, 24 May 2001 20:24:39 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 31434 invoked by uid 503); 25 May 2001 00:24:31 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 25 May 2001 00:24:31 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDHJ0>; Thu, 24 May 2001 17:23:20 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016655@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Hannu Krosing'" <hannu@tm.ee>
cc: "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Thu, 24 May 2001 17:23:19 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > > >Oracle has MVCC?
> > >
> > > With restrictions, yes.
> >
> > What restrictions? Rollback segments size?
> > Non-overwriting smgr can eat all disk space...
>
> Is'nt the same true for an overwriting smgr ? ;)

Removing dead records from rollback segments should
be faster than from datafiles.

> > > You didn't know that?  Vadim did ...
> >
> > Didn't I mention a few times that I was
> > inspired by Oracle? -:)
>
> How does it do MVCC with an overwriting storage manager ?

1. System Change Number (SCN) is used: system increments it
   on each transaction commit.
2. When scan meets data block with SCN > SCN as it was when
   query/transaction started, old block image is restored
   using rollback segments.

> Could it possibly be a Postgres-inspired bolted-on hack
> needed for better concurrency ?

-:)) Oracle has MVCC for years, probably from the beginning
and for sure before Postgres.

> BTW, are you aware how Interbase does its MVCC - is it more
> like Oracle's way or like PostgreSQL's ?

Like ours.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

From pgsql-hackers-owner+M9348@postgresql.org Thu May 24 21:13:34 2001
Return-path: <pgsql-hackers-owner+M9348@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4P1DYt24746
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 21:13:34 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4P1D9A05820;
	Thu, 24 May 2001 21:13:09 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9348@postgresql.org)
Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4P164A03993
	for <pgsql-hackers@postgresql.org>; Thu, 24 May 2001 21:06:04 -0400 (EDT)
	(envelope-from pgman@candle.pha.pa.us)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.10.1/8.10.1) id f4P108j23173;
	Thu, 24 May 2001 21:00:08 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200105250100.f4P108j23173@candle.pha.pa.us>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <3B0D9E90.8DB98EFC@tpf.co.jp> "from Hiroshi Inoue at May 25, 2001
	08:51:44 am"
To: Hiroshi Inoue <Inoue@tpf.co.jp>
Date: Thu, 24 May 2001 21:00:08 -0400 (EDT)
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   Don Baccus <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL90 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> What I'm refering to is the discussion about the handling
> of subtransactions in order to introduce the savepoints
> functionality. Or do we postpone *savepoints* again ?
>
> I realize now few people have had the idea how to switch
> to an overwriting smgr. I don't think an overwriting smgr
> could be achived at once and we have to prepare one by one
> for it.  AFAIK there's no idea how to introduce an overwriting
> smgr without UNDO. If we avoid UNDO now when overwriting smgr
> would appear ? I also think that the problems Andreas has
> specified are pretty serious. I also have known the problems
> and I've expected that people have the idea to solve it but
> ...  I'm inclined to give up an overwriting smgr now.

Now that everyone has commented on the UNDO issue, I thought I would try
to summarize the comments so we can come to some kind of conclusion.

Here are the issues as I see them:

---------------------------------------------------------------------------

Do we want to keep MVCC?

Yes.  No one has said otherwise.

---------------------------------------------------------------------------

Do we want to head for an overwriting storage manager?

Not sure.

Advantages:  UPDATE has easy space reuse because usually done in-place,
no index change on UPDATE unless key is changed.

Disadvantages:  Old records have to be stored somewhere for MVCC use.
Could limit transaction size.

---------------------------------------------------------------------------

Do we want UNDO _if_ we are heading for an overwriting storage manager?

Everyone seems to say yes.

---------------------------------------------------------------------------

Do we want UNDO if we are _not_ heading for an overwriting storage
manager?

This is the tough one.  UNDO advantages are:

	Make subtransactions easier by rolling back aborted subtransaction.
	Workaround is using a new transactions id for each subtransaction.

	Easy space reuse for aborted transactions.

	Reduce size of pg_log.

UNDO disadvantages are:

	Limit size of transactions to log storage size.

---------------------------------------------------------------------------

If we are heading for an overwriting storage manager, we may as well get
UNDO now.  If we are not, then we have to decide if we can solve the
problems that UNDO would fix.  Basically, can we solve those problems
easier without UNDO, or are the disadvanges of UNDO too great?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M9356@postgresql.org Fri May 25 05:15:44 2001
Return-path: <pgsql-hackers-owner+M9356@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4P9Fht13195
	for <pgman@candle.pha.pa.us>; Fri, 25 May 2001 05:15:43 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4P9EgA03190;
	Fri, 25 May 2001 05:14:42 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9356@postgresql.org)
Received: from rh72.home.ee (adsl895.estpak.ee [213.168.23.133])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4P96nA00120
	for <pgsql-hackers@postgresql.org>; Fri, 25 May 2001 05:06:50 -0400 (EDT)
	(envelope-from hannu@tm.ee)
Received: from tm.ee (rh72.home.ee [127.0.0.1])
	by rh72.home.ee (8.11.2/8.11.2) with ESMTP id f4P61rf01629;
	Fri, 25 May 2001 11:01:54 +0500
Message-ID: <3B0DF551.275F382C@tm.ee>
Date: Fri, 25 May 2001 11:01:53 +0500
From: Hannu Krosing <hannu@tm.ee>
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2-2 i686)
X-Accept-Language: en, ru, et
MIME-Version: 1.0
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
cc: "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <3705826352029646A3E91C53F7189E32016655@sectorbase2.sectorbase.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

"Mikheev, Vadim" wrote:
>
> > > > >Oracle has MVCC?
> > > >
> > > > With restrictions, yes.
> > >
> > > What restrictions? Rollback segments size?
> > > Non-overwriting smgr can eat all disk space...
> >
> > Is'nt the same true for an overwriting smgr ? ;)
>
> Removing dead records from rollback segments should
> be faster than from datafiles.

Is it for better locality or are they stored in a different way ?

Do you think that there is some fundamental performance advantage
in making a copy to rollback segment and then deleting it from
there vs. reusing space in datafiles ?

One thing (not having to updata non-changing index entries) can be
quite substantial under some scenarios, but there are probably ways
to at least speed up part of this by doing other compromizes, perhaps
by saving more info in index leaf (trading lookup speed for space
and insert speed) or chaining data pages (trading insert speed for
(some) space and lookup speed)

> > > > You didn't know that?  Vadim did ...
> > >
> > > Didn't I mention a few times that I was
> > > inspired by Oracle? -:)
> >
> > How does it do MVCC with an overwriting storage manager ?
>
> 1. System Change Number (SCN) is used: system increments it
>    on each transaction commit.
> 2. When scan meets data block with SCN > SCN as it was when
>    query/transaction started, old block image is restored
>    using rollback segments.

You mean it is restored in session that is running the transaction ?

I guess thet it could be slower than our current way of doing it.

> > Could it possibly be a Postgres-inspired bolted-on hack
> > needed for better concurrency ?
>
> -:)) Oracle has MVCC for years, probably from the beginning
> and for sure before Postgres.

In that case we can claim thet their way is more primitive ;) ;)

-----------------
Hannu

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From vmikheev@SECTORBASE.COM Fri May 25 12:38:38 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4PGcbt10779
	for <pgman@candle.pha.pa.us>; Fri, 25 May 2001 12:38:38 -0400 (EDT)
Received: (qmail 95922 invoked by uid 503); 25 May 2001 16:38:29 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 25 May 2001 16:38:29 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDJAL>; Fri, 25 May 2001 09:37:16 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016656@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Bruce Momjian'" <pgman@candle.pha.pa.us>,
   Hiroshi Inoue
  <Inoue@tpf.co.jp>
cc: Don Baccus <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Fri, 25 May 2001 09:37:16 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR

> Do we want to head for an overwriting storage manager?
>
> Not sure.
>
> Advantages:  UPDATE has easy space reuse because usually done
> in-place, no index change on UPDATE unless key is changed.
>
> Disadvantages:  Old records have to be stored somewhere for MVCC use.
> Could limit transaction size.

Really? Why is it assumed that we *must* limit size of rollback segments?
We can let them grow without bounds, as we do now keeping old records in
datafiles and letting them eat all of disk space.

> UNDO disadvantages are:
>
> 	Limit size of transactions to log storage size.

Don't be kidding - in any system transactions size is limitted
by available storage. So we should tell that more disk space
is required for UNDO. From my POV, putting $100 to buy 30Gb
disk is not big deal, keeping in mind that PGSQL requires
$ZERO to be used.

Vadim

From pgsql-hackers-owner+M9365@postgresql.org Fri May 25 13:11:43 2001
Return-path: <pgsql-hackers-owner+M9365@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4PHBht18086
	for <pgman@candle.pha.pa.us>; Fri, 25 May 2001 13:11:43 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4PHBEA00999;
	Fri, 25 May 2001 13:11:14 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9365@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4PGcXA86450
	for <pgsql-hackers@postgresql.org>; Fri, 25 May 2001 12:38:33 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 95922 invoked by uid 503); 25 May 2001 16:38:29 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 25 May 2001 16:38:29 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDJAL>; Fri, 25 May 2001 09:37:16 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016656@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Bruce Momjian'" <pgman@candle.pha.pa.us>,
   Hiroshi Inoue
  <Inoue@tpf.co.jp>
cc: Don Baccus <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Fri, 25 May 2001 09:37:16 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Do we want to head for an overwriting storage manager?
>
> Not sure.
>
> Advantages:  UPDATE has easy space reuse because usually done
> in-place, no index change on UPDATE unless key is changed.
>
> Disadvantages:  Old records have to be stored somewhere for MVCC use.
> Could limit transaction size.

Really? Why is it assumed that we *must* limit size of rollback segments?
We can let them grow without bounds, as we do now keeping old records in
datafiles and letting them eat all of disk space.

> UNDO disadvantages are:
>
> 	Limit size of transactions to log storage size.

Don't be kidding - in any system transactions size is limitted
by available storage. So we should tell that more disk space
is required for UNDO. From my POV, putting $100 to buy 30Gb
disk is not big deal, keeping in mind that PGSQL requires
$ZERO to be used.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

From vmikheev@SECTORBASE.COM Fri May 25 13:06:50 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4PH6ot17912
	for <pgman@candle.pha.pa.us>; Fri, 25 May 2001 13:06:50 -0400 (EDT)
Received: (qmail 5857 invoked by uid 503); 25 May 2001 17:06:45 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 25 May 2001 17:06:45 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDJCN>; Fri, 25 May 2001 10:05:32 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016657@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   "'Don Baccus'"
  <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue
  <Inoue@tpf.co.jp>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   The Hermit Hacker
  <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Fri, 25 May 2001 10:05:31 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR

> > > >Oracle has MVCC?
> > >
> > > With restrictions, yes.
> >
> > What restrictions? Rollback segments size?
>
> No, that is not the whole story. The problem with their
> "rollback segment approach" is, that they do not guard against
> overwriting a tuple version in the rollback segment.
> They simply recycle each segment in a wrap around manner.
> Thus there could be an open transaction that still wanted to
> see a tuple version that was already overwritten, leading to the
> feared "snapshot too old" error.
>
> Copying their "rollback segment" approach is imho the last
> thing we want to do.

So, they limit size of rollback segments and we don't limit
how big our datafiles may grow if there is some long running
transaction in serializable mode. We could allow our rollback
segments to grow without limits as well.

> > Non-overwriting smgr can eat all disk space...
> >
> > > You didn't know that?  Vadim did ...
> >
> > Didn't I mention a few times that I was inspired by Oracle? -:)
>
> Looking at what they supply in the feature area is imho good.
> Copying their technical architecture is not so good in general.

Copying is not inspiration -:)

Vadim

From pgsql-hackers-owner+M9367@postgresql.org Fri May 25 14:01:43 2001
Return-path: <pgsql-hackers-owner+M9367@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4PI1gt20100
	for <pgman@candle.pha.pa.us>; Fri, 25 May 2001 14:01:42 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4PI1GA21327;
	Fri, 25 May 2001 14:01:16 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9367@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4PHrVA17755
	for <pgsql-hackers@postgresql.org>; Fri, 25 May 2001 13:53:31 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 15841 invoked by uid 503); 25 May 2001 17:53:30 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 25 May 2001 17:53:30 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDJF3>; Fri, 25 May 2001 10:52:17 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016658@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Hannu Krosing'" <hannu@tm.ee>
cc: "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Fri, 25 May 2001 10:52:17 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > Removing dead records from rollback segments should
> > be faster than from datafiles.
>
> Is it for better locality or are they stored in a different way ?

Locality - all dead data would be localized in one place.

> Do you think that there is some fundamental performance advantage
> in making a copy to rollback segment and then deleting it from
> there vs. reusing space in datafiles ?

As it showed by WAL additional writes don't mean worse performance.
As for deleting from RS (rollback segment) - we could remove or reuse
RS files as whole.

> > > How does it do MVCC with an overwriting storage manager ?
> >
> > 1. System Change Number (SCN) is used: system increments it
> >    on each transaction commit.
> > 2. When scan meets data block with SCN > SCN as it was when
> >    query/transaction started, old block image is restored
> >    using rollback segments.
>
> You mean it is restored in session that is running the transaction ?
>
> I guess thet it could be slower than our current way of doing it.

Yes, for older transactions which *really* need in *particular*
old data, but not for newer ones. Look - now transactions have to read
dead data again and again, even if some of them (newer) need not to see
those data at all, and we keep dead data as long as required for other
old transactions *just for the case* they will look there.
But who knows?! Maybe those old transactions will not read from table
with big amount of dead data at all! So - why keep dead data in datafiles
for long time? This obviously affects overall system performance.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From pgsql-hackers-owner+M9387@postgresql.org Sun May 27 04:42:32 2001
Return-path: <pgsql-hackers-owner+M9387@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4R8gVa22868
	for <pgman@candle.pha.pa.us>; Sun, 27 May 2001 04:42:32 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4R8fwA13840;
	Sun, 27 May 2001 04:41:58 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9387@postgresql.org)
Received: from p2272.nsk.ne.jp (p2272.nsk.ne.jp [210.145.18.145])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4R8YeA10972
	for <pgsql-hackers@postgresql.org>; Sun, 27 May 2001 04:34:41 -0400 (EDT)
	(envelope-from Inoue@tpf.co.jp)
Received: from mcadnote1 (ppm147.noc.fukui.nsk.ne.jp [210.161.188.66])
	by p2272.nsk.ne.jp (8.9.3/3.7W-20000722) with SMTP id RAA11130;
	Sun, 27 May 2001 17:32:07 +0900 (JST)
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>
cc: "Don Baccus" <dhogaza@pacifier.com>, "Tom Lane" <tgl@sss.pgh.pa.us>,
   "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   "The Hermit Hacker" <scrappy@hub.org>, <pgsql-hackers@postgresql.org>
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Sun, 27 May 2001 17:32:54 +0900
Message-ID: <EKEJJICOHDIEMGPNIFIJKEEBEIAA.Inoue@tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
Importance: Normal
In-Reply-To: <3705826352029646A3E91C53F7189E32016656@sectorbase2.sectorbase.com>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> -----Original Message-----
> From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]
>
> > Do we want to head for an overwriting storage manager?
> >
> > Not sure.
> >
> > Advantages:  UPDATE has easy space reuse because usually done
> > in-place, no index change on UPDATE unless key is changed.
> >
> > Disadvantages:  Old records have to be stored somewhere for MVCC use.
> > Could limit transaction size.
>
> Really? Why is it assumed that we *must* limit size of rollback segments?
> We can let them grow without bounds, as we do now keeping old records in
> datafiles and letting them eat all of disk space.
>

Is it proper/safe for a DBMS to allow the system eating all disk
space ? For example, could we expect to recover the database
even when no disk space available ?

1) even before WAL
    Is 'deleting records and vacuum' always possible ?
    I saw the cases that indexes grow by vacuum.

2) under WAL(current)
    If DROP or VACUUM is done after a checkpoint, wouldn't
    REDO recovery add the pages drop/truncated by the
    DROP/VACUUM ?

3) with rollback data
    Shouldn't WAL log UNDO operations either ?
    If so, UNDO requires an extra disk space which could
    be unlimitedly big.

There's another serious problem. Once UNDO is required
with a biiiig rollback data, it would take a veeery long time
to undo. It's quite different from the current behavior. Even
though people want to cancel the UNDO, there's no way
unfortunately(under an overwriting smgr).

regards,
Hiroshi Inoue

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From vmikheev@sectorbase.com Mon May 28 13:11:10 2001
Return-path: <vmikheev@sectorbase.com>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4SHB9g28092
	for <pgman@candle.pha.pa.us>; Mon, 28 May 2001 13:11:09 -0400 (EDT)
Received: (qmail 21667 invoked by uid 503); 28 May 2001 17:11:03 -0000
Received: from din4.sectorbase.com (HELO dune) (63.88.121.74)
  by gate1.sectorbase.com with SMTP; 28 May 2001 17:11:03 -0000
Message-ID: <007001c0e799$321dcc00$4a79583f@sectorbase.com>
From: "Vadim Mikheev" <vmikheev@sectorbase.com>
To: "Zeugswetter Andreas SB" <ZeugswetterA@wien.spardat.at>,
   "'Hannu Krosing'" <hannu@tm.ee>
cc: "'Don Baccus'" <dhogaza@pacifier.com>, "Tom Lane" <tgl@sss.pgh.pa.us>,
   "Hiroshi Inoue" <Inoue@tpf.co.jp>, "Bruce Momjian" <pgman@candle.pha.pa.us>,
   "The Hermit Hacker" <scrappy@hub.org>, <pgsql-hackers@postgresql.org>
References: <11C1E6749A55D411A9670001FA6879633682F5@sdexcsrv1.f000.d0188.sd.spardat.at>
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 28 May 2001 10:11:10 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Status: OR

> Yes, that is a good description. And old version is only required in the following
> two cases:
>
> 1. the txn that modified this tuple is still open (reader in default committed read)
> 2. reader is in serializable transaction isolation and has earlier xtid
>
> Seems overwrite smgr has mainly advantages in terms of speed for operations
> other than rollback.

... And rollback is required for < 5% transactions ...

Vadim


From hannu@tm.ee Mon May 28 13:37:50 2001
Return-path: <hannu@tm.ee>
Received: from taru.tm.ee (taru.tm.ee [194.204.62.23])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4SHblg17386
	for <pgman@candle.pha.pa.us>; Mon, 28 May 2001 13:37:48 -0400 (EDT)
Received: from tm.ee (localhost.localdomain [127.0.0.1])
	by taru.tm.ee (8.11.2/8.11.2) with ESMTP id f4SHfeI11208;
	Mon, 28 May 2001 19:41:40 +0200
Sender: hannu@taru.tm.ee
Message-ID: <3B128DD4.E15100AB@tm.ee>
Date: Mon, 28 May 2001 19:41:40 +0200
From: Hannu Krosing <hannu@tm.ee>
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2-2 i686)
X-Accept-Language: et, en, ru
MIME-Version: 1.0
To: Vadim Mikheev <vmikheev@sectorbase.com>
cc: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
   "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <11C1E6749A55D411A9670001FA6879633682F5@sdexcsrv1.f000.d0188.sd.spardat.at> <007001c0e799$321dcc00$4a79583f@sectorbase.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Vadim Mikheev wrote:
>
> > Yes, that is a good description. And old version is only required in the following
> > two cases:
> >
> > 1. the txn that modified this tuple is still open (reader in default committed read)
> > 2. reader is in serializable transaction isolation and has earlier xtid
> >
> > Seems overwrite smgr has mainly advantages in terms of speed for operations
> > other than rollback.
>
> ... And rollback is required for < 5% transactions ...

This obviously depends on application.

I know people who rollback most of their transactions (actually they use
it to
emulate temp tables when reporting).

OTOH it is possible to do without rolling back at all as MySQL folks
have
shown us ;)

Also, IIRC, pgbench does no rollbacks. I think that we have no
performance test that does.

-----------------
Hannu

From pgsql-hackers-owner+M9464@postgresql.org Tue May 29 16:40:30 2001
Return-path: <pgsql-hackers-owner+M9464@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4TKeU722464
	for <pgman@candle.pha.pa.us>; Tue, 29 May 2001 16:40:30 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4TKe3E85183;
	Tue, 29 May 2001 16:40:03 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9464@postgresql.org)
Received: from rh72.home.ee (adsl895.estpak.ee [213.168.23.133])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4TKBxE74107
	for <pgsql-hackers@postgresql.org>; Tue, 29 May 2001 16:11:59 -0400 (EDT)
	(envelope-from hannu@tm.ee)
Received: from tm.ee (rh72.home.ee [127.0.0.1])
	by rh72.home.ee (8.11.2/8.11.2) with ESMTP id f4TH7H501651;
	Tue, 29 May 2001 22:07:18 +0500
Message-ID: <3B13D744.10537597@tm.ee>
Date: Tue, 29 May 2001 22:07:16 +0500
From: Hannu Krosing <hannu@tm.ee>
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2-2 i686)
X-Accept-Language: en, ru, et
MIME-Version: 1.0
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
cc: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
   "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Plans for solving the VACUUM problem
References: <3705826352029646A3E91C53F7189E3201665D@sectorbase2.sectorbase.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

"Mikheev, Vadim" wrote:
>
> > I know people who rollback most of their transactions
> > (actually they use it to emulate temp tables when reporting).
>
> Shouldn't they use TEMP tables? -:)

They probably should.

Actually they did it on Oracle, so it shows that it can be done
even with O-smgr ;)

> > OTOH it is possible to do without rolling back at all as
> > MySQL folks have shown us ;)
>
> Not with SDB tables which support transactions.

My point was that MySQL was used quite a long time without it
and still quite many useful applications were produced.

BTW, do you know what strategy is used by BSDDB/SDB for
rollback/undo ?

---------------
Hannu

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From vmikheev@SECTORBASE.COM Tue May 29 13:50:48 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4THol712186
	for <pgman@candle.pha.pa.us>; Tue, 29 May 2001 13:50:47 -0400 (EDT)
Received: (qmail 35525 invoked by uid 503); 29 May 2001 17:50:38 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 29 May 2001 17:50:38 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDR7P>; Tue, 29 May 2001 10:49:12 -0700
Message-ID: <3705826352029646A3E91C53F7189E3201665D@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Hannu Krosing'" <hannu@tm.ee>
cc: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
   "'Don Baccus'"
  <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue
  <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 29 May 2001 10:49:12 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR

> > > Seems overwrite smgr has mainly advantages in terms of
> > > speed for operations other than rollback.
> >
> > ... And rollback is required for < 5% transactions ...
>
> This obviously depends on application.

Small number of aborted transactions was used to show
useless of UNDO in terms of space cleanup - that's why
I use same argument to show usefulness of O-smgr -:)

> I know people who rollback most of their transactions
> (actually they use it to emulate temp tables when reporting).

Shouldn't they use TEMP tables? -:)

> OTOH it is possible to do without rolling back at all as
> MySQL folks have shown us ;)

Not with SDB tables which support transactions.

Vadim

From pgsql-hackers-owner+M9462@postgresql.org Tue May 29 14:12:23 2001
Return-path: <pgsql-hackers-owner+M9462@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4TICN713882
	for <pgman@candle.pha.pa.us>; Tue, 29 May 2001 14:12:23 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4TIBvE28866;
	Tue, 29 May 2001 14:11:57 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9462@postgresql.org)
Received: from comet.pacifier.com (comet.pacifier.com [199.2.117.155])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4THvaE21886
	for <pgsql-hackers@postgresql.org>; Tue, 29 May 2001 13:57:36 -0400 (EDT)
	(envelope-from dhogaza@pacifier.com)
Received: from desktop (dsl-dhogaza.pacifier.net [207.202.226.68])
	by comet.pacifier.com (8.11.2/8.11.1) with SMTP id f4THtri06016;
	Tue, 29 May 2001 10:55:57 -0700 (PDT)
Message-ID: <3.0.1.32.20010529105533.016b9100@mail.pacifier.com>
X-Sender: dhogaza@mail.pacifier.com
X-Mailer: Windows Eudora Pro Version 3.0.1 (32)
Date: Tue, 29 May 2001 10:55:33 -0700
To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Hannu Krosing'" <hannu@tm.ee>
From: Don Baccus <dhogaza@pacifier.com>
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
cc: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
   Tom Lane <tgl@sss.pgh.pa.us>, Hiroshi Inoue <Inoue@tpf.co.jp>,
   Bruce Momjian <pgman@candle.pha.pa.us>, The Hermit Hacker <scrappy@hub.org>,
   pgsql-hackers@postgresql.org
In-Reply-To: <3705826352029646A3E91C53F7189E3201665D@sectorbase2.sectorb
	ase.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

At 10:49 AM 5/29/01 -0700, Mikheev, Vadim wrote:

>> I know people who rollback most of their transactions
>> (actually they use it to emulate temp tables when reporting).
>
>Shouldn't they use TEMP tables? -:)

Which is a very good point.  Pandering to poor practice at the
expense of good performance for better-designed applications
isn't a good idea.


- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From pgsql-hackers-owner+M9466@postgresql.org Tue May 29 17:15:46 2001
Return-path: <pgsql-hackers-owner+M9466@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4TLFk725188
	for <pgman@candle.pha.pa.us>; Tue, 29 May 2001 17:15:46 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4TLFLE98211;
	Tue, 29 May 2001 17:15:21 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9466@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4TKcWE84385
	for <pgsql-hackers@postgresql.org>; Tue, 29 May 2001 16:38:32 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 71617 invoked by uid 503); 29 May 2001 20:38:31 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 29 May 2001 20:38:31 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDSJA>; Tue, 29 May 2001 13:37:04 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016660@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Hannu Krosing'" <hannu@tm.ee>
cc: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
   "'Don Baccus'"
  <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue
  <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 29 May 2001 13:37:03 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > > OTOH it is possible to do without rolling back at all as
> > > MySQL folks have shown us ;)
> >
> > Not with SDB tables which support transactions.
>
> My point was that MySQL was used quite a long time without it
> and still quite many useful applications were produced.

And my point was that needless to talk about rollbacks in
non-transaction system and in transaction system one has to
implement rollback somehow.

> BTW, do you know what strategy is used by BSDDB/SDB for
> rollback/undo ?

AFAIR, they use O-smgr => UNDO is required.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From pgsql-hackers-owner+M9119@postgresql.org Mon May 21 13:38:38 2001
Return-path: <pgsql-hackers-owner+M9119@postgresql.org>
Received: from west.navpoint.com (root@west.navpoint.com [207.106.42.13])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LHccQ02858
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 13:38:38 -0400 (EDT)
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by west.navpoint.com (8.11.3/8.10.1) with ESMTP id f4LFCNv10580
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 11:12:23 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LF5AA60093;
	Mon, 21 May 2001 11:05:10 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9119@postgresql.org)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4LEj6A50541
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 10:45:07 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4LEj0930147
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 16:45:00 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGS6NA>; Mon, 21 May 2001 16:44:44 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682DA@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Mikheev, Vadim'" <vmikheev@SECTORBASE.COM>,
   "'Bruce Momjian'"
  <pgman@candle.pha.pa.us>
cc: "'Tom Lane'" <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Subject: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 16:44:42 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> > Vadim, can you remind me what UNDO is used for?
> 4. Split pg_log into small files with ability to remove old ones (which
>    do not hold statuses for any running transactions).

They are already small (16Mb). Or do you mean even smaller ?
This imposes one huge risk, that is already a pain in other db's. You need
all logs of one transaction online. For a GigaByte transaction like a bulk
insert this can be very inconvenient.
Imho there should be some limit where you can choose whether you want
to continue without the feature (no savepoint) or are automatically aborted.

In any case, imho some thought should be put into this :-)

Another case where this is a problem is a client that starts a tx, does one little
insert or update on his private table, and then sits and waits for a day.

Both cases currently impose no problem whatsoever.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From ZeugswetterA@wien.spardat.at Mon May 21 13:37:56 2001
Return-path: <ZeugswetterA@wien.spardat.at>
Received: from west.navpoint.com (root@west.navpoint.com [207.106.42.13])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LHbuQ02280
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 13:37:56 -0400 (EDT)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by west.navpoint.com (8.11.3/8.10.1) with ESMTP id f4LGPOv03505
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 12:25:24 -0400 (EDT)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4LGC8921002
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 18:12:08 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGS694>; Mon, 21 May 2001 18:11:26 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682E2@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Vadim Mikheev'" <vmikheev@SECTORBASE.COM>,
   The Hermit Hacker
  <scrappy@hub.org>
cc: Tom Lane <tgl@sss.pgh.pa.us>, "'Bruce Momjian'" <pgman@candle.pha.pa.us>,
   pgsql-hackers@postgresql.org
Subject: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 18:11:16 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR


> My point is that we'll need in dynamic cleanup anyway and UNDO is
> what should be implemented for dynamic cleanup of aborted changes.

I do not yet understand why you want to handle aborts different than outdated
tuples. The ratio in a well tuned system should well favor outdated tuples.
If someone ever adds "dirty read" it is also not the case that it is guaranteed,
that nobody accesses the tuple you currently want to undo. So I really miss to see
the big difference.

Andreas

From pgsql-hackers-owner+M9153@postgresql.org Mon May 21 16:27:39 2001
Return-path: <pgsql-hackers-owner+M9153@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LKRdQ22261
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 16:27:39 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LKPaA50766;
	Mon, 21 May 2001 16:25:36 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9153@postgresql.org)
Received: from ns.sharemation.com (h-64-105-36-191.snvacaid.covad.net [64.105.36.191])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4LKKZA48741
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 16:20:35 -0400 (EDT)
	(envelope-from barry@xythos.com)
Received: from xythos.com ([192.168.254.19])
	by ns.sharemation.com (8.9.3/8.8.7) with ESMTP id MAA32032;
	Mon, 21 May 2001 12:05:12 -0700
Message-ID: <3B09783C.1080508@xythos.com>
Date: Mon, 21 May 2001 13:19:08 -0700
From: Barry Lind <barry@xythos.com>
User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.16-22 i686; en-US; m18) Gecko/20010131 Netscape6/6.01
X-Accept-Language: en
MIME-Version: 1.0
To: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
   pgsql-hackers@postgresql.org
Subject: Re: AW: [HACKERS] Plans for solving the VACUUM problem
References: <11C1E6749A55D411A9670001FA6879633682DA@sdexcsrv1.f000.d0188.sd.spardat.at>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


Zeugswetter Andreas SB wrote:

>>> Vadim, can you remind me what UNDO is used for?
>>
>> 4. Split pg_log into small files with ability to remove old ones (which
>>    do not hold statuses for any running transactions).
>
>
> They are already small (16Mb). Or do you mean even smaller ?
> This imposes one huge risk, that is already a pain in other db's. You need
> all logs of one transaction online. For a GigaByte transaction like a bulk
> insert this can be very inconvenient.
> Imho there should be some limit where you can choose whether you want
> to continue without the feature (no savepoint) or are automatically aborted.
>
> In any case, imho some thought should be put into this :-)
>
> Another case where this is a problem is a client that starts a tx, does one little
> insert or update on his private table, and then sits and waits for a day.
>
> Both cases currently impose no problem whatsoever.

Correct me if I am wrong, but both cases do present a problem currently
in 7.1.  The WAL log will not remove any WAL files for transactions that
are still open (even after a checkpoint occurs).  Thus if you do a bulk
insert of gigabyte size you will require a gigabyte sized WAL
directory.  Also if you have a simple OLTP transaction that the user
started and walked away from for his one week vacation, then no WAL log
files can be deleted until that user returns from his vacation and ends
his transaction.

--Barry

>
>
> Andreas
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>
>


---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From pgsql-hackers-owner+M9154@postgresql.org Mon May 21 16:45:07 2001
Return-path: <pgsql-hackers-owner+M9154@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LKj7Q24731
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 16:45:07 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LKhDA57568;
	Mon, 21 May 2001 16:43:13 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9154@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LKU6A52602
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 16:30:06 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 65246 invoked by uid 503); 21 May 2001 20:30:04 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 21 May 2001 20:30:04 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC6JW>; Mon, 21 May 2001 13:29:04 -0700
Message-ID: <3705826352029646A3E91C53F7189E3201663C@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Barry Lind'" <barry@xythos.com>,
   Zeugswetter Andreas SB
  <ZeugswetterA@wien.spardat.at>,
   pgsql-hackers@postgresql.org
Subject: RE: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 21 May 2001 13:29:03 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Correct me if I am wrong, but both cases do present a problem
> currently in 7.1.  The WAL log will not remove any WAL files
> for transactions that are still open (even after a checkpoint
> occurs). Thus if you do a bulk insert of gigabyte size you will
> require a gigabyte sized WAL directory. Also if you have a simple
> OLTP transaction that the user started and walked away from for
> his one week vacation, then no WAL log files can be deleted until
> that user returns from his vacation and ends his transaction.

Todo:

1. Compact log files after checkpoint (save records of uncommitted
   transactions and remove/archive others).
2. Abort long running transactions.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From pgsql-hackers-owner+M9155@postgresql.org Mon May 21 17:01:45 2001
Return-path: <pgsql-hackers-owner+M9155@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4LL1jQ27771
	for <pgman@candle.pha.pa.us>; Mon, 21 May 2001 17:01:45 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LKttA62549;
	Mon, 21 May 2001 16:55:55 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9155@postgresql.org)
Received: from smtp018.mail.yahoo.com (smtp018.mail.yahoo.com [216.136.174.115])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4LKcbA55747
	for <pgsql-hackers@postgresql.org>; Mon, 21 May 2001 16:38:37 -0400 (EDT)
	(envelope-from janwieck@yahoo.com)
Received: from jupiter.us.greatbridge.com (HELO jupiter.jw.home) (65.196.69.55)
  by smtp.mail.vip.sc5.yahoo.com with SMTP; 21 May 2001 20:38:35 -0000
X-Apparently-From: <janwieck@yahoo.com>
Received: (from janwieck@localhost)
	by jupiter.jw.home (8.9.3/8.9.3) id QAA15136;
	Mon, 21 May 2001 16:41:33 -0400
From: Jan Wieck <JanWieck@Yahoo.com>
Message-ID: <200105212041.QAA15136@jupiter.jw.home>
Subject: Re: AW: [HACKERS] Plans for solving the VACUUM problem
In-Reply-To: <3B09783C.1080508@xythos.com> from Barry Lind at "May 21, 2001 01:19:08
	pm"
To: Barry Lind <barry@xythos.com>
Date: Mon, 21 May 2001 16:41:33 -0400 (EDT)
cc: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
   pgsql-hackers@postgresql.org
X-Mailer: ELM [version 2.4ME+ PL68 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

Barry Lind wrote:
>
>
> Zeugswetter Andreas SB wrote:
>
> >>> Vadim, can you remind me what UNDO is used for?
> >>
> >> 4. Split pg_log into small files with ability to remove old ones (which
> >>    do not hold statuses for any running transactions).
> >
> >
> > They are already small (16Mb). Or do you mean even smaller ?
> > This imposes one huge risk, that is already a pain in other db's. You need
> > all logs of one transaction online. For a GigaByte transaction like a bulk
> > insert this can be very inconvenient.
> > Imho there should be some limit where you can choose whether you want
> > to continue without the feature (no savepoint) or are automatically aborted.
> >
> > In any case, imho some thought should be put into this :-)
> >
> > Another case where this is a problem is a client that starts a tx, does one little
> > insert or update on his private table, and then sits and waits for a day.
> >
> > Both cases currently impose no problem whatsoever.
>
> Correct me if I am wrong, but both cases do present a problem currently
> in 7.1.  The WAL log will not remove any WAL files for transactions that
> are still open (even after a checkpoint occurs).  Thus if you do a bulk
> insert of gigabyte size you will require a gigabyte sized WAL
> directory.  Also if you have a simple OLTP transaction that the user
> started and walked away from for his one week vacation, then no WAL log
> files can be deleted until that user returns from his vacation and ends
> his transaction.

    As  a  rule  of  thumb,  online  applications  that hold open
    transactions during user interaction  are  considered  to  be
    Broken  By  Design  (tm).   So I'd slap the programmer/design
    team with - let's use the server box since it doesn't contain
    anything useful.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From pgsql-hackers-owner+M9174@postgresql.org Tue May 22 04:34:56 2001
Return-path: <pgsql-hackers-owner+M9174@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4M8YuQ08718
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 04:34:56 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4M8VjA29342;
	Tue, 22 May 2001 04:31:45 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9174@postgresql.org)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4M8GWA21819
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 04:16:32 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4M8GSP18677
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 10:16:28 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGS0WK>; Tue, 22 May 2001 10:16:10 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682E8@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Barry Lind'" <barry@xythos.com>, Tom Lane <tgl@sss.pgh.pa.us>
cc: pgsql-hackers@postgresql.org
Subject: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 22 May 2001 10:16:10 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> REDO in oracle is done by something known as a 'rollback segment'.

You are not seriously saying that you like the "rollback segments" in Oracle.
They only cause trouble:
1. configuration (for every different workload you need a different config)
2. snapshot too old
3. tx abort because rollback segments are full
4. They use up huge amounts of space (e.g. 20 Gb rollback seg for a 120 Gb SAP)

If I read the papers correctly Version 9 gets rid of Point 1 but the rest ...

Andreas

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9206@postgresql.org Tue May 22 13:26:46 2001
Return-path: <pgsql-hackers-owner+M9206@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4MHQkQ08668
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 13:26:46 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4MHPBA80339;
	Tue, 22 May 2001 13:25:11 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9206@postgresql.org)
Received: from ns.sharemation.com (h-64-105-36-191.snvacaid.covad.net [64.105.36.191])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4MHD8A75168
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 13:13:09 -0400 (EDT)
	(envelope-from barry@xythos.com)
Received: from xythos.com ([192.168.254.19])
	by ns.sharemation.com (8.9.3/8.8.7) with ESMTP id IAA12961;
	Tue, 22 May 2001 08:57:47 -0700
Message-ID: <3B0A9DD6.7040502@xythos.com>
Date: Tue, 22 May 2001 10:11:50 -0700
From: Barry Lind <barry@xythos.com>
User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.16-22 i686; en-US; m18) Gecko/20010131 Netscape6/6.01
X-Accept-Language: en
MIME-Version: 1.0
To: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>
cc: pgsql-hackers@postgresql.org
Subject: Re: AW: [HACKERS] Plans for solving the VACUUM problem
References: <11C1E6749A55D411A9670001FA6879633682E8@sdexcsrv1.f000.d0188.sd.spardat.at>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

Actually I don't like the problems with rollback segments in oracle at
all.  I am just concerned that using WAL for UNDO will have all of the
same problems if it isn't designed carefully.  At least in oracle's
rollback segments there are multiple of them, in WAL there is just one,
thus you will potentially have that 20Gig all in your single log
directory.  People are already reporting the log directory growing to a
gig or more when running vacuum in 7.1.

Of the points you raised about oracle's rollback segment problems:

1. configuration (for every different workload you need a different config)

Postgres should be able to do a better job here.


2. snapshot too old

Shouldn't be a problem as long as postgres continues to use a non-overwriting storage manager.  However under an overwriting storage manager, you need to keep all of the changes in the UNDO records to satisfy the query snapshot, thus if you want to limit the size of UNDO you may need to kill long running queries.

3. tx abort because rollback segments are full
If you want to limit the size of the UNDO, then this is a corresponding
byproduct.  I believe a mail note was sent out yesterday suggesting that
limits like this be added to the todo list.

4. They use up huge amounts of space (e.g. 20 Gb rollback seg for a 120 Gb SAP)
You need to store the UNDO information somewhere.  And on active
databases that can amount to alot of information, especially for bulk
loads or massive updates.

thanks,
--Barry


Zeugswetter Andreas SB wrote:

>
>
>> REDO in oracle is done by something known as a 'rollback segment'.
>
>
> You are not seriously saying that you like the "rollback segments" in Oracle.
> They only cause trouble:
> 1. configuration (for every different workload you need a different config)
> 2. snapshot too old
> 3. tx abort because rollback segments are full
> 4. They use up huge amounts of space (e.g. 20 Gb rollback seg for a 120 Gb SAP)
>
> If I read the papers correctly Version 9 gets rid of Point 1 but the rest ...
>
> Andreas
>
>


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M9210@postgresql.org Tue May 22 14:56:54 2001
Return-path: <pgsql-hackers-owner+M9210@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4MIurQ18060
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 14:56:53 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4MItuA22196;
	Tue, 22 May 2001 14:55:56 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9210@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4MILiA06054
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 14:21:45 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 26740 invoked by uid 503); 22 May 2001 18:21:43 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 22 May 2001 18:21:43 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CC9XB>; Tue, 22 May 2001 11:20:40 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016642@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   "'Tom Lane'"
  <tgl@sss.pgh.pa.us>
cc: pgsql-hackers@postgresql.org
Subject: RE: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 22 May 2001 11:20:37 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > 1. Compact log files after checkpoint (save records of uncommitted
> >    transactions and remove/archive others).
>
> On the grounds that undo is not guaranteed anyway (concurrent
> heap access), why not simply forget it,

We can set flag in ItemData and register callback function in
buffer header regardless concurrent heap/index access. So buffer
will be cleaned before throwing it out from buffer pool
(little optimization: if at the time when pin drops to 0 buffer
is undirty then we shouldn't really clean it up to avoid unnecessary
write - we can save info in FSM that space is available and clean it
up on first pin/read).
So, only ability of *immediate reusing* is not guaranteed. But this is
general problem of space reusing till we assume that buffer pin is
enough to access data.

> since above sounds rather expensive ?

I'm not sure. For non-overwriting smgr required UNDO info is pretty
small because of we're not required to keep tuple data, unlike
Oracle & Co. We can even store UNDO info in non-WAL format to avoid
log record header overhead. UNDO files would be kind of Oracle rollback
segments but muuuuch smaller. But yeh - additional log reads.

> The downside would only be, that long running txn's cannot
> [easily] rollback to savepoint.

We should implement savepoints for all or none transactions, no?

> > 2. Abort long running transactions.
>
> This is imho "the" big downside of UNDO, and should not
> simply be put on the TODO without thorow research. I think it
> would be better to forget UNDO for long running transactions
> before aborting them.

Abort could be configurable.

Vadim

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From pgsql-hackers-owner+M9243@postgresql.org Wed May 23 04:12:50 2001
Return-path: <pgsql-hackers-owner+M9243@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4N8CoQ08950
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 04:12:50 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4N885A32881;
	Wed, 23 May 2001 04:08:05 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9243@postgresql.org)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4N7ZpA12878
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 03:35:54 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4N7ZZt24346
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 09:35:35 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGTHD8>; Wed, 23 May 2001 09:35:23 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682ED@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Mikheev, Vadim'" <vmikheev@SECTORBASE.COM>,
   "'Bruce Momjian'"
  <pgman@candle.pha.pa.us>
cc: The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Wed, 23 May 2001 09:35:22 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> If community will not like UNDO then I'll probably try to implement

Imho UNDO would be great under the following circumstances:
	1. The undo is only registered for some background work process
	    and not done in the client's backend (or only if it is a small txn).
	2. The same mechanism should also be used for outdated tuples
	    (the only difference beeing, that some tuples need to wait longer
	     because of an active xid)

The reasoning to not do it in the client's backend is not only that the client
does not need to wait, but that the nervous dba tends to kill them if after one hour
of forward work the backend seemingly does not respond anymore (because it is
busy with undo).

> dead space collector which will read log files and so on.

Which would then only be a possible implementation variant of above :-)
First step probably would be to separate the physical log to reduce WAL size.

> to implement logging for non-btree indices (anyway required for UNDO,
> WAL-based BAR, WAL-based space reusing).

Imho it would be great to implement a generic (albeit more expensive)
redo for all possible index types, that would be used in absence of a physical
redo for that particular index type (which is currently available for btree).

The prerequisites would be a physical log that saves the page before
modification. The redo could then be done (with the info from the heap tuple log record)
with the same index interface, that is used during normal operation.

Imho implementing a new index type is difficult enough as is without the need
to write a redo and undo.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From pgsql-hackers-owner+M9245@postgresql.org Wed May 23 04:41:13 2001
Return-path: <pgsql-hackers-owner+M9245@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4N8fDQ09762
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 04:41:13 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4N8ZZA45573;
	Wed, 23 May 2001 04:35:35 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9245@postgresql.org)
Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [213.150.1.74])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4N8QjA42040
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 04:26:45 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4N8Qa826319
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 10:26:36 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGTHY2>; Wed, 23 May 2001 10:26:28 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682EE@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Bruce Momjian'" <pgman@candle.pha.pa.us>,
   Hiroshi Inoue
  <Inoue@tpf.co.jp>
cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   The Hermit Hacker
  <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Wed, 23 May 2001 10:26:26 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > People also have referred to an overwriting smgr
> > easily. Please tell me how to introduce an overwriting smgr
> > without UNDO.

There is no way. Although undo for an overwriting smgr would involve a
very different approach than with non-overwriting. See Vadim's post about what
info suffices for undo in non overwriting smgr (file and ctid).

> I guess that is the question.  Are we heading for an overwriting storage
> manager?  I didn't see that in Vadim's list of UNDO advantages, but
> maybe that is his final goal.  If so UNDO may make sense, but then the
> question is how do we keep MVCC with an overwriting storage manager?
>
> The only way I can see doing it is to throw the old tuples into the WAL
> and have backends read through that for MVCC info.

If PostgreSQL wants to stay MVCC, then we should imho forget "overwriting smgr"
very fast.

Let me try to list the pros and cons that I can think of:
Pro:
	no index modification if key stays same
	no search for free space for update (if tuple still fits into page)
	no pg_log
Con:
	additional IO to write "before image" to rollback segment
		(every before image, not only first after checkpoint)
		(also before image of every index page that is updated !)
	need a rollback segment that imposes all sorts of contention problems
	active rollback, that needs to do a lot of undo work

Andreas

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From ZeugswetterA@wien.spardat.at Wed May 23 05:25:30 2001
Return-path: <ZeugswetterA@wien.spardat.at>
Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [213.150.1.74])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4N9PSQ11379
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 05:25:29 -0400 (EDT)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4N9PJ812775
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 11:25:19 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGT2QN>; Wed, 23 May 2001 11:25:12 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682F0@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Philip Warner'" <pjw@rhyme.com.au>,
   "Mikheev, Vadim"
  <vmikheev@SECTORBASE.COM>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>
cc: The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Wed, 23 May 2001 11:25:12 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR


> >If community will not like UNDO then I'll probably try to implement
> >dead space collector which will read log files and so on.
>
> I'd vote for UNDO; in terms of usability & friendliness it's a big win.

Could you please try it a little more verbose ? I am very interested in
the advantages you see in "UNDO for rollback only".

pg_log is a very big argument, but couldn't we try to change the format
to something that only stores ranges of aborted txn's in a btree like format ?
Now that we have WAL, that should be possible.

Andreas

From pjw@rhyme.com.au Wed May 23 06:45:18 2001
Return-path: <pjw@rhyme.com.au>
Received: from acheron.rime.com.au (albatr.lnk.telstra.net [139.130.54.222])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4NAjGQ27811
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 06:45:16 -0400 (EDT)
Received: from oberon ([203.8.195.100])
	by acheron.rime.com.au (8.11.2/8.11.2/SuSE Linux 8.11.1-0.5) with SMTP id f4NAhQK04805;
	Wed, 23 May 2001 20:43:42 +1000
Message-ID: <3.0.5.32.20010523204324.02bd14b0@mail.rhyme.com.au>
X-Sender: pjw@mail.rhyme.com.au
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Wed, 23 May 2001 20:43:24 +1000
To: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>,
   "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>
From: Philip Warner <pjw@rhyme.com.au>
Subject: Re: AW: [HACKERS] Plans for solving the VACUUM problem
cc: The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
In-Reply-To: <11C1E6749A55D411A9670001FA6879633682F0@sdexcsrv1.f000.d018
	8.sd.spardat.at>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: ORr

At 11:25 23/05/01 +0200, Zeugswetter Andreas SB wrote:
>
>> >If community will not like UNDO then I'll probably try to implement
>> >dead space collector which will read log files and so on.
>>
>> I'd vote for UNDO; in terms of usability & friendliness it's a big win.
>
>Could you please try it a little more verbose ? I am very interested in
>the advantages you see in "UNDO for rollback only".

I have not been paying strict attention to this thread, so it may have
wandered into a narrower band than I think we are in, but my understanding
is that UNDO is required for partial rollback in the case of failed
commands withing a single TX. Specifically,

- A simple typo in psql can currently cause a forced rollback of the entire
TX. UNDO should avoid this.

- It is not uncommon for application in other databases to handle errors
from the database (eg. missing FKs), and continue a TX.

- Similarly, when we get a new error reporting system, general constraint
(or other) failures should be able to be handled in one TX.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

From vmikheev@SECTORBASE.COM Thu May 24 14:07:24 2001
Return-path: <vmikheev@SECTORBASE.COM>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4OI7Nt20455
	for <pgman@candle.pha.pa.us>; Thu, 24 May 2001 14:07:23 -0400 (EDT)
Received: (qmail 98123 invoked by uid 503); 24 May 2001 18:07:18 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 24 May 2001 18:07:18 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDGNA>; Thu, 24 May 2001 11:06:08 -0700
Message-ID: <3705826352029646A3E91C53F7189E32016653@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>,
   "'Philip Warner'" <pjw@rhyme.com.au>,
   "'Bruce Momjian'"
  <pgman@candle.pha.pa.us>
cc: The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: RE: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Thu, 24 May 2001 11:06:08 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR

> > - A simple typo in psql can currently cause a forced
> > rollback of the entire TX. UNDO should avoid this.
>
> Yes, I forgot to mention this very big advantage, but undo is
> not the only possible way to implement savepoints. Solutions
> using CommandCounter have been discussed.

This would be hell.

Vadim

From ZeugswetterA@wien.spardat.at Fri May 25 03:44:30 2001
Return-path: <ZeugswetterA@wien.spardat.at>
Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [213.150.1.74])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4P7iTt10069
	for <pgman@candle.pha.pa.us>; Fri, 25 May 2001 03:44:29 -0400 (EDT)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4P7iM332208
	for <pgman@candle.pha.pa.us>; Fri, 25 May 2001 09:44:22 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGTQC3>; Fri, 25 May 2001 09:44:14 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682F3@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Mikheev, Vadim'" <vmikheev@SECTORBASE.COM>,
   "'Don Baccus'"
  <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue
  <Inoue@tpf.co.jp>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   The Hermit Hacker
  <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Fri, 25 May 2001 09:44:14 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Status: OR


> > >> Impractical ? Oracle does it.
> > >
> > >Oracle has MVCC?
> >
> > With restrictions, yes.
>
> What restrictions? Rollback segments size?

No, that is not the whole story. The problem with their "rollback segment approach" is,
that they do not guard against overwriting a tuple version in the rollback segment.
They simply recycle each segment in a wrap around manner.
Thus there could be an open transaction that still wanted to see a tuple version
that was already overwritten, leading to the feared "snapshot too old" error.

Copying their "rollback segment" approach is imho the last thing we want to do.

> Non-overwriting smgr can eat all disk space...
>
> > You didn't know that?  Vadim did ...
>
> Didn't I mention a few times that I was inspired by Oracle? -:)

Looking at what they supply in the feature area is imho good.
Copying their technical architecture is not so good in general.

Andreas

From pgsql-hackers-owner+M9354@postgresql.org Fri May 25 04:10:48 2001
Return-path: <pgsql-hackers-owner+M9354@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4P8Alt10967
	for <pgman@candle.pha.pa.us>; Fri, 25 May 2001 04:10:48 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4P89GA75319;
	Fri, 25 May 2001 04:09:16 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9354@postgresql.org)
Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [213.150.1.74])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4P7iSA59404
	for <pgsql-hackers@postgresql.org>; Fri, 25 May 2001 03:44:33 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4P7iM332218
	for <pgsql-hackers@postgresql.org>; Fri, 25 May 2001 09:44:22 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGTQC3>; Fri, 25 May 2001 09:44:14 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682F3@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Mikheev, Vadim'" <vmikheev@SECTORBASE.COM>,
   "'Don Baccus'"
  <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue
  <Inoue@tpf.co.jp>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   The Hermit Hacker
  <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Fri, 25 May 2001 09:44:14 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> > >> Impractical ? Oracle does it.
> > >
> > >Oracle has MVCC?
> >
> > With restrictions, yes.
>
> What restrictions? Rollback segments size?

No, that is not the whole story. The problem with their "rollback segment approach" is,
that they do not guard against overwriting a tuple version in the rollback segment.
They simply recycle each segment in a wrap around manner.
Thus there could be an open transaction that still wanted to see a tuple version
that was already overwritten, leading to the feared "snapshot too old" error.

Copying their "rollback segment" approach is imho the last thing we want to do.

> Non-overwriting smgr can eat all disk space...
>
> > You didn't know that?  Vadim did ...
>
> Didn't I mention a few times that I was inspired by Oracle? -:)

Looking at what they supply in the feature area is imho good.
Copying their technical architecture is not so good in general.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From pgsql-hackers-owner+M9402@postgresql.org Mon May 28 04:14:09 2001
Return-path: <pgsql-hackers-owner+M9402@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4S8E8a27526
	for <pgman@candle.pha.pa.us>; Mon, 28 May 2001 04:14:09 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4S8CtA91878;
	Mon, 28 May 2001 04:12:55 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9402@postgresql.org)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4S83fA87860
	for <pgsql-hackers@postgresql.org>; Mon, 28 May 2001 04:03:41 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4S83bj01495
	for <pgsql-hackers@postgresql.org>; Mon, 28 May 2001 10:03:37 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGTYF3>; Mon, 28 May 2001 10:02:24 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682F5@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Mikheev, Vadim'" <vmikheev@SECTORBASE.COM>,
   "'Hannu Krosing'"
  <hannu@tm.ee>
cc: "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 28 May 2001 10:02:17 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> > You mean it is restored in session that is running the transaction ?

Depends on what you mean with restored. It first reads the heap page,
sees that it needs an older version and thus reads it from the "rollback segment".

> >
> > I guess thet it could be slower than our current way of doing it.
>
> Yes, for older transactions which *really* need in *particular*
> old data, but not for newer ones. Look - now transactions have to read
> dead data again and again, even if some of them (newer) need not to see
> those data at all, and we keep dead data as long as required for other
> old transactions *just for the case* they will look there.
> But who knows?! Maybe those old transactions will not read from table
> with big amount of dead data at all! So - why keep dead data in datafiles
> for long time? This obviously affects overall system performance.

Yes, that is a good description. And old version is only required in the following
two cases:

1. the txn that modified this tuple is still open (reader in default committed read)
2. reader is in serializable transaction isolation and has earlier xtid

Seems overwrite smgr has mainly advantages in terms of speed for operations
other than rollback.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From pgsql-hackers-owner+M9403@postgresql.org Mon May 28 05:16:44 2001
Return-path: <pgsql-hackers-owner+M9403@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4S9Gia00375
	for <pgman@candle.pha.pa.us>; Mon, 28 May 2001 05:16:44 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4S9FbA17309;
	Mon, 28 May 2001 05:15:37 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9403@postgresql.org)
Received: from taru.tm.ee (taru.tm.ee [194.204.62.23])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4S956A13403
	for <pgsql-hackers@postgresql.org>; Mon, 28 May 2001 05:05:07 -0400 (EDT)
	(envelope-from hannu@tm.ee)
Received: from tm.ee (localhost.localdomain [127.0.0.1])
	by taru.tm.ee (8.11.2/8.11.2) with ESMTP id f4S9BkI10488;
	Mon, 28 May 2001 11:11:46 +0200
Message-ID: <3B121652.C4DCE1F6@tm.ee>
Date: Mon, 28 May 2001 11:11:46 +0200
From: Hannu Krosing <hannu@tm.ee>
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2-2 i686)
X-Accept-Language: et, en, ru
MIME-Version: 1.0
To: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>
cc: "'Mikheev, Vadim'" <vmikheev@SECTORBASE.COM>,
   "'Don Baccus'" <dhogaza@pacifier.com>, Tom Lane <tgl@sss.pgh.pa.us>,
   Hiroshi Inoue <Inoue@tpf.co.jp>, Bruce Momjian <pgman@candle.pha.pa.us>,
   The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
Subject: Re: AW: [HACKERS] Plans for solving the VACUUM problem
References: <11C1E6749A55D411A9670001FA6879633682F5@sdexcsrv1.f000.d0188.sd.spardat.at>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

Zeugswetter Andreas SB wrote:
>
> > > You mean it is restored in session that is running the transaction ?
>
> Depends on what you mean with restored. It first reads the heap page,
> sees that it needs an older version and thus reads it from the "rollback segment".

So are whole pages stored in rollback segments or just the modified data
?

Storing whole pages could be very wasteful for tables with small records
that
are often modified.

---------------
Hannu

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From vmikheev@sectorbase.com Mon May 28 13:15:16 2001
Return-path: <vmikheev@sectorbase.com>
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by candle.pha.pa.us (8.10.1/8.10.1) with SMTP id f4SHFFg01624
	for <pgman@candle.pha.pa.us>; Mon, 28 May 2001 13:15:15 -0400 (EDT)
Received: (qmail 22525 invoked by uid 503); 28 May 2001 17:15:10 -0000
Received: from din4.sectorbase.com (HELO dune) (63.88.121.74)
  by gate1.sectorbase.com with SMTP; 28 May 2001 17:15:10 -0000
Message-ID: <007c01c0e799$c54dd9c0$4a79583f@sectorbase.com>
From: "Vadim Mikheev" <vmikheev@sectorbase.com>
To: "Hannu Krosing" <hannu@tm.ee>,
   "Zeugswetter Andreas SB" <ZeugswetterA@wien.spardat.at>
cc: "'Don Baccus'" <dhogaza@pacifier.com>, "Tom Lane" <tgl@sss.pgh.pa.us>,
   "Hiroshi Inoue" <Inoue@tpf.co.jp>, "Bruce Momjian" <pgman@candle.pha.pa.us>,
   "The Hermit Hacker" <scrappy@hub.org>, <pgsql-hackers@postgresql.org>
References: <11C1E6749A55D411A9670001FA6879633682F5@sdexcsrv1.f000.d0188.sd.spardat.at> <3B121652.C4DCE1F6@tm.ee>
Subject: Re: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Mon, 28 May 2001 10:15:17 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Status: OR

> > > > You mean it is restored in session that is running the transaction ?
> >
> > Depends on what you mean with restored. It first reads the heap page,
> > sees that it needs an older version and thus reads it from the "rollback segment".
>
> So are whole pages stored in rollback segments or just the modified data?

This is implementation dependent. Storing whole pages is much easy to do,
but obviously it's better to store just modified data.

Vadim


From pgsql-hackers-owner+M9458@postgresql.org Tue May 29 13:49:27 2001
Return-path: <pgsql-hackers-owner+M9458@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4THnQ712093
	for <pgman@candle.pha.pa.us>; Tue, 29 May 2001 13:49:26 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4THmoE17911;
	Tue, 29 May 2001 13:48:50 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9458@postgresql.org)
Received: from mailer.sectorbase.com (mailer.sectorbase.com [63.88.121.2])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4THfTE14406
	for <pgsql-hackers@postgresql.org>; Tue, 29 May 2001 13:41:29 -0400 (EDT)
	(envelope-from vmikheev@SECTORBASE.COM)
Received: (qmail 33510 invoked by uid 503); 29 May 2001 17:41:25 -0000
Received: from sectorbase2.sectorbase.com (192.168.254.2)
  by gate1.sectorbase.com with SMTP; 29 May 2001 17:41:25 -0000
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
	id <LG3CDR6T>; Tue, 29 May 2001 10:39:59 -0700
Message-ID: <3705826352029646A3E91C53F7189E3201665C@sectorbase2.sectorbase.com>
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
To: "'Zeugswetter Andreas SB'" <ZeugswetterA@wien.spardat.at>
cc: "'pgsql-hackers@postgresql.org'" <pgsql-hackers@postgresql.org>
Subject: RE: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 29 May 2001 10:39:59 -0700
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> > > So are whole pages stored in rollback segments or just
> > > the modified data?
> >
> > This is implementation dependent. Storing whole pages is
> > much easy to do, but obviously it's better to store just
> > modified data.
>
> I am not sure it is necessarily better. Seems to be a tradeoff here.
> pros of whole pages:
> 	a possible merge with physical log (for first
>           modification of a page after checkpoint
> 		there would be no overhead compared to current
>           since it is already written now)

Using WAL as RS data storage is questionable.

> 	in a clever implementation a page already in the
>           "rollback segment" might satisfy the
> 		modification of another row on that page, and
>           thus would not need any additional io.

This would be possible only if there was no commit (same SCN)
between two modifications.

But, aren't we too deep on overwriting smgr (O-smgr) implementation?
It's doable. It has advantages in terms of IO active transactions
must do to follow MVCC. It has drawback in terms of required
disk space (and, oh yeh, it's not easy to implement -:)).
So, any other opinions about value of O-smgr?

Vadim

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From pgsql-hackers-owner+M9173@postgresql.org Tue May 22 04:27:18 2001
Return-path: <pgsql-hackers-owner+M9173@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4M8RHQ08426
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 04:27:18 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4M8P5A26187;
	Tue, 22 May 2001 04:25:05 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9173@postgresql.org)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4M7seA10340
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 03:54:40 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4M7saP09352
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 09:54:36 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGS0MG>; Tue, 22 May 2001 09:54:30 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682E6@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Barry Lind'" <barry@xythos.com>, pgsql-hackers@postgresql.org
Subject: AW: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 22 May 2001 09:54:21 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> Correct me if I am wrong, but both cases do present a problem currently
> in 7.1.  The WAL log will not remove any WAL files for transactions that
> are still open (even after a checkpoint occurs).  Thus if you do a bulk
> insert of gigabyte size you will require a gigabyte sized WAL
> directory.  Also if you have a simple OLTP transaction that the user
> started and walked away from for his one week vacation, then no WAL log
> files can be deleted until that user returns from his vacation and ends
> his transaction.

I am not sure, it might be so implemented. But there is no technical reason
to keep them beyond checkpoint without UNDO.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9170@postgresql.org Tue May 22 04:20:39 2001
Return-path: <pgsql-hackers-owner+M9170@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4M8KcQ08175
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 04:20:38 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4M8EnA20785;
	Tue, 22 May 2001 04:14:49 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9170@postgresql.org)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4M87AA17155
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 04:07:10 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4M876P14958
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 10:07:06 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGS0S9>; Tue, 22 May 2001 10:06:57 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682E7@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Jan Wieck'" <JanWieck@yahoo.com>, Barry Lind <barry@xythos.com>
cc: pgsql-hackers@postgresql.org
Subject: AW: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 22 May 2001 10:06:54 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


>     As  a  rule  of  thumb,  online  applications  that hold open
>     transactions during user interaction  are  considered  to  be
>     Broken  By  Design  (tm).   So I'd slap the programmer/design
>     team with - let's use the server box since it doesn't contain
>     anything useful.

We have a database system here, and not an OLTP helper app.
A database system must support all sorts of mixed usage from simple
OLTP to OLAP. Imho the usual separation on different servers gives more
headaches than are necessary.

Thus above statement can imho be true for one OLTP application, but not
for all applications on one db server.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

From pgsql-hackers-owner+M9175@postgresql.org Tue May 22 05:41:09 2001
Return-path: <pgsql-hackers-owner+M9175@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4M9f5Q12618
	for <pgman@candle.pha.pa.us>; Tue, 22 May 2001 05:41:05 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4M9d8A57688;
	Tue, 22 May 2001 05:39:08 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9175@postgresql.org)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4M9RZA52748
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 05:27:35 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4M9RUP15035
	for <pgsql-hackers@postgresql.org>; Tue, 22 May 2001 11:27:30 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGTASQ>; Tue, 22 May 2001 11:27:22 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682E9@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Mikheev, Vadim'" <vmikheev@SECTORBASE.COM>,
   "'Tom Lane'"
  <tgl@sss.pgh.pa.us>
cc: pgsql-hackers@postgresql.org
Subject: AW: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 22 May 2001 11:27:19 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Todo:
>
> 1. Compact log files after checkpoint (save records of uncommitted
>    transactions and remove/archive others).

On the grounds that undo is not guaranteed anyway (concurrent heap access),
why not simply forget it, since above sounds rather expensive ?
The downside would only be, that long running txn's cannot [easily] rollback
to savepoint.

> 2. Abort long running transactions.

This is imho "the" big downside of UNDO, and should not simply be put on
the TODO without thorow research. I think it would be better to forget UNDO for long
running transactions before aborting them.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From pgsql-hackers-owner+M9246@postgresql.org Wed May 23 04:58:55 2001
Return-path: <pgsql-hackers-owner+M9246@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4N8wsQ10317
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 04:58:54 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4N8s6A52669;
	Wed, 23 May 2001 04:54:06 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9246@postgresql.org)
Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [213.150.1.74])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4N8jbA49604
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 04:45:37 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4N8jO832710
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 10:45:24 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGT2AG>; Wed, 23 May 2001 10:45:17 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682EF@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Mikheev, Vadim'" <vmikheev@SECTORBASE.COM>,
   "'Tom Lane'"
  <tgl@sss.pgh.pa.us>
cc: pgsql-hackers@postgresql.org
Subject: AW: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Wed, 23 May 2001 10:45:17 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> > The downside would only be, that long running txn's cannot
> > [easily] rollback to savepoint.
>
> We should implement savepoints for all or none transactions, no?

We should not limit transaction size to online available disk space for WAL.
Imho that is much more important. With guaranteed undo we would need
diskspace for more than 2x new data size (+ at least space for 1x all modified
pages unless physical log is separated from WAL).

Imho a good design should involve only little more than 1x new data size.

>
> > > 2. Abort long running transactions.
> >
> > This is imho "the" big downside of UNDO, and should not
> > simply be put on the TODO without thorow research. I think it
> > would be better to forget UNDO for long running transactions
> > before aborting them.
>
> Abort could be configurable.

The point is, that you need to abort before WAL runs out of disk space
regardless of configuration.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From pgsql-hackers-owner+M9252@postgresql.org Wed May 23 07:17:03 2001
Return-path: <pgsql-hackers-owner+M9252@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4NBH2Q10577
	for <pgman@candle.pha.pa.us>; Wed, 23 May 2001 07:17:02 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4NBGIA12333;
	Wed, 23 May 2001 07:16:18 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9252@postgresql.org)
Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [213.150.1.74])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4NB8mA09095
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 07:08:48 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4NB8d806512
	for <pgsql-hackers@postgresql.org>; Wed, 23 May 2001 13:08:39 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGTJP5>; Wed, 23 May 2001 13:08:29 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682F2@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Philip Warner'" <pjw@rhyme.com.au>,
   "Mikheev, Vadim"
  <vmikheev@SECTORBASE.COM>,
   "'Bruce Momjian'" <pgman@candle.pha.pa.us>
cc: The Hermit Hacker <scrappy@hub.org>, Tom Lane <tgl@sss.pgh.pa.us>,
   pgsql-hackers@postgresql.org
Subject: AW: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Wed, 23 May 2001 13:08:28 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> - A simple typo in psql can currently cause a forced rollback of the entire
> TX. UNDO should avoid this.

Yes, I forgot to mention this very big advantage, but undo is not the only possible way
to implement savepoints. Solutions using CommandCounter have been discussed.
Although the pg_log mechanism would become more complex, a background
"vacuum-like" process could put highest priority on removing such rolled back parts
of transactions.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From pgsql-hackers-owner+M9443@postgresql.org Tue May 29 04:21:45 2001
Return-path: <pgsql-hackers-owner+M9443@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4T8Li727196
	for <pgman@candle.pha.pa.us>; Tue, 29 May 2001 04:21:44 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4T8KTE67371;
	Tue, 29 May 2001 04:20:30 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9443@postgresql.org)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4T7ZKE41255
	for <pgsql-hackers@postgresql.org>; Tue, 29 May 2001 03:35:23 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4T7ZG921237
	for <pgsql-hackers@postgresql.org>; Tue, 29 May 2001 09:35:16 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <K5WGT047>; Tue, 29 May 2001 09:35:06 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682F8@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Vadim Mikheev'" <vmikheev@sectorbase.com>
cc: "'pgsql-hackers@postgresql.org'" <pgsql-hackers@postgresql.org>
Subject: AW: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Tue, 29 May 2001 09:35:01 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> > > > > You mean it is restored in session that is running the transaction ?
> > >
> > > Depends on what you mean with restored. It first reads the heap page,
> > > sees that it needs an older version and thus reads it from the "rollback segment".
> >
> > So are whole pages stored in rollback segments or just the modified data?
>
> This is implementation dependent. Storing whole pages is much easy to do,
> but obviously it's better to store just modified data.

I am not sure it is necessarily better. Seems to be a tradeoff here.
pros of whole pages:
	a possible merge with physical log (for first modification of a page after checkpoint
		there would be no overhead compared to current since it is already written now)
	in a clever implementation a page already in the "rollback segment" might satisfy the
		modification of another row on that page, and thus would not need any additional io.

Andreas

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

From pgsql-hackers-owner+M9473@postgresql.org Wed May 30 06:30:34 2001
Return-path: <pgsql-hackers-owner+M9473@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4UAUX715594
	for <pgman@candle.pha.pa.us>; Wed, 30 May 2001 06:30:33 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.1) with SMTP id f4UATiE98735;
	Wed, 30 May 2001 06:29:44 -0400 (EDT)
	(envelope-from pgsql-hackers-owner+M9473@postgresql.org)
Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
	by postgresql.org (8.11.3/8.11.1) with ESMTP id f4UAIRE94342
	for <pgsql-hackers@postgresql.org>; Wed, 30 May 2001 06:18:28 -0400 (EDT)
	(envelope-from ZeugswetterA@wien.spardat.at)
Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
	by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f4UAIKQ15061
	for <pgsql-hackers@postgresql.org>; Wed, 30 May 2001 12:18:20 +0200
Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
	id <L6WJQW79>; Wed, 30 May 2001 12:18:12 +0200
Message-ID: <11C1E6749A55D411A9670001FA6879633682FE@sdexcsrv1.f000.d0188.sd.spardat.at>
From: Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at>
To: "'Mikheev, Vadim'" <vmikheev@SECTORBASE.COM>
cc: "'pgsql-hackers@postgresql.org'" <pgsql-hackers@postgresql.org>
Subject: AW: AW: [HACKERS] Plans for solving the VACUUM problem
Date: Wed, 30 May 2001 12:18:07 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


> > > > So are whole pages stored in rollback segments or just
> > > > the modified data?
> > >
> > > This is implementation dependent. Storing whole pages is
> > > much easy to do, but obviously it's better to store just
> > > modified data.
> >
> > I am not sure it is necessarily better. Seems to be a tradeoff here.
> > pros of whole pages:
> > 	a possible merge with physical log (for first
> >           modification of a page after checkpoint
> > 		there would be no overhead compared to current
> >           since it is already written now)
>
> Using WAL as RS data storage is questionable.

No, I meant the other way around. Move the physical log pages away from WAL
files to the "rollback segment" (imho "snapshot area" would be a better name)

> > 	in a clever implementation a page already in the
> >           "rollback segment" might satisfy the
> > 		modification of another row on that page, and
> >           thus would not need any additional io.
>
> This would be possible only if there was no commit (same SCN)
> between two modifications.

I don't think someone else's commit matters unless it touches the same page.
In that case a reader would possibly need to chain back to an older version
inside the snapshot area, and then it gets complicated even in the whole page
case. A good concept could probably involve both whole page and change
only, and let the optimizer decide what to do.

> But, aren't we too deep on overwriting smgr (O-smgr) implementation?

Yes, but some understanding of the possibilities needs to be sorted out
to allow good decicsions, no ?

Andreas

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly