From owner-pgsql-hackers@hub.org Fri Nov 13 13:24:37 1998
Received: from hub.org (majordom@hub.org [209.47.148.200])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA13457
	for <maillist@candle.pha.pa.us>; Fri, 13 Nov 1998 13:24:35 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.1/8.9.1) with SMTP id NAA02464;
	Fri, 13 Nov 1998 13:22:52 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 13 Nov 1998 13:21:14 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.1/8.9.1) id NAA02331
	for pgsql-hackers-outgoing; Fri, 13 Nov 1998 13:21:12 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [53.2.131.8])
	by hub.org (8.9.1/8.9.1) with SMTP id NAA02316
	for <pgsql-hackers@postgreSQL.org>; Fri, 13 Nov 1998 13:21:06 -0500 (EST)
	(envelope-from wieck@sapserv.debis.de)
Received: by orion.SAPserv.Hamburg.dsh.de 
	for pgsql-hackers@postgreSQL.org 
	id m0zeOEf-000EBPC; Fri, 13 Nov 98 19:46 MET
Message-Id: <m0zeOEf-000EBPC@orion.SAPserv.Hamburg.dsh.de>
From: jwieck@debis.com (Jan Wieck)
Subject: [HACKERS] shmem limits and redolog
To: pgsql-hackers@postgreSQL.org (PostgreSQL HACKERS)
Date: Fri, 13 Nov 1998 19:46:20 +0100 (MET)
Reply-To: jwieck@debis.com (Jan Wieck)
X-Mailer: ELM [version 2.4 PL25]
Content-Type: text
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: ROr

Hi,

    I'm  currently  hacking  around on a solution for logging all
    database operations at query level that can recover a crashed
    database  from  the last successful backup by redoing all the
    commands.

    Well, I wanted it to be as flexible as can. So I  decided  to
    make  it  per  database  configurable.  One  could  say which
    databases are logged and if a database is, if  it  is  logged
    sync  or async (in sync mode, every COMMIT forces an fsync of
    the actual logfile and controlfiles).

    To make async mode as fast as can, I'm using a shared  memory
    of  32K per database (not per backend) that is used as a wrap
    around  buffer  from  the  backends  to  place  their   query
    information.  So  the  log writer can fall a little behind if
    there are many backends doing  different  things  that  don't
    lock each other.

    Now  I'm  a  little  in  doubt about the shared memory limits
    reported.  Was it a good decision to use shared memory? Am  I
    better off using socket's?

    The  bad  thing  in  what  I  have  up  to now (it's far from
    complete) is, that even if a database isn't currently logged,
    a redolog writer is started and creates the 32K shmem segment
    (plus a semaphore set with 5 semaphores). This is  because  I
    plan to create commands like

        ALTER DATABASE LOG MODE=ASYNC LOGDIR='/somewhere/dbname';

    and the like that can be used at runtime (while more than one
    backend is connected to the database) to turn logging on/off,
    switch  to/from  backup  mode (all other activity is stopped)
    etc.

    So every 32 databases will require another megabyte of shared
    memory.  The  logging  master  controls  which databases have
    activity  and  kills  redolog  writers  after  some  time  of
    inactivity,  and  the shmem is freed then. But it can hurt if
    someone really has many many databases that are all  used  at
    the same time.

    What do the others say?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #


From owner-pgsql-hackers@hub.org Wed Dec 16 15:46:41 1998
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA00521
	for <maillist@candle.pha.pa.us>; Wed, 16 Dec 1998 15:46:40 -0500 (EST)
Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id PAA08772 for <maillist@candle.pha.pa.us>; Wed, 16 Dec 1998 15:10:01 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.1/8.9.1) with SMTP id PAA01254;
	Wed, 16 Dec 1998 15:06:56 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 16 Dec 1998 14:58:11 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.1/8.9.1) id OAA00660
	for pgsql-hackers-outgoing; Wed, 16 Dec 1998 14:58:10 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [53.2.131.8])
	by hub.org (8.9.1/8.9.1) with SMTP id OAA00643
	for <pgsql-hackers@postgreSQL.org>; Wed, 16 Dec 1998 14:58:05 -0500 (EST)
	(envelope-from wieck@sapserv.debis.de)
Received: by orion.SAPserv.Hamburg.dsh.de 
	for pgsql-hackers@postgreSQL.org 
	id m0zqNDo-000EBTC; Wed, 16 Dec 98 21:07 MET
Message-Id: <m0zqNDo-000EBTC@orion.SAPserv.Hamburg.dsh.de>
From: jwieck@debis.com (Jan Wieck)
Subject: Re: [HACKERS] redolog - for discussion
To: vadim@krs.ru (Vadim Mikheev)
Date: Wed, 16 Dec 1998 21:07:00 +0100 (MET)
Cc: jwieck@debis.com, pgsql-hackers@postgreSQL.org
Reply-To: jwieck@debis.com (Jan Wieck)
In-Reply-To: <3677B71D.C67462B3@krs.ru> from "Vadim Mikheev" at Dec 16, 98 08:35:25 pm
X-Mailer: ELM [version 2.4 PL25]
Content-Type: text
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: RO

Vadim wrote:

>
> Jan Wieck wrote:
> >
> >     RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> >
> ...
> >
> >         For  the  others, the backend starts the recovery program
> >         which  reads  the  redolog  files,  establishes  database
> >         connections  as  required  and reruns all the commands in
>                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
> >         them. If a required logfile isn't  found,  it  tells  the
>           ^^^^^
>
> I foresee problems with using _commands_ logging for
> recovery/replication -:((
>
> Let's consider two concurrent updates in READ COMMITTED mode:
>
> update test set x = 2 where y = 1;
>
>    and
>
> update test set x = 3 where y = 1;
>
> The result of both committed transaction will be x = 2
> if the 1st transaction updated row _after_ 2nd transaction
> and x = 3 if the 2nd transaction gets row after 1st one.
> Order of updates is not defined by order in which commands
> begun and so order in which commands should be rerun
> will be unknown...

    Yepp,  the order in which commands begun is absolutely not of
    interest. Locking could already delay the  execution  of  one
    command  until  another  one  started  later has finished and
    released the lock.  It's a classic race condition.

    Thus, my plan was to log the queries just before the call  to
    CommitTransactionCommand()  in  tcop. This has the advantage,
    that queries which bail out with errors don't  get  into  the
    log  at  all  and  must not get rerun. And I can set a static
    flag to false before starting the command, which  is  set  to
    true  in  the buffer manager when a buffer is written (marked
    dirty), so filtering out queries that do no updates at all is
    easy.

    Unfortunately  query  level  logging get's hit by the current
    implementation of sequence numbers. If  a  query  that  get's
    aborted  somewhere  in the middle (maybe by a trigger) called
    nextval() for rows processed  earlier,  the  sequence  number
    isn't  advanced  at  recovery  time,  because  the  query  is
    suppressed at all.   And  sequences  aren't  locked,  so  for
    concurrently  running  queries  getting numbers from the same
    sequence,  the  results   aren't   reproduceable.   If   some
    application  selects  a  value  resulting from a sequence and
    uses that later in another query, how could the redolog  know
    that  this has changed? It's a Const in the query logged, and
    all that corrupts the whole thing.

    All that is painful and I don't see another solution yet than
    to  hook  into  nextval(),  log  out the numbers generated in
    normal operation and getting back the same  numbers  in  redo
    mode.

    The whole thing gets more and more complicated :-(


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #


From owner-pgsql-hackers@hub.org Wed Jun 16 09:29:31 1999
Received: from hub.org (hub.org [209.167.229.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA22504
	for <maillist@candle.pha.pa.us>; Wed, 16 Jun 1999 09:29:29 -0400 (EDT)
Received: from hub.org (hub.org [209.167.229.1])
	by hub.org (8.9.3/8.9.3) with ESMTP id JAA02132;
	Wed, 16 Jun 1999 09:18:20 -0400 (EDT)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 16 Jun 1999 09:14:07 +0000 (EDT)
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id JAA01318
	for pgsql-hackers-outgoing; Wed, 16 Jun 1999 09:14:06 -0400 (EDT)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
X-Authentication-Warning: hub.org: majordom set sender to owner-pgsql-hackers@postgreSQL.org using -f
Received: from sunpine.krs.ru (SunPine.krs.ru [195.161.16.37])
	by hub.org (8.9.3/8.9.3) with ESMTP id JAA01278
	for <hackers@postgreSQL.org>; Wed, 16 Jun 1999 09:13:48 -0400 (EDT)
	(envelope-from vadim@krs.ru)
Received: from krs.ru (dune.krs.ru [195.161.16.38])
	by sunpine.krs.ru (8.8.8/8.8.8) with ESMTP id VAA06276
	for <hackers@postgreSQL.org>; Wed, 16 Jun 1999 21:12:49 +0800 (KRSS)
Message-ID: <3767A2CF.E6E4A5F9@krs.ru>
Date: Wed, 16 Jun 1999 21:12:47 +0800
From: Vadim Mikheev <vadim@krs.ru>
Organization: OJSC Rostelecom (Krasnoyarsk)
X-Mailer: Mozilla 4.5 [en] (X11; I; FreeBSD 3.0-RELEASE i386)
X-Accept-Language: ru, en
MIME-Version: 1.0
To: PostgreSQL Developers List <hackers@postgreSQL.org>
Subject: [HACKERS] Savepoints...
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: ROr

To have them I need to add tuple id (6 bytes) to heap tuple
header. Are there objections? Though it's not good to increase 
tuple header size, subj is, imho, very nice feature...

Implementation is , hm, "easy":

- heap_insert/heap_delete/heap_replace/heap_mark4update will
  remember updated tid (and current command id) in relation cache
  and store previously updated tid (remembered in relation cache)
  in additional heap header tid;
- lmgr will remember command id when lock was acquired;
- for a savepoint we will just store command id when
  the savepoint was setted;
- when going to sleep due to concurrent the-same-row update,
  backend will store MyProc and tuple id in shmem hash table.

When rolling back to a savepoint, backend will:

- release locks acquired after savepoint;
- for a relation updated after savepoint, get last updated tid 
  from relation cache, walk through relation, set 
  HEAP_XMIN_INVALID/HEAP_XMAX_INVALID in all tuples updated 
  after savepoint and wake up concurrent writers blocked
  on these tuples (using shmem hash table mentioned above).

The last feature (waking up of concurrent writers) is most hard
part to implement. AFAIK, Oracle 7.3 was not able to do it.
Can someone comment is this feature implemented in Oracle 8.X,
other DBMSes?

Now about implicit savepoints. Backend will place them before
user statements execution. In the case of failure, transaction
state will be rolled back to the one before execution of query.
As side-effect, this means that we'll get rid of complaints
about entire transaction abort in the case of mistyping
causing abort due to parser errors...

Comments?

Vadim