From owner-pgsql-hackers@hub.org Fri Nov 13 13:24:37 1998
Received: from hub.org (majordom@hub.org [209.47.148.200])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA13457
	for <maillist@candle.pha.pa.us>; Fri, 13 Nov 1998 13:24:35 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.1/8.9.1) with SMTP id NAA02464;
	Fri, 13 Nov 1998 13:22:52 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 13 Nov 1998 13:21:14 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.1/8.9.1) id NAA02331
	for pgsql-hackers-outgoing; Fri, 13 Nov 1998 13:21:12 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [53.2.131.8])
	by hub.org (8.9.1/8.9.1) with SMTP id NAA02316
	for <pgsql-hackers@postgreSQL.org>; Fri, 13 Nov 1998 13:21:06 -0500 (EST)
	(envelope-from wieck@sapserv.debis.de)
Received: by orion.SAPserv.Hamburg.dsh.de 
	for pgsql-hackers@postgreSQL.org 
	id m0zeOEf-000EBPC; Fri, 13 Nov 98 19:46 MET
Message-Id: <m0zeOEf-000EBPC@orion.SAPserv.Hamburg.dsh.de>
From: jwieck@debis.com (Jan Wieck)
Subject: [HACKERS] shmem limits and redolog
To: pgsql-hackers@postgreSQL.org (PostgreSQL HACKERS)
Date: Fri, 13 Nov 1998 19:46:20 +0100 (MET)
Reply-To: jwieck@debis.com (Jan Wieck)
X-Mailer: ELM [version 2.4 PL25]
Content-Type: text
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: ROr

Hi,

    I'm  currently  hacking  around on a solution for logging all
    database operations at query level that can recover a crashed
    database  from  the last successful backup by redoing all the
    commands.

    Well, I wanted it to be as flexible as can. So I  decided  to
    make  it  per  database  configurable.  One  could  say which
    databases are logged and if a database is, if  it  is  logged
    sync  or async (in sync mode, every COMMIT forces an fsync of
    the actual logfile and controlfiles).

    To make async mode as fast as can, I'm using a shared  memory
    of  32K per database (not per backend) that is used as a wrap
    around  buffer  from  the  backends  to  place  their   query
    information.  So  the  log writer can fall a little behind if
    there are many backends doing  different  things  that  don't
    lock each other.

    Now  I'm  a  little  in  doubt about the shared memory limits
    reported.  Was it a good decision to use shared memory? Am  I
    better off using socket's?

    The  bad  thing  in  what  I  have  up  to now (it's far from
    complete) is, that even if a database isn't currently logged,
    a redolog writer is started and creates the 32K shmem segment
    (plus a semaphore set with 5 semaphores). This is  because  I
    plan to create commands like

        ALTER DATABASE LOG MODE=ASYNC LOGDIR='/somewhere/dbname';

    and the like that can be used at runtime (while more than one
    backend is connected to the database) to turn logging on/off,
    switch  to/from  backup  mode (all other activity is stopped)
    etc.

    So every 32 databases will require another megabyte of shared
    memory.  The  logging  master  controls  which databases have
    activity  and  kills  redolog  writers  after  some  time  of
    inactivity,  and  the shmem is freed then. But it can hurt if
    someone really has many many databases that are all  used  at
    the same time.

    What do the others say?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #


From owner-pgsql-hackers@hub.org Wed Dec 16 15:46:41 1998
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA00521
	for <maillist@candle.pha.pa.us>; Wed, 16 Dec 1998 15:46:40 -0500 (EST)
Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$Revision: 1.3 $) with ESMTP id PAA08772 for <maillist@candle.pha.pa.us>; Wed, 16 Dec 1998 15:10:01 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.1/8.9.1) with SMTP id PAA01254;
	Wed, 16 Dec 1998 15:06:56 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 16 Dec 1998 14:58:11 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.1/8.9.1) id OAA00660
	for pgsql-hackers-outgoing; Wed, 16 Dec 1998 14:58:10 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [53.2.131.8])
	by hub.org (8.9.1/8.9.1) with SMTP id OAA00643
	for <pgsql-hackers@postgreSQL.org>; Wed, 16 Dec 1998 14:58:05 -0500 (EST)
	(envelope-from wieck@sapserv.debis.de)
Received: by orion.SAPserv.Hamburg.dsh.de 
	for pgsql-hackers@postgreSQL.org 
	id m0zqNDo-000EBTC; Wed, 16 Dec 98 21:07 MET
Message-Id: <m0zqNDo-000EBTC@orion.SAPserv.Hamburg.dsh.de>
From: jwieck@debis.com (Jan Wieck)
Subject: Re: [HACKERS] redolog - for discussion
To: vadim@krs.ru (Vadim Mikheev)
Date: Wed, 16 Dec 1998 21:07:00 +0100 (MET)
Cc: jwieck@debis.com, pgsql-hackers@postgreSQL.org
Reply-To: jwieck@debis.com (Jan Wieck)
In-Reply-To: <3677B71D.C67462B3@krs.ru> from "Vadim Mikheev" at Dec 16, 98 08:35:25 pm
X-Mailer: ELM [version 2.4 PL25]
Content-Type: text
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: RO

Vadim wrote:

>
> Jan Wieck wrote:
> >
> >     RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> >
> ...
> >
> >         For  the  others, the backend starts the recovery program
> >         which  reads  the  redolog  files,  establishes  database
> >         connections  as  required  and reruns all the commands in
>                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
> >         them. If a required logfile isn't  found,  it  tells  the
>           ^^^^^
>
> I foresee problems with using _commands_ logging for
> recovery/replication -:((
>
> Let's consider two concurrent updates in READ COMMITTED mode:
>
> update test set x = 2 where y = 1;
>
>    and
>
> update test set x = 3 where y = 1;
>
> The result of both committed transaction will be x = 2
> if the 1st transaction updated row _after_ 2nd transaction
> and x = 3 if the 2nd transaction gets row after 1st one.
> Order of updates is not defined by order in which commands
> begun and so order in which commands should be rerun
> will be unknown...

    Yepp,  the order in which commands begun is absolutely not of
    interest. Locking could already delay the  execution  of  one
    command  until  another  one  started  later has finished and
    released the lock.  It's a classic race condition.

    Thus, my plan was to log the queries just before the call  to
    CommitTransactionCommand()  in  tcop. This has the advantage,
    that queries which bail out with errors don't  get  into  the
    log  at  all  and  must not get rerun. And I can set a static
    flag to false before starting the command, which  is  set  to
    true  in  the buffer manager when a buffer is written (marked
    dirty), so filtering out queries that do no updates at all is
    easy.

    Unfortunately  query  level  logging get's hit by the current
    implementation of sequence numbers. If  a  query  that  get's
    aborted  somewhere  in the middle (maybe by a trigger) called
    nextval() for rows processed  earlier,  the  sequence  number
    isn't  advanced  at  recovery  time,  because  the  query  is
    suppressed at all.   And  sequences  aren't  locked,  so  for
    concurrently  running  queries  getting numbers from the same
    sequence,  the  results   aren't   reproduceable.   If   some
    application  selects  a  value  resulting from a sequence and
    uses that later in another query, how could the redolog  know
    that  this has changed? It's a Const in the query logged, and
    all that corrupts the whole thing.

    All that is painful and I don't see another solution yet than
    to  hook  into  nextval(),  log  out the numbers generated in
    normal operation and getting back the same  numbers  in  redo
    mode.

    The whole thing gets more and more complicated :-(


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #