From owner-pgsql-hackers@hub.org Fri Nov 13 13:24:37 1998
From: jwieck@debis.com (Jan Wieck)
Subject: [HACKERS] shmem limits and redolog
To: pgsql-hackers@postgreSQL.org (PostgreSQL HACKERS)
Date: Fri, 13 Nov 1998 19:46:20 +0100 (MET)
Reply-To: jwieck@debis.com (Jan Wieck)

Hi,

I'm currently hacking around on a solution for logging all database operations at query level, so that a crashed database can be recovered from the last successful backup by redoing all the commands.

Well, I wanted it to be as flexible as possible, so I decided to make it configurable per database. One could say which databases are logged and, for a logged database, whether it is logged sync or async (in sync mode, every COMMIT forces an fsync of the actual logfile and controlfiles).
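The difference between the two modes comes down to whether COMMIT waits for the disk. Below is a minimal sketch, assuming a hypothetical `RedologMode` setting, one query per log line, and a stdio `FILE` standing in for the logfile; the controlfile fsync and the shared-memory hand-off to a separate log writer are left out.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() and fsync() */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical per-database setting: not logged, logged async, logged sync. */
typedef enum { REDOLOG_OFF, REDOLOG_ASYNC, REDOLOG_SYNC } RedologMode;

/* Append one query to the redolog at COMMIT time.
 * Returns 0 on success, -1 if writing or syncing failed. */
static int redolog_commit(FILE *log, RedologMode mode, const char *query)
{
    /* One query per line, so a query must not contain a newline. */
    assert(log != NULL && query != NULL && strchr(query, '\n') == NULL);

    if (mode == REDOLOG_OFF)
        return 0;                               /* database not logged */

    if (fputs(query, log) == EOF || fputc('\n', log) == EOF)
        return -1;

    /* Sync mode: COMMIT does not return until the record is on disk.
     * Async mode: the record may still sit in the stdio buffer, so a
     * crash can lose the most recent commits. */
    if (mode == REDOLOG_SYNC && (fflush(log) != 0 || fsync(fileno(log)) != 0))
        return -1;
    return 0;
}
```

Sync mode pays one `fsync()` per COMMIT for durability; async mode returns immediately and leaves flushing to later, which is the window Jan's shared-memory buffer and log writer are meant to exploit.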
To make async mode as fast as possible, I'm using 32K of shared memory per database (not per backend) as a wrap-around buffer in which the backends place their query information. So the log writer can fall a little behind if there are many backends doing different things that don't lock each other.

Now I'm having some doubts, given the shared memory limits that have been reported. Was it a good decision to use shared memory? Am I better off using sockets?

The bad thing about what I have so far (it's far from complete) is that even if a database isn't currently logged, a redolog writer is started and creates the 32K shmem segment (plus a semaphore set with 5 semaphores). This is because I plan to create commands like

    ALTER DATABASE LOG MODE=ASYNC LOGDIR='/somewhere/dbname';

and the like, which can be used at runtime (while more than one backend is connected to the database) to turn logging on/off, switch to/from backup mode (all other activity is stopped), etc.

So every 32 databases (32 x 32K) will require another megabyte of shared memory. The logging master keeps track of which databases have activity and kills redolog writers after some time of inactivity, and their shmem is freed then. But it can hurt if someone really has many, many databases that are all in use at the same time.

What do the others say?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

From owner-pgsql-hackers@hub.org Wed Dec 16 15:46:41 1998
From: jwieck@debis.com (Jan Wieck)
Subject: Re: [HACKERS] redolog - for discussion
To: vadim@krs.ru (Vadim Mikheev)
Date: Wed, 16 Dec 1998 21:07:00 +0100 (MET)
Cc: jwieck@debis.com, pgsql-hackers@postgreSQL.org
Reply-To: jwieck@debis.com (Jan Wieck)
In-Reply-To: <3677B71D.C67462B3@krs.ru> from "Vadim Mikheev" at Dec 16, 98 08:35:25 pm

Vadim wrote:

>
> Jan Wieck wrote:
> >
> > RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> >
> ...
> >
> > For the others, the backend starts the recovery program
> > which reads the redolog files, establishes database
> > connections as required and reruns all the commands in
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > them. If a required logfile isn't found, it tells the
    ^^^^^
>
> I foresee problems with using _commands_ logging for
> recovery/replication -:((
>
> Let's consider two concurrent updates in READ COMMITTED mode:
>
> update test set x = 2 where y = 1;
>
> and
>
> update test set x = 3 where y = 1;
>
> The result of both committed transaction will be x = 2
> if the 1st transaction updated row _after_ 2nd transaction
> and x = 3 if the 2nd transaction gets row after 1st one.
> Order of updates is not defined by order in which commands
> begun and so order in which commands should be rerun
> will be unknown...

Yep, the order in which commands began is of no interest at all. Locking can already delay the execution of one command until another one, started later, has finished and released the lock. It's a classic race condition.

Thus, my plan was to log the queries just before the call to CommitTransactionCommand() in tcop. This has the advantage that queries which bail out with errors don't get into the log at all and need not be rerun. I can also set a static flag to false before starting the command; the buffer manager sets it to true when a buffer is written (marked dirty), so filtering out queries that do no updates at all is easy.

Unfortunately, query-level logging gets hit by the current implementation of sequence numbers. If a query that gets aborted somewhere in the middle (maybe by a trigger) called nextval() for rows processed earlier, the sequence number isn't advanced at recovery time, because the query is suppressed entirely. And sequences aren't locked, so for concurrently running queries getting numbers from the same sequence, the results aren't reproducible.
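The commit-time filtering Jan describes (log just before CommitTransactionCommand(), and only if the command dirtied a buffer) can be sketched as follows. The hooks `start_command()`, `mark_buffer_dirty()` and `log_query_at_commit()` are hypothetical stand-ins, not the actual tcop or buffer manager functions, and an in-memory array replaces the real redolog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REDOLOG_CAPACITY 8

/* The per-backend "static flag" from the mail. */
static bool command_dirtied_buffer = false;

/* In-memory stand-in for the redolog; queries are stored by pointer,
 * so the sketch assumes the query strings outlive the log. */
static const char *redolog[REDOLOG_CAPACITY];
static size_t redolog_len = 0;

/* Hypothetical hook: reset the flag before a command starts. */
static void start_command(void)
{
    command_dirtied_buffer = false;
}

/* Hypothetical hook: the buffer manager calls this when it marks a buffer dirty. */
static void mark_buffer_dirty(void)
{
    command_dirtied_buffer = true;
}

/* Hypothetical hook run just before CommitTransactionCommand():
 * commands that dirtied no buffer are skipped, as there is nothing to redo. */
static void log_query_at_commit(const char *query)
{
    if (!command_dirtied_buffer)
        return;
    assert(redolog_len < REDOLOG_CAPACITY);
    redolog[redolog_len++] = query;
}

/* Simulates one command; 'fails' models an error that aborts the command
 * before it reaches the commit hook. */
static void run_command(const char *query, bool dirties_buffer, bool fails)
{
    start_command();
    if (dirties_buffer)
        mark_buffer_dirty();
    if (fails)
        return;                 /* error path: the commit hook never runs */
    log_query_at_commit(query);
}
```

A command that errors out never reaches the commit hook, so it is not logged even if it dirtied buffers; the stale flag is simply reset when the next command starts. Note that this filter does nothing about the nextval() problem, since an aborted command's sequence calls are dropped along with it.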
If some application selects a value resulting from a sequence and uses it later in another query, how could the redolog know that this value has changed? It's a Const in the logged query, and that corrupts the whole thing.

All that is painful, and I don't see any solution yet other than to hook into nextval(), log the numbers generated in normal operation, and hand back the same numbers in redo mode.

The whole thing gets more and more complicated :-(

Jan

From owner-pgsql-hackers@hub.org Wed Jun 16 09:29:31 1999
Message-ID: <3767A2CF.E6E4A5F9@krs.ru>
Date: Wed, 16 Jun 1999 21:12:47 +0800
From: Vadim Mikheev
Organization: OJSC Rostelecom (Krasnoyarsk)
To: PostgreSQL Developers List
Subject: [HACKERS] Savepoints...

To have them, I need to add a tuple id (6 bytes) to the heap tuple header. Are there objections? Though it's not good to increase the tuple header size, the subject (savepoints) is, imho, a very nice feature...

The implementation is, hm, "easy":

- heap_insert/heap_delete/heap_replace/heap_mark4update will remember the updated tid (and the current command id) in the relation cache, and store the previously updated tid (remembered in the relation cache) in the additional heap header tid;
- lmgr will remember the command id when a lock was acquired;
- for a savepoint, we will just store the command id at which the savepoint was set;
- when going to sleep due to a concurrent update of the same row, a backend will store MyProc and the tuple id in a shmem hash table.

When rolling back to a savepoint, the backend will:

- release locks acquired after the savepoint;
- for each relation updated after the savepoint, get the last updated tid from the relation cache, walk through the relation, set HEAP_XMIN_INVALID/HEAP_XMAX_INVALID in all tuples updated after the savepoint, and wake up concurrent writers blocked on these tuples (using the shmem hash table mentioned above).

The last feature (waking up concurrent writers) is the hardest part to implement. AFAIK, Oracle 7.3 was not able to do it. Can someone comment on whether this feature is implemented in Oracle 8.X or other DBMSes?

Now about implicit savepoints. The backend will place them before executing user statements. In the case of failure, the transaction state will be rolled back to the one before execution of the query. As a side effect, this means that we'll get rid of complaints about the entire transaction being aborted when a typo causes a parser error...

Comments?

Vadim
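The rollback walk in the proposal can be sketched in C under simplifying assumptions: one relation, array indexes as tids, the proposed extra header field modeled as `prev_updated`, and each tuple touched at most once per transaction (a tuple both inserted and deleted in the same transaction would overwrite its chain link, which a real implementation must handle). All type and function names and the flag values are hypothetical; only the HEAP_XMIN_INVALID/HEAP_XMAX_INVALID names come from the mail, and waking blocked writers is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names taken from the mail; the bit values are arbitrary here. */
#define HEAP_XMIN_INVALID 0x1u
#define HEAP_XMAX_INVALID 0x2u

#define NO_TID (-1)      /* end of the update chain */
#define NO_CID (-1)      /* "not done by this transaction" */
#define MAX_TUPLES 16

typedef struct {
    int      cmin;          /* command id that inserted the tuple, or NO_CID */
    int      cmax;          /* command id that deleted the tuple, or NO_CID */
    unsigned infomask;
    int      prev_updated;  /* proposed extra header field: previously updated tid */
} Tuple;

typedef struct {
    Tuple tuples[MAX_TUPLES];
    int   ntuples;
    int   last_updated;     /* kept in the relation cache in the proposal */
} Relation;

typedef struct {
    int  lockid;
    int  acquired_cid;      /* lmgr remembers the command id at acquisition */
    bool held;
} Lock;

/* Push a just-modified tuple onto the relation's update chain. */
static void remember_update(Relation *rel, int tid)
{
    rel->tuples[tid].prev_updated = rel->last_updated;
    rel->last_updated = tid;
}

static int sketch_insert(Relation *rel, int cid)
{
    assert(rel->ntuples < MAX_TUPLES);
    int tid = rel->ntuples++;
    rel->tuples[tid] = (Tuple){ cid, NO_CID, 0, NO_TID };
    remember_update(rel, tid);
    return tid;
}

static void sketch_delete(Relation *rel, int tid, int cid)
{
    rel->tuples[tid].cmax = cid;
    remember_update(rel, tid);
}

/* A savepoint is just the command id current when it was set. */
static void rollback_to_savepoint(Relation *rel, Lock *locks, int nlocks, int sp_cid)
{
    /* Release locks acquired at or after the savepoint's command id. */
    for (int i = 0; i < nlocks; i++)
        if (locks[i].held && locks[i].acquired_cid >= sp_cid)
            locks[i].held = false;

    /* Walk the update chain newest-first and stop at the first tuple
     * touched before the savepoint: everything further back is older. */
    int tid = rel->last_updated;
    while (tid != NO_TID) {
        Tuple *t = &rel->tuples[tid];
        int cid = (t->cmax != NO_CID) ? t->cmax : t->cmin;
        if (cid < sp_cid)
            break;
        if (t->cmin != NO_CID && t->cmin >= sp_cid)
            t->infomask |= HEAP_XMIN_INVALID;   /* undo the insert */
        if (t->cmax != NO_CID && t->cmax >= sp_cid)
            t->infomask |= HEAP_XMAX_INVALID;   /* undo the delete */
        tid = t->prev_updated;
    }
    rel->last_updated = tid;  /* later updates continue the chain from here */
}
```

Because command ids only grow within a transaction, the chain is ordered newest-first, so this sketch stops at the first tuple touched before the savepoint rather than scanning the whole relation.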