From pgsql-hackers-owner+M59479@postgresql.org Thu Sep 30 15:55:23 2004
Date: Thu, 30 Sep 2004 15:49:00 -0400 (EDT)
From: "Serguei A. Mokhov"
To: pgsql-hackers@postgresql.org
cc: "Serguei A. Mokhov"
Subject: [HACKERS] pg_upgrade project: high-level design proposal of in-place upgrade facility

Hello dear all,

[Please CC your replies to me as I am on the digest mode]

Here's finally a very high-level design proposal of the pg_upgrade feature
I was handwaving about a couple of weeks ago. Since I am almost done with
the move, I can allocate some time for this for 8.1/8.2.

If this topic is of interest to you, please read on until the very end
before flaming or bashing the ideas.

I designed this thing and kept updating the design more or less
regularly, and also reflected some issues from the nearby threads [1]
and [2].

This design is very high-level at the moment and is not very detailed. I
will need to figure out more stuff as I go and design some aspects in
finer detail. I started to poke around asking for initdb-forcing code
paths in [3], but got no response so far. But I guess if the general idea
or, rather, ideas are accepted, I will insist on more information more
aggressively :) if I can't figure something out for myself.

[1] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00000.php
[2] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00382.php
[3] http://archives.postgresql.org/pgsql-hackers/2004-08/msg01594.php

Comments are very welcome, especially _*CONSTRUCTIVE*_...

Thank you, and now sit back and read...

CONTENTS:
=========

    1. The Need
    2. Utilities and User's View of the pg_upgrade Feature
    3. Storage Management
       - Storage Managers and the smgr API
    4. Source Code Maintenance Aspects
    5. The Upgrade Sequence
    6. Proposed Implementation Plan
       - initdb() API
       - upgrade API


1. The Need
-----------

PostgreSQL has long lacked a less painful upgrade procedure: every new
revision requires a dump/restore sequence. That can take a while for a
production DB, and the DB is offline in the meantime. The new
replication-related solutions, such as Slony-I, pg_pool, and others, can
remedy the problem somewhat, but they roughly double the storage
requirements of a given database while replicating from the older server
to a newer one.

The proposed implementation of an in-server pg_upgrade facility attempts
to address both issues at the same time -- the possibility of keeping the
server running and upgrading lazily w/o doubling the storage requirements
(there will be some extra disk space taken, but far from doubling the
size). The in-process upgrade will not take much downtime and won't
require as much memory/disk/network resources as replication solutions do.


Prerequisites
-------------

Ideally, the (maybe not so anymore) ambitious goal is to simply be able to
"drop in" the binaries of the new server and kick off on the older version
of the data files. I think this is a lot more feasible now than before,
since we have these things available, which should ease the
implementation:

  - bgwriter
  - pg_autovacuum (the one to be integrated into the backend in 8.1)
  - smgr API for pluggable storage managers
  - initdb in C
  - ...

initdb in C, bgwriter, pg_autovacuum, and the pluggable storage manager
have made creating an Upgrade Subsystem for PostgreSQL more reasonable,
complete, feasible, and sane, to a point.


Utilities and the User's (DBA) View of the Feature
--------------------------------------------------

Two instances exist:

    pg_upgrade (in C)

        A standalone utility to upgrade the binary on-disk format from one
        version to another when the database is offline.
        We should always have this as an option.
        pg_upgrade will accept a subset or superset of the
        pg_dump(all)/pg_restore options that do not require a connection.
        I haven't thought this through in detail yet.

    pg_autoupgrade

        A postgres subprocess, modeled after the bgwriter and
        pg_autovacuum daemons. It works while the database system is
        running on the old data directory, lazily converting relations to
        the new format.

The pg_autoupgrade daemon can be triggered by the following events in
addition to the lazy upgrade process:

    "SQL" level: UPGRADE [NOW | time]

While the database stays online running over the older database files,
SELECT/read-only queries would be allowed using older storage managers*.
Any write operation on old data will use a write-invalidate approach: it
forces the upgrade of the affected relations to the new format to be
scheduled right after the relation currently in progress.

(* See the "Storage Management" section.)

Availability of a relation while its upgrade is in progress is likely to
be the same as under VACUUM FULL for that relation, i.e. the entire
relation is locked until the upgrade is complete. Maybe we could optimize
that by locking only particular pages of relations; I have to figure that
out.

The upgrade of indices can be done using REINDEX, which seems far less
complicated than trying to convert their on-disk representation. This has
to be done after the relation is converted. Alternatively, the index
upgrade can simply be done by "CREATE INDEX" after the upgrade of
relations.

The relations to be upgraded are ordered according to some priority, e.g.
system relations being first, then user-owned relations. The upgrade of
system relations is forced upon postmaster startup, and then user
relations are processed lazily.

So, in a sense, pg_autoupgrade will act like a proxy, choosing the
appropriate storage manager (like a driver) between the new server and
the old data files and upgrading them on demand.
For that purpose we might need to add a pg_upgradeproxy to intercept
backend requests and use the appropriate storage manager. There will be
one proxy process per backend.


Storage Management
==================

Somebody made it possible to plug a different storage manager into
postgres, and we even had two of them at some point -- one for the
magnetic disk and one for main memory. The main memory one is gone, but
the smgr API is still there. Some were dubious why we would ever need
another third-party storage manager, but here I propose to "plug in"
storage managers from the older Postgres versions itself! Here is where
the pluggable storage manager API would be handy once fully resurrected.
Instead of plugging in third-party storage managers, it will primarily be
used by the storage managers of different versions of Postgres.

We can take the storage manager code from the past maintenance releases,
namely 6.5.3, 7.0.3, 7.1.3, 7.2.5, 7.3.7, 7.4.5, and 8.0, arrange them in
an appropriate fashion, and have them implement the API properly. Anyone
can contribute a storage manager as they see fit; there's no need to get
them all at once. As a trial implementation I will try to do the last
three or four, maybe.


Where to put relations being upgraded?
--------------------------------------

At the beginning of the upgrade process, if pg detects the old version of
the data files, it moves them under $PGDATA/, and keeps the old relations
there until upgraded. The relations to be upgraded will be recorded in
the pg_upgrade_catalog. Once all relations are upgraded, the directory is
removed and the auto and proxy processes are shut down. The contents of
the pg_upgrade_catalog are then emptied. The only remaining issue is how
to deal with tablespaces (or LOCATION in 7.* releases) elsewhere -- this
can probably be addressed in a similar fashion, by having a
/my/tablespace/ directory.

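The proposal does not say how the old version of the data files would be
detected. One plausible way, shown here only as a sketch, is to read the
PG_VERSION file that every PostgreSQL data directory contains and compare
it with the major version of the running server. The function names
read_data_version() and needs_upgrade() are hypothetical and not part of
the proposal.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the version string stored in <datadir>/PG_VERSION into buf.
 * Returns 0 on success, -1 if the file is missing, unreadable, or empty.
 * (Hypothetical helper, for illustration only.)
 */
static int
read_data_version(const char *datadir, char *buf, size_t buflen)
{
    char    path[1024];
    FILE   *fp;
    int     n;

    if (buflen == 0)
        return -1;

    n = snprintf(path, sizeof(path), "%s/PG_VERSION", datadir);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;              /* path would not fit */

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    if (fgets(buf, (int) buflen, fp) == NULL)
    {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    buf[strcspn(buf, "\r\n")] = '\0';   /* strip the trailing newline */
    return buf[0] != '\0' ? 0 : -1;
}

/*
 * Returns 1 if the data directory was written by a different version
 * than server_version, 0 if the versions match, -1 if the version of
 * the data directory cannot be determined.
 */
static int
needs_upgrade(const char *datadir, const char *server_version)
{
    char    found[32];

    if (read_data_version(datadir, found, sizeof(found)) != 0)
        return -1;
    return strcmp(found, server_version) != 0;
}
```

For example, needs_upgrade("/usr/local/pgsql/data", "8.0") would return 1
for a directory whose PG_VERSION reads "7.4". A real implementation would
also have to cope with the per-database and per-tablespace layout
differences between releases, which this sketch ignores.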

Source Code Maintenance
=======================

Now, after the above, some of you may get scared by the amount of similar
code to possibly maintain in all those storage managers, but in reality
they would require only as much maintenance as the corresponding releases
get back-patched in that code area, and some have not been maintained for
quite some time already. Plus, I should be around to maintain it, should
this become realized.


Release-time Maintenance
------------------------

For maintenance of pg_upgrade itself, one will have to fork off a new
storage manager from the previous stable release and "register" it within
the system. Alternatively, the new storage manager can be forked when the
new release cycle begins. Additionally, a pg_upgrade version has to be
added implementing the API steps outlined in the pg_upgrade API section.


Implementation Steps
====================

To materialize the above idea, I'd proceed as follows:

*) Provide the initdb() API (quick)

*) Resurrect the pluggable storage manager API to be usable for the
   purpose.

*) Document it

*) Implement pg_upgrade API for 8.0 and 7.4.5.

*) Extract 8.0 and 7.4.5 storage managers and have them implement the API
   as a proof of concept. Massage the API as needed.

*) Document the process of adding new storage managers and pg_upgrade
   drivers.

*) Extract other versions' storage managers.


pg_upgrade sequence
-------------------

The pg_upgrade API covers the steps below; it is what has to be updated
for the next release.

What to do with WAL?? Maybe the upgrade can simply be done using WAL
replay with the old WAL manager? Not fully, because not everything is in
WAL, but some WAL recovery may be needed in case the server was not shut
down cleanly before the upgrade.

pg_upgrade will proceed as follows:

  - move PGDATA to PGDATA/
  - move tablespaces likewise
  - optional recovery from WAL in case the old server was not shut down
    properly
  - ? Shall I upgrade PITR logs of 8.x??? So one can recover to a
    point-in-time in the upgraded database?
  - CLUSTER all old data
  - ANALYZE all old data
  - initdb() new system catalogs
  - Merge in modifications from old system catalogs
  - upgrade schemas/users -- variations
  - upgrade user relations


Upgrade API:
------------

First draft, to be refined multiple times, but enough to convey the ideas
behind it:

    moveData()
        movePGData()
        moveTablespaces()       8.0+
        moveDbLocation()        < 8.0

    preliminaryRecovery()
        - WAL??
        - PITR 8.0+??

    preliminaryCleanup()
        CLUSTER -- recover some dead space
        ANALYZE -- gives us stats

    upgradeSystemInfo()
        initdb()
        mergeOldCatalogs()
        mergeOldTemplates()

    upgradeUsers()

    upgradeSchemas() - > 7.2, else NULL

    upgradeUserRelations()
        upgradeIndices()
            DROP/CREATE

    upgradeInit()
    {
    }

The main body in pseudocode:

    upgradeLoop()
    {
        moveData();
        preliminaryRecovery();
        preliminaryCleanup();
        upgradeSystemInfo();
        upgradeUsers();
        upgradeSchemas();
        upgradeUserRelations();
    }

The API would be something along these lines:

    typedef struct t_upgrade
    {
        bool (*moveData) (void);
        bool (*preliminaryRecovery) (void);   /* may be NULL */
        bool (*preliminaryCleanup) (void);    /* may be NULL */
        bool (*upgradeSystemInfo) (void);     /* may be NULL */
        bool (*upgradeUsers) (void);          /* may be NULL */
        bool (*upgradeSchemas) (void);        /* may be NULL */
        bool (*upgradeUserRelations) (void);  /* may be NULL */
    } t_upgrade;

The above sequence is executed either by the pg_upgrade utility,
uninterrupted, or by the pg_autoupgrade daemon. In the former case the
upgrade priority is simply by OID. In the latter it is also by OID, but
the user can override it with the UPGRADE command to schedule the upgrade
of particular relations, and write operations can change the schedule
too, with the user's explicit choice coming first. The more write
requests a relation receives while in the upgrade queue, the higher its
priority; thus, the relation with the most hits is on top. In case of a
tie, the OID decides.

Some issues to look into:

  - catalog merger
  - a crash in the middle of upgrade
  - PITR logs for 8.x+
  - ...

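As a sketch of how the upgradeLoop() pseudocode and the "may be NULL"
slots of the t_upgrade draft could fit together, the C fragment below
runs the steps in order, skips NULL slots, and stops at the first step
that reports failure. The upgrade_step typedef, run_step(), and the
step_ok()/step_fail() stubs are illustrative assumptions and not part of
the proposal; treating moveData as the only mandatory slot is inferred
from it being the only one not marked "may be NULL".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef bool (*upgrade_step) (void);

/* Same seven slots as the t_upgrade draft above, in execution order. */
typedef struct t_upgrade
{
    upgrade_step moveData;              /* assumed mandatory */
    upgrade_step preliminaryRecovery;   /* may be NULL */
    upgrade_step preliminaryCleanup;    /* may be NULL */
    upgrade_step upgradeSystemInfo;     /* may be NULL */
    upgrade_step upgradeUsers;          /* may be NULL */
    upgrade_step upgradeSchemas;        /* may be NULL */
    upgrade_step upgradeUserRelations;  /* may be NULL */
} t_upgrade;

/* Run one step; a NULL slot means "nothing to do for this version". */
static bool
run_step(const char *name, upgrade_step step)
{
    if (step == NULL)
        return true;
    if (!step())
    {
        fprintf(stderr, "pg_upgrade: step %s failed\n", name);
        return false;
    }
    return true;
}

/* Execute the sequence in order, stopping at the first failure. */
static bool
upgradeLoop(const t_upgrade *drv)
{
    if (drv == NULL || drv->moveData == NULL)
        return false;

    /* && short-circuits, so later steps never run after a failure */
    return run_step("moveData", drv->moveData)
        && run_step("preliminaryRecovery", drv->preliminaryRecovery)
        && run_step("preliminaryCleanup", drv->preliminaryCleanup)
        && run_step("upgradeSystemInfo", drv->upgradeSystemInfo)
        && run_step("upgradeUsers", drv->upgradeUsers)
        && run_step("upgradeSchemas", drv->upgradeSchemas)
        && run_step("upgradeUserRelations", drv->upgradeUserRelations);
}

/* Hypothetical stubs standing in for a real version-specific driver. */
static int  steps_run = 0;
static bool step_ok(void)   { steps_run++; return true; }
static bool step_fail(void) { steps_run++; return false; }
```

A driver for a release without schemas would leave upgradeSchemas NULL,
e.g. { step_ok, NULL, NULL, step_ok, step_ok, NULL, step_ok }, and pass
its address to upgradeLoop(). Note that stopping at the first failure
does not by itself answer the "crash in the middle of upgrade" issue
listed above; it only avoids running later steps on top of a failed one.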

Flames and Handwaving
---------------------

Okay, flame is on, but before you flame, mind you, this is a very initial
version of the design. Some of the ideas may seem far-fetched, the
contents may seem messy, but I believe it's now more doable than ever and
I am willing to put effort into it for the next release or two and then
maintain it afterwards. It's not going to be done in one shot maybe, but
incrementally, using input, feedback, and hints from you, guys.

Thank you for reading this far :-) I'd like to hear from you if any of
this made sense to you.

Truly yours,

-- 
Serguei A. Mokhov            |  /~\    The ASCII
Computer Science Department  |  \ / Ribbon Campaign
Concordia University         |   X    Against HTML
Montreal, Quebec, Canada     |  / \      Email!

From pgsql-hackers-owner+M59486@postgresql.org Thu Sep 30 16:39:54 2004
To: "Serguei A. Mokhov"
cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] pg_upgrade project: high-level design proposal of in-place upgrade facility
Comments: In-reply-to "Serguei A. Mokhov" message dated "Thu, 30 Sep 2004 15:49:00 -0400"
Date: Thu, 30 Sep 2004 16:36:02 -0400
Message-ID: <25253.1096576562@sss.pgh.pa.us>
From: Tom Lane

"Serguei A. Mokhov" writes:
> Comments are very welcome, especially _*CONSTRUCTIVE*_...

This is fundamentally wrong, because you are assigning the storage
manager functionality that it does not have. In particular, the storage
manager knows nothing of the contents or format of the files it is
managing, and so you can't realistically expect to use the smgr switch as
a way to support access to tables with different internal formats. The
places that change in interesting ways across versions are usually far
above the smgr switch.

I don't believe in the idea of incremental "lazy" upgrades very much
either. It certainly will not work on the system catalogs --- you have
to convert those in a big-bang fashion, because how are you going to find
the other stuff otherwise?
And the real problem with it IMHO is that if something goes wrong partway
through the process, you're in deep trouble because you have no way to
back out. You can't just revert to the old version because it won't
understand your data, and your old backups that are compatible with it
are now out of date. If there are going to be any problems, you really
need to find out about them immediately while your old backups are still
current, not in a "lazy" fashion.

The design we'd pretty much all bought into six months ago involved
being able to do in-place upgrades when the format/contents of user
relations and indexes is not changing. All you'd have to do is dump and
restore the schema data (system catalogs) which is a reasonably short
process even on a large DB, so the big-bang nature of the conversion
isn't a problem. Admittedly this will not work for every single
upgrade, but we had agreed that we could schedule upgrades so that the
majority of releases do not change user data. Historically that's been
mostly true anyway, even without any deliberate attempt to group
user-data-changing features together.

I think the last major discussion about it started here:
http://archives.postgresql.org/pgsql-hackers/2003-12/msg00379.php
(I got distracted by other stuff and never did the promised work,
but I still think the approach is sound.)

			regards, tom lane

From pgsql-hackers-owner+M59497@postgresql.org Thu Sep 30 17:44:44 2004
Date: Thu, 30 Sep 2004 17:42:19 -0400 (EDT)
From: "Serguei A. Mokhov"
To: Tom Lane
cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] pg_upgrade project: high-level design proposal of
In-Reply-To: <25253.1096576562@sss.pgh.pa.us>
References: <25253.1096576562@sss.pgh.pa.us>

On Thu, 30 Sep 2004, Tom Lane wrote:

> Date: Thu, 30 Sep 2004 16:36:02 -0400
>
> "Serguei A. Mokhov" writes:
> > Comments are very welcome, especially _*CONSTRUCTIVE*_...
>
> This is fundamentally wrong, because you are assigning the storage
> manager functionality that it does not have. In particular, the

Maybe; that's why I was asking about all the initdb-forcing paths, so I
can go a level above smgr to upgrade stuff, let's say in access/ and
other parts. I did ask that before and never got a reply. So the concept
of "Storage Managers" may and will go well beyond the smgr API. That's
the design refinement stage.

> I don't believe in the idea of incremental "lazy" upgrades very much
> either. It certainly will not work on the system catalogs --- you have
> to convert those in a big-bang fashion,

I never proposed to do that to system catalogs; on the contrary, I said
the system catalogs are to be upgraded upon postmaster startup. Only user
relations are upgraded lazily:

> The relations to be upgraded are ordered according to some priority,
> e.g. system relations being first, then user-owned relations. System
> relations upgrade is forced upon the postmaster startup, and then user
> relations are processed lazily.

So it looks like we don't disagree here :)

> The design we'd pretty much all bought into six months ago involved
> being able to do in-place upgrades when the format/contents of user
> relations and indexes is not changing. All you'd have to do is dump and
> restore the schema data (system catalogs) which is a reasonably short
> process even on a large DB, so the big-bang nature of the conversion
> isn't a problem. Admittedly this will not work for every single
> upgrade, but we had agreed that we could schedule upgrades so that the
> majority of releases do not change user data. Historically that's been
> mostly true anyway, even without any deliberate attempt to group
> user-data-changing features together.

Annoyingly enough, they still do change.

> I think the last major discussion about it started here:
> http://archives.postgresql.org/pgsql-hackers/2003-12/msg00379.php
> (I got distracted by other stuff and never did the promised work,
> but I still think the approach is sound.)

I'll go over that discussion and maybe combine the useful ideas together.
I'll open a pgfoundry project to develop it there and then submit it for
evaluation, UNLESS you reserved it for yourself, Tom, to fulfill the
promise... Unless anybody objects to the pgfoundry idea for testing the
concepts, I'll apply for a project there.

Thank you for the comments!

> 			regards, tom lane
>

-- 
Serguei A. Mokhov