From ccshag@cclabs.missouri.edu Wed Jul 10 04:27:16 1996
Received: from realtime.cc.missouri.edu (realtime.cc.missouri.edu [128.206.212.69]) by ki.net (8.7.5/8.7.5) with ESMTP id EAA12722 for ; Wed, 10 Jul 1996 04:27:13 -0400 (EDT)
Received: from localhost (ccshag@localhost) by realtime.cc.missouri.edu (8.7.1/8.7.1) with SMTP id DAA00915; Wed, 10 Jul 1996 03:26:42 -0500 (CDT)
X-Authentication-Warning: realtime.cc.missouri.edu: ccshag owned process doing -bs
Date: Wed, 10 Jul 1996 03:26:41 -0500 (CDT)
From: "Paul 'Shag' Walmsley"
Sender: ccshag@cclabs.missouri.edu
To: postgres95-hackers@oozoo.vnet.net
cc: pixel@tiger.coe.missouri.edu, michael.siebenborn@ae3.Hypo.DE, scrappy@ki.net
Subject: Workaround for database corruption problems [long]
Message-ID:
X-Disclaimer: The opinions of this writer do not necessarily represent those of the University of Missouri-Columbia.
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
X-Status:

I've been poking through the guts of POSTGRES95 in pursuit of the previously-mentioned "database corruption" problem, and I think that I have some preliminary answers and a workaround for people to play around with.

Specifically, the problem that I'm referring to is manifested by pg_log, pg_time, or pg_variable either growing to huge sizes (usually 128MB) or showing evidence of internal corruption.  Symptoms that appear to the user include receiving messages like the following in psql or monitor:

  "WARN:cannot open segment 1 of relation pg_time"
  "WARN:cannot write block 16384 of pg_log"

... which are followed up in a 'postmaster -d 3' debug log by a

  "NOTICE:AbortTransaction and not in in-progress state"

It's been my experience that this problem is difficult to reproduce on demand; I've tried several different ways and haven't succeeded.  (If anyone knows how to reliably reproduce this, please E-mail me.)  I was, however, able to obtain a database in a corrupted state from one of the fellows on campus here running POSTGRES95; what follows is something of a post-mortem.

----------

The problem with the database is that after any query, POSTGRES95 attempts to generate a huge pg_log file and fails (running out of disk space) with a "WARN:cannot write block 16384 of pg_log" message.  The contents of the database and its internal structure were reasonably intact - I discovered that I was able to execute a single SQL query per backend, and could therefore back up the contents of the database with a COPY command.  (I wasn't able to use pg_dump; it failed on its BEGIN TRANSACTION step.)

Inspection of data/pg_log, data/pg_variable, and data/pg_time revealed the following:

realtime /usr/postgres95 > od -x data/pg_log.old | head
0000000 0060 2000 2000 2000 f680 0000 ed00 0000
0000020 e300 0000 dc00 0000 ce00 0000 bc00 0000
0000040 b140 0000 a740 0000 9dc0 0000 9180 0000
0000060 8580 0000 7c80 0000 6fc0 0000 6480 0000
0000100 58c0 0000 4fc0 0000 4640 0000 3a80 0000
0000120 30c0 0000 23c0 0000 1900 0000 0c40 0000
0000140 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200 8800 0000 0000 0000 0800 0000 0000 0000
0000220 0800 0000 0000 0000 0800 0000 0000 0000

Now, already something is obviously wrong: pg_log shouldn't have any values other than 0, 1, 2, 4, 8, a, c, or e for each nybble (see the XID_ ... definitions in src/backend/access/transam.h:35; two of these can be stored per nybble.)
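That check is easy enough to mechanize.  Here's a minimal Perl sketch -- my illustration only, not the recovery script mentioned later in this message; the path and read size are arbitrary -- that flags any nybble in pg_log outside that set:

  #!/usr/bin/perl
  # Sketch: scan pg_log for nybbles outside the legal XID_* status values
  # (0, 1, 2, 4, 8, a, c, e) described above.  Path is just an example.
  $file = shift || "data/pg_log";
  open(LOG, $file) || die "can't open $file: $!";
  binmode(LOG);
  foreach $v (0x0, 0x1, 0x2, 0x4, 0x8, 0xa, 0xc, 0xe) { $legal{$v} = 1; }
  $offset = 0;
  while (($n = read(LOG, $buf, 8192)) > 0) {
      for ($i = 0; $i < $n; $i++) {
          $byte = ord(substr($buf, $i, 1));
          foreach $nyb (($byte >> 4) & 0xf, $byte & 0xf) {
              printf("suspect nybble %x at byte %d\n", $nyb, $offset + $i)
                  unless $legal{$nyb};
          }
      }
      $offset += $n;
  }
  close(LOG);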
realtime /usr/postgres95 > od -x data/pg_variable.old | head
0000000 0058 2000 2000 2740 f240 0000 e400 0000
0000020 d780 0000 cac0 0000 b840 0000 a900 0000
0000040 9f80 0000 9280 0000 8580 0000 78c0 0000
0000060 6c80 0000 6200 0000 5800 0000 4f80 0000
0000100 44c0 0000 39c0 0000 2d80 0000 2600 0000
0000120 1c00 0000 0e00 0000 0000 0000 0000 0000
0000140 0000 0000 0000 0000 0000 0000 0000 0000
*
0000700 0000 01bc 0000 009c 0014 ffff ffff 0000
0000720 0004 5f0b 0000 0000 0000 0ba3 0000 0c23

This is obviously broken too; the only bytes that should be non-zero in pg_variable are the first 16 (see src/backend/access/transam.h:99).

realtime /usr/postgres95 > od -x data/pg_time.old | head -10
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0004000 31c6 09ae 0000 0000 31c6 09b3 31a9 79a6
0004020 31a9 79a6 31a9 79a8 31a9 79a9 31a9 79a9
0004040 31a9 79a9 31a9 79aa 31a9 79aa 31a9 79aa
0004060 31a9 79aa 31a9 79aa 31a9 79ab 31a9 79ab
0004100 31a9 79ab 31a9 79ac 31a9 79ac 31a9 79ac
0004120 31a9 79ac 31a9 79ad 31a9 79ad 31a9 79ae
0004140 31a9 79ae 31a9 79af 31a9 79af 31a9 79af
0004160 31a9 79b0 31a9 79b0 31a9 79b4 31a9 79b4

realtime /usr/postgres95 > od -x data/pg_time.old +32b | head -10
0032000 6665 6520 666f 7220 31c5 cb61 696e 6720
0032020 6465 7665 6c6f 7065 7273 2062 7579 2061
0032040 206c 6f77 2d69 6e63 6f6d 6520 6170 6172
0032060 746d 656e 7420 636f 6d70 6c65 7869 6e20
0032100 7468 6520 7472 6561 7375 7265 7227 7320
0032120 7761 7264 2066 726f 6d20 7468 6520 6665
0032140 6465 7261 6c20 676f 7665 726e 6d65 6e74
0032160 2066 6f72 2024 313b 2064 6576 656c 6f70
0032200 6572 7361 6c73 6f20 31c5 cb7c 31be 91a2
0032220 31be 91a2 31be 91a2 31be 91a2 31be 91a3

The first part of pg_time looked okay -- filled with 4-byte time values with the occasional "00 00 00 00" here and there -- but inspection into pg_time's guts revealed some preposterous numbers.

For your comparing pleasure, I include copies of what these files _should_ look like -- more or less -- from a newly initdb'd database directory:

realtime /usr/postgres95 > od -x data-new-initdb/pg_log
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200 8aaa aaaa aaaa aaa8 0000 0000 0000 0000
0000220 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000

realtime /usr/postgres95 > od -x data-new-initdb/pg_variable
0000000 0000 0000 0000 0222 0000 021e 0000 5120
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000

realtime /usr/postgres95 > od -x data-new-initdb/pg_time
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0004000 31e3 4fb3 0000 0000 31e3 4fb4 31e3 4fb4
0004020 31e3 4fb4 31e3 4fb5 31e3 4fb5 31e3 4fb5
0004040 31e3 4fb6 31e3 4fb6 31e3 4fb6 31e3 4fb6
0004060 31e3 4fb7 31e3 4fb7 31e3 4fb7 31e3 4fb7
0004100 31e3 4fb8 31e3 4fb8 31e3 4fb8 31e3 4fb9
0004120 31e3 4fb9 31e3 4fb9 31e3 4fb9 31e3 4fba
0004140 31e3 4fba 31e3 4fba 31e3 4fbb 31e3 4fbb
0004160 31e3 4fbb 31e3 4fbb 31e3 4fbb 0000 0000
0004200 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000

What's more, if I view pg_log, pg_variable, and pg_time with a text editor, I discover that the corrupted areas contain fragments of tuples from the database!

----------

Since I have been unable to reproduce this problem on demand (and therefore don't have a complete picture of all of the queries executed on the database), I've only been able to speculate on the original cause of the problem.  One explanation would be some kind of buffer manager bug; or perhaps the backend is running low on file descriptors and bollixing existing ones?
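In the meantime, the same sort of quick-and-dirty check works for telling whether a given installation has been bitten: per the rule above, anything non-zero past the first 16 bytes of pg_variable is suspect.  A rough Perl sketch (again, just my illustration, with the path hard-coded for the example):

  #!/usr/bin/perl
  # Sketch: count non-zero bytes past the first 16 bytes of pg_variable,
  # which (per the transam.h comments cited above) should all be zero.
  $file = shift || "data/pg_variable";
  open(VAR, $file) || die "can't open $file: $!";
  binmode(VAR);
  read(VAR, $header, 16);          # first 16 bytes legitimately hold data
  $offset = 16;
  $bad = 0;
  while (($n = read(VAR, $buf, 8192)) > 0) {
      for ($i = 0; $i < $n; $i++) {
          if (ord(substr($buf, $i, 1)) != 0) {
              $first = $offset + $i unless defined($first);
              $bad++;
          }
      }
      $offset += $n;
  }
  close(VAR);
  if ($bad) {
      printf("%d non-zero bytes past byte 16 (first at %d) -- looks corrupted\n", $bad, $first);
  } else {
      print "pg_variable looks clean\n";
  }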
I intend to continue tracking this down as I have the time, but in the interim, I've created a quick Perl script that is designed to restore some sense of rationality to the pg_{log,time,variable} files.

It's a horribly barbaric way to do this -- at the very least, it almost certainly breaks time travel, and will almost certainly result in duplicate oid problems -- but it restored my pathological database here to a sufficiently working state that I could run pg_dump on it and do various other queries with it.  The script is included at the bottom of this message.

Anyway, fellow posthackers, assuming that no one raises any red flags about this, I'll post it on postgres95@oozoo.vnet.net as a temporary stop-gap workaround to allow people to recover their databases when this sort of problem occurs.

I wish I had more time to do this kind of thing.

- Paul "Shag" Walmsley

"Knowing is not enough."
 -- Hal Hartley, "Surviving Desire"