Rework order of end-of-recovery actions to delay timeline history write

A critical failure in one of the end-of-recovery actions that run before
the end-of-recovery record is written can leave PostgreSQL inconsistent
with the rest of the cluster.  Two examples of such failures are an
error while processing two-phase state files and an error while
operating on recovery.conf.  With this commit, such failures are still
considered FATAL, but the write of the timeline history file is delayed
as much as possible, so that the window between the moment the file is
written and the moment the end-of-recovery record is generated is
minimized.  This way, if one of these actions crashes or fails, the new
timeline decided at promotion will not appear as taken to the other
nodes in the cluster.  This window cannot be reduced to zero, so
failures are still possible if a crash happens between the history file
write and the end-of-recovery record; any future code should be careful
when adding new end-of-recovery actions.  The original report from
Magnus Hagander involved a renamed recovery.conf: the timeline was
already seen as taken, but the subsequent processing of the now-missing
recovery.conf caused the startup process to stop with a FATAL error,
and the following startup left the system inconsistent because of
on-disk changes that had already happened.

Processing of two-phase state files still needs some work, as corrupted
entries are currently simply ignored.  This is left as a future item;
this commit only fixes the original complaint.
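
The effect of the reordering described above can be illustrated with a
small model.  This is only a sketch: the step labels, the two order
arrays and timeline_taken_on_failure() are hypothetical names made up
for illustration, not PostgreSQL symbols, and the two orderings are
inferred from the hunks of this commit rather than from the full
StartupXLOG() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for end-of-recovery steps; not PostgreSQL symbols. */
typedef enum
{
    STEP_PRESCAN_TWO_PHASE,       /* scan two-phase state files */
    STEP_EXIT_ARCHIVE_RECOVERY,   /* operates on recovery.conf */
    STEP_WRITE_TIMELINE_HISTORY,  /* timeline becomes visible as "taken" */
    STEP_END_OF_RECOVERY_RECORD   /* switch to the new timeline is final */
} RecoveryStep;

#define N_STEPS 4

/* Assumed order before this commit: history file written first. */
const RecoveryStep OLD_ORDER[N_STEPS] = {
    STEP_WRITE_TIMELINE_HISTORY,
    STEP_EXIT_ARCHIVE_RECOVERY,
    STEP_PRESCAN_TWO_PHASE,
    STEP_END_OF_RECOVERY_RECORD
};

/* Assumed order after this commit: history file written as late as possible. */
const RecoveryStep NEW_ORDER[N_STEPS] = {
    STEP_PRESCAN_TWO_PHASE,
    STEP_EXIT_ARCHIVE_RECOVERY,
    STEP_WRITE_TIMELINE_HISTORY,
    STEP_END_OF_RECOVERY_RECORD
};

/*
 * Walk the steps in order and stop at the one that fails.  Returns true
 * if the timeline history file had already been written at that point,
 * i.e. other nodes would see the timeline as taken even though the
 * end-of-recovery record was never generated.
 */
bool
timeline_taken_on_failure(const RecoveryStep *order, size_t n,
                          RecoveryStep failing)
{
    bool history_written = false;

    assert(order != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (order[i] == failing)
            return history_written;
        if (order[i] == STEP_WRITE_TIMELINE_HISTORY)
            history_written = true;
    }
    return false;               /* failing step never ran */
}
```

In this model, a failure in STEP_EXIT_ARCHIVE_RECOVERY leaves the
timeline taken with OLD_ORDER but not with NEW_ORDER.  A failure in
STEP_END_OF_RECOVERY_RECORD leaves it taken with both orders, which is
the remaining window that cannot be closed.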

Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
Michael Paquier 2018-07-09 10:25:40 +09:00
parent e8137295b3
commit 5d7c9347e4

@@ -7494,6 +7494,13 @@ StartupXLOG(void)
 		}
 	}
 
+	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * This information is not quite needed yet, but it is positioned here so
+	 * as potential problems are detected before any on-disk change is done.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
 	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
@@ -7548,6 +7555,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching if
+		 * it was active, and make a writable copy of the last WAL segment.
+		 * (Note that we also have a copy of the last block of the old WAL in
+		 * readBuf; we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby
+		 * servers.  If we crash before actually switching to the new
+		 * timeline, standby servers will nevertheless think that we switched
+		 * to the new timeline, and will try to connect to the new timeline.
+		 * To minimize the window for that, try to do as little as possible
+		 * between here and writing the end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7556,15 +7581,6 @@ StartupXLOG(void)
 	XLogCtl->ThisTimeLineID = ThisTimeLineID;
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
-	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * Prepare to write WAL starting at EndOfLog location, and init xlog
 	 * buffer cache using the block containing the last record from the
@@ -7617,9 +7633,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint