Don't lose walreceiver start requests due to race condition in postmaster.

When a walreceiver dies, the startup process will notice that and send
a PMSIGNAL_START_WALRECEIVER signal to the postmaster, asking for a new
walreceiver to be launched.  There's a race condition, which at least
in HEAD is very easy to hit, whereby the postmaster might see that
signal before it processes the SIGCHLD from the walreceiver process.
In that situation, sigusr1_handler() just dropped the start request
on the floor, reasoning that it must be redundant.  Eventually, after
10 seconds (WALRCV_STARTUP_TIMEOUT), the startup process would make a
fresh request --- but that's a long time if the connection could have
been re-established almost immediately.

Fix it by setting a state flag inside the postmaster that we won't
clear until we do launch a walreceiver.  In cases where that results
in an extra walreceiver launch, it's up to the walreceiver to realize
it's unwanted and go away --- but we have, and need, that logic anyway
for the opposite race case.

I came across this through investigating unexpected delays in the
src/test/recovery TAP tests: it manifests there in test cases where
a master server is stopped and restarted while leaving streaming
slaves active.

This logic has been broken all along, so back-patch to all supported
branches.

Discussion: https://postgr.es/m/21344.1498494720@sss.pgh.pa.us
This commit is contained in:
Tom Lane 2017-06-26 17:31:56 -04:00
parent ad1b5c842b
commit e5d494d78c
1 changed files with 32 additions and 7 deletions

View File

@ -357,6 +357,9 @@ static volatile sig_atomic_t start_autovac_launcher = false;
/* the launcher needs to be signalled to communicate some condition */
static volatile bool avlauncher_needs_signal = false;
/* received START_WALRECEIVER signal */
static volatile sig_atomic_t WalReceiverRequested = false;
/* set when there's a worker that needs to be started up */
static volatile bool StartWorkerNeeded = true;
static volatile bool HaveCrashedWorker = false;
@ -426,6 +429,7 @@ static void maybe_start_bgworkers(void);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@ -1810,6 +1814,10 @@ ServerLoop(void)
kill(AutoVacPID, SIGUSR2);
}
/* If we need to start a WAL receiver, try to do that now */
if (WalReceiverRequested)
MaybeStartWalReceiver();
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@ -2958,7 +2966,8 @@ reaper(SIGNAL_ARGS)
/*
* Was it the wal receiver? If exit status is zero (normal) or one
* (FATAL exit), we assume everything is all right just like normal
* backends.
* backends. (If we need a new wal receiver, we'll start one at the
* next iteration of the postmaster's main loop.)
*/
if (pid == WalReceiverPID)
{
@ -5066,14 +5075,12 @@ sigusr1_handler(SIGNAL_ARGS)
StartAutovacuumWorker();
}
if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) &&
WalReceiverPID == 0 &&
(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
Shutdown == NoShutdown)
if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER))
{
/* Startup Process wants us to start the walreceiver process. */
WalReceiverPID = StartWalReceiver();
/* Start immediately if possible, else remember request for later. */
WalReceiverRequested = true;
MaybeStartWalReceiver();
}
if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE) &&
@ -5409,6 +5416,24 @@ StartAutovacuumWorker(void)
}
}
/*
* MaybeStartWalReceiver
* Start the WAL receiver process, if not running and our state allows.
*/
static void
MaybeStartWalReceiver(void)
{
if (WalReceiverPID == 0 &&
(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
Shutdown == NoShutdown)
{
WalReceiverPID = StartWalReceiver();
WalReceiverRequested = false;
}
}
/*
* Create the opts file
*/