Fix postmaster state machine to handle dead_end child crashes better.

A report from Alvaro Herrera shows that if we're in PM_STARTUP
state, and we spawn a dead_end child to reject some incoming
connection request, and that child dies with an unexpected exit
code, the postmaster does not respond well.  We correctly send
SIGQUIT to the startup process, but then:

* if the startup process exits with nonzero exit code, as expected,
we thought that that indicated a crash and aborted startup.

* if the startup process exits with zero exit code, which is possible
due to the inherent race condition, we'd advance to PM_RUN state
which is fine --- but the code forgot that AbortStartTime would be
nonzero in this situation.  We'd either die on the Asserts saying
that it was zero, or perhaps misbehave later on.  (A quick look
suggests that the only misbehavior might be busy-waiting due to
DetermineSleepTime doing the wrong thing.)

To fix the first point, adjust the state-machine logic to recognize
that a nonzero exit code is expected after sending SIGQUIT, and have
it transition to a state where we can restart the startup process.
To fix the second point, change the Asserts to clear the variable
rather than just claiming it should be clear already.

Perhaps we could improve this further by not treating a crash of
a dead_end child as a reason for panic'ing the database.  However,
since those child processes are connected to shared memory, that
seems a bit risky.  There are few good reasons for a dead_end child
to report failure anyway (the cause of this in Alvaro's report is
quite unclear).  On balance, therefore, a minimal fix seems best.

This is an oversight in commit 45811be94.  While that was back-patched,
I'm hesitant to back-patch this change.  The lack of reasons for a
dead_end child to fail suggests that the case should be very rare in
the field, which squares with the lack of reports; so it seems like
this might not be worth the risk of introducing new issues.  In any
case we can let it bake awhile in HEAD before considering a back-patch.

Discussion: https://postgr.es/m/20190615160950.GA31378@alvherre.pgsql
This commit is contained in:
Tom Lane 2019-08-26 15:59:44 -04:00
parent 348778ddbc
commit ee32782395
1 changed files with 19 additions and 4 deletions

View File

@ -2920,7 +2920,9 @@ reaper(SIGNAL_ARGS)
* during PM_STARTUP is treated as catastrophic. There are no
* other processes running yet, so we can just exit.
*/
if (pmState == PM_STARTUP && !EXIT_STATUS_0(exitstatus))
if (pmState == PM_STARTUP &&
StartupStatus != STARTUP_SIGNALED &&
!EXIT_STATUS_0(exitstatus))
{
LogChildExit(LOG, _("startup process"),
pid, exitstatus);
@ -2937,11 +2939,24 @@ reaper(SIGNAL_ARGS)
* then we previously sent the startup process a SIGQUIT; so
* that's probably the reason it died, and we do want to try to
* restart in that case.
*
* This stanza also handles the case where we sent a SIGQUIT
* during PM_STARTUP due to some dead_end child crashing: in that
* situation, if the startup process dies on the SIGQUIT, we need
* to transition to PM_WAIT_BACKENDS state which will allow
* PostmasterStateMachine to restart the startup process. (On the
* other hand, the startup process might complete normally, if we
* were too late with the SIGQUIT. In that case we'll fall
* through and commence normal operations.)
*/
if (!EXIT_STATUS_0(exitstatus))
{
if (StartupStatus == STARTUP_SIGNALED)
{
StartupStatus = STARTUP_NOT_RUNNING;
if (pmState == PM_STARTUP)
pmState = PM_WAIT_BACKENDS;
}
else
StartupStatus = STARTUP_CRASHED;
HandleChildCrash(pid, exitstatus,
@ -2954,7 +2969,7 @@ reaper(SIGNAL_ARGS)
*/
StartupStatus = STARTUP_NOT_RUNNING;
FatalError = false;
Assert(AbortStartTime == 0);
AbortStartTime = 0;
ReachedNormalRunning = true;
pmState = PM_RUN;
@ -3504,7 +3519,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
if (pid == StartupPID)
{
StartupPID = 0;
StartupStatus = STARTUP_CRASHED;
/* Caller adjusts StartupStatus, so don't touch it here */
}
else if (StartupPID != 0 && take_action)
{
@ -5100,7 +5115,7 @@ sigusr1_handler(SIGNAL_ARGS)
{
/* WAL redo has started. We're out of reinitialization. */
FatalError = false;
Assert(AbortStartTime == 0);
AbortStartTime = 0;
/*
* Crank up the background tasks. It doesn't matter if this fails,