/*-------------------------------------------------------------------------
 *
 * walsender_private.h
 *	  Private definitions from replication/walsender.c.
 *
 * Portions Copyright (c) 2010-2021, PostgreSQL Global Development Group
 *
 * src/include/replication/walsender_private.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef _WALSENDER_PRIVATE_H
#define _WALSENDER_PRIVATE_H

#include "access/xlog.h"
#include "nodes/nodes.h"
#include "replication/syncrep.h"
#include "storage/latch.h"
#include "storage/shmem.h"
#include "storage/spin.h"

typedef enum WalSndState
{
	WALSNDSTATE_STARTUP = 0,
	WALSNDSTATE_BACKUP,
	WALSNDSTATE_CATCHUP,
	WALSNDSTATE_STREAMING,
	WALSNDSTATE_STOPPING
} WalSndState;
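
/*
 * A minimal usage sketch (not part of this header): a walsender that has
 * caught up switches from catchup to streaming via WalSndSetState(),
 * declared below, which takes the per-walsender spinlock before writing:
 *
 *		if (MyWalSnd->state == WALSNDSTATE_CATCHUP)
 *			WalSndSetState(WALSNDSTATE_STREAMING);
 */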

/*
 * Each walsender has a WalSnd struct in shared memory.
 *
 * This struct is protected by its 'mutex' spinlock field, except that some
 * members are only written by the walsender process itself, and thus that
 * process is free to read those members without holding spinlock.  pid and
 * needreload always require the spinlock to be held for all accesses.
 */
typedef struct WalSnd
{
	pid_t		pid;			/* this walsender's PID, or 0 if not active */
	WalSndState state;			/* this walsender's state */
	XLogRecPtr	sentPtr;		/* WAL has been sent up to this point */
	bool		needreload;		/* does currently-open file need to be
								 * reloaded? */

	/*
	 * The xlog locations that have been written, flushed, and applied by
	 * standby-side. These may be invalid if the standby-side has not offered
	 * values yet.
	 */
	XLogRecPtr	write;
	XLogRecPtr	flush;
	XLogRecPtr	apply;

	/* Measured lag times, or -1 for unknown/none. */
	TimeOffset	writeLag;
	TimeOffset	flushLag;
	TimeOffset	applyLag;

	/*
	 * The priority order of the standby managed by this WALSender, as listed
	 * in synchronous_standby_names, or 0 if not-listed.
	 */
	int			sync_standby_priority;

	/* Protects shared variables shown above. */
	slock_t		mutex;

	/*
	 * Pointer to the walsender's latch. Used by backends to wake up this
	 * walsender when it has work to do. NULL if the walsender isn't active.
	 */
	Latch	   *latch;

	/*
	 * Timestamp of the last message received from standby.
	 */
	TimestampTz replyTime;
} WalSnd;

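/*
 * A sketch (not part of this header) of how a backend might wake a
 * walsender: fetch the latch pointer with the slot's spinlock held, then
 * set the latch outside the lock.
 *
 *		Latch	   *latch;
 *
 *		SpinLockAcquire(&walsnd->mutex);
 *		latch = walsnd->latch;
 *		SpinLockRelease(&walsnd->mutex);
 *		if (latch != NULL)
 *			SetLatch(latch);
 */
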
extern WalSnd *MyWalSnd;
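
/*
 * A sketch (not part of this header): the walsender updates its own
 * shared-memory fields under its spinlock, e.g. after sending WAL up to
 * a local "sentPtr" position:
 *
 *		SpinLockAcquire(&MyWalSnd->mutex);
 *		MyWalSnd->sentPtr = sentPtr;
 *		SpinLockRelease(&MyWalSnd->mutex);
 */
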
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
	/*
	 * Synchronous replication queue with one queue per request type.
	 * Protected by SyncRepLock.
	 */
	SHM_QUEUE	SyncRepQueue[NUM_SYNC_REP_WAIT_MODE];

	/*
	 * Current location of the head of the queue. All waiters should have a
	 * waitLSN that follows this value. Protected by SyncRepLock.
	 */
	XLogRecPtr	lsn[NUM_SYNC_REP_WAIT_MODE];

	/*
	 * Are any sync standbys defined?  Waiting backends can't reload the
	 * config file safely, so checkpointer updates this value as needed.
	 * Protected by SyncRepLock.
	 */
	bool		sync_standbys_defined;

	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;

extern WalSndCtlData *WalSndCtl;
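
/*
 * A sketch (not part of this header) of scanning all walsender slots;
 * max_wal_senders (a GUC declared in walsender.h) gives the array length,
 * and each slot's fields are read under its own spinlock:
 *
 *		for (int i = 0; i < max_wal_senders; i++)
 *		{
 *			WalSnd	   *walsnd = &WalSndCtl->walsnds[i];
 *			pid_t		pid;
 *
 *			SpinLockAcquire(&walsnd->mutex);
 *			pid = walsnd->pid;
 *			SpinLockRelease(&walsnd->mutex);
 *
 *			if (pid == 0)
 *				continue;		// unused slot
 *		}
 */
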
extern void WalSndSetState(WalSndState state);

/*
 * Internal functions for parsing the replication grammar, in repl_gram.y and
 * repl_scanner.l
 */
extern int replication_yyparse(void);
extern int replication_yylex(void);
extern void replication_yyerror(const char *str) pg_attribute_noreturn();
extern void replication_scanner_init(const char *query_string);
extern void replication_scanner_finish(void);
extern Node *replication_parse_result;
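
/*
 * A sketch (not part of this header) of how the walsender drives this
 * parser for an incoming replication command string (error handling
 * abbreviated):
 *
 *		Node	   *cmd_node;
 *
 *		replication_scanner_init(cmd_string);
 *		if (replication_yyparse() != 0)
 *			ereport(ERROR, (errmsg_internal("replication command parse failed")));
 *		replication_scanner_finish();
 *		cmd_node = replication_parse_result;
 */
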
#endif /* _WALSENDER_PRIVATE_H */