/* ----------
 * pgstat.c
 *
 *	All the statistics collector stuff hacked up in one big, ugly file.
 *
 *	TODO:	- Separate collector, postmaster and backend stuff
 *			  into different files.
 *
 *			- Add some automatic call for pgstat vacuuming.
 *
 *			- Add a pgstat config column to pg_database, so this
 *			  entire thing can be enabled/disabled on a per db basis.
 *
 *	Copyright (c) 2001-2021, PostgreSQL Global Development Group
 *
 *	src/backend/postmaster/pgstat.c
 * ----------
 */
#include "postgres.h"

#include <unistd.h>
#include <fcntl.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <signal.h>
#include <time.h>
#ifdef HAVE_SYS_SELECT_H
#include <sys/select.h>
#endif

#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/tableam.h"
#include "access/transam.h"
#include "access/twophase_rmgr.h"
#include "access/xact.h"
#include "catalog/pg_database.h"
#include "catalog/pg_proc.h"
#include "common/ip.h"
#include "executor/instrument.h"
#include "libpq/libpq.h"
#include "libpq/pqsignal.h"
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
#include "postmaster/postmaster.h"
#include "replication/slot.h"
#include "replication/walsender.h"
#include "storage/backendid.h"
#include "storage/dsm.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/lmgr.h"
#include "storage/pg_shmem.h"
#include "storage/procsignal.h"
#include "storage/sinvaladt.h"
#include "utils/ascii.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"

/* ----------
 * Timer definitions.
 * ----------
 */
#define PGSTAT_STAT_INTERVAL	500 /* Minimum time between stats file
									 * updates; in milliseconds. */

#define PGSTAT_RETRY_DELAY		10	/* How long to wait between checks for a
									 * new file; in milliseconds. */

#define PGSTAT_MAX_WAIT_TIME	10000	/* Maximum time to wait for a stats
										 * file update; in milliseconds. */

#define PGSTAT_INQ_INTERVAL		640 /* How often to ping the collector for a
									 * new file; in milliseconds. */

#define PGSTAT_RESTART_INTERVAL 60	/* How often to attempt to restart a
									 * failed statistics collector; in
									 * seconds. */
#define PGSTAT_POLL_LOOP_COUNT (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
#define PGSTAT_INQ_LOOP_COUNT (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
/* Minimum receive buffer size for the collector's socket. */
#define PGSTAT_MIN_RCVBUF (100 * 1024)


/* ----------
 * The initial size hints for the hash tables used in the collector.
 * ----------
 */
#define PGSTAT_DB_HASH_SIZE 16
#define PGSTAT_TAB_HASH_SIZE 512
#define PGSTAT_FUNCTION_HASH_SIZE 512


/* ----------
 * Total number of backends including auxiliary
 *
 * We reserve a slot for each possible BackendId, plus one for each
 * possible auxiliary process type.  (This scheme assumes there is not
 * more than one of any auxiliary process type at a time.)  MaxBackends
 * includes autovacuum workers and background workers as well.
 * ----------
 */
#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)


/* ----------
 * GUC parameters
 * ----------
 */
bool pgstat_track_activities = false;
bool pgstat_track_counts = false;
int pgstat_track_functions = TRACK_FUNC_OFF;
int pgstat_track_activity_query_size = 1024;

/* ----------
 * Built from GUC parameter
 * ----------
 */
char *pgstat_stat_directory = NULL;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;

/*
 * BgWriter and WAL global statistics counters.
 * Stored directly in a stats message structure so they can be sent
 * without needing to copy things around.  We assume these init to zeroes.
 */
PgStat_MsgBgWriter BgWriterStats;
PgStat_MsgWal WalStats;

/*
 * WAL usage counters saved from pgWalUsage at the previous call to
 * pgstat_send_wal().  This is used to calculate how much WAL usage
 * happens between pgstat_send_wal() calls, by subtracting
 * the previous counters from the current ones.
 */
static WalUsage prevWalUsage;

/*
 * List of SLRU names that we keep stats for.  There is no central registry of
 * SLRUs, so we use this fixed list instead.  The "other" entry is used for
 * all SLRUs without an explicit entry (e.g. SLRUs in extensions).
 */
static const char *const slru_names[] = {
|
Rename SLRU structures and associated LWLocks.
Originally, the names assigned to SLRUs had no purpose other than
being shmem lookup keys, so not a lot of thought went into them.
As of v13, though, we're exposing them in the pg_stat_slru view and
the pg_stat_reset_slru function, so it seems advisable to take a bit
more care. Rename them to names based on the associated on-disk
storage directories (which fortunately we *did* think about, to some
extent; since those are also visible to DBAs, consistency seems like
a good thing). Also rename the associated LWLocks, since those names
are likewise user-exposed now as wait event names.
For the most part I only touched symbols used in the respective modules'
SimpleLruInit() calls, not the names of other related objects. This
renaming could have been taken further, and maybe someday we will do so.
But for now it seems undesirable to change the names of any globally
visible functions or structs, so some inconsistency is unavoidable.
(But I *did* terminate "oldserxid" with prejudice, as I found that
name both unreadable and not descriptive of the SLRU's contents.)
Table 27.12 needs re-alphabetization now, but I'll leave that till
after the other LWLock renamings I have in mind.
Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
2020-05-15 20:28:19 +02:00
|
|
|
"CommitTs",
|
|
|
|
"MultiXactMember",
|
|
|
|
"MultiXactOffset",
|
|
|
|
"Notify",
|
|
|
|
"Serial",
|
|
|
|
"Subtrans",
|
|
|
|
"Xact",
|
Improve management of SLRU statistics collection.
Instead of re-identifying which statistics bucket to use for a given
SLRU on every counter increment, do it once during shmem initialization.
This saves a fair number of cycles, and there's no real cost because
we could not have a bucket assignment that varies over time or across
backends anyway.
Also, get rid of the ill-considered decision to let pgstat.c pry
directly into SLRU's shared state; it's cleaner just to have slru.c
pass the stats bucket number.
In consequence of these changes, there's no longer any need to store
an SLRU's LWLock tranche info in shared memory, so get rid of that,
making this a net reduction in shmem consumption. (That partly
reverts fe702a7b3.)
This is basically code review for 28cac71bd, so I also cleaned up
some comments, removed a dangling extern declaration, fixed some
things that should be static and/or const, etc.
Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
2020-05-13 19:08:12 +02:00
|
|
|
"other" /* has to be last */
|
|
|
|
};
#define SLRU_NUM_ELEMENTS lengthof(slru_names)
|
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
|
Improve management of SLRU statistics collection.
Instead of re-identifying which statistics bucket to use for a given
SLRU on every counter increment, do it once during shmem initialization.
This saves a fair number of cycles, and there's no real cost because
we could not have a bucket assignment that varies over time or across
backends anyway.
Also, get rid of the ill-considered decision to let pgstat.c pry
directly into SLRU's shared state; it's cleaner just to have slru.c
pass the stats bucket number.
In consequence of these changes, there's no longer any need to store
an SLRU's LWLock tranche info in shared memory, so get rid of that,
making this a net reduction in shmem consumption. (That partly
reverts fe702a7b3.)
This is basically code review for 28cac71bd, so I also cleaned up
some comments, removed a dangling extern declaration, fixed some
things that should be static and/or const, etc.
Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
2020-05-13 19:08:12 +02:00
|
|
|
/*
|
|
|
|
* SLRU statistics counts waiting to be sent to the collector. These are
|
|
|
|
* stored directly in stats message format so they can be sent without needing
|
|
|
|
* to copy things around. We assume this variable inits to zeroes. Entries
|
|
|
|
* are one-to-one with slru_names[].
|
|
|
|
*/
|
|
|
|
static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
|
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/* ----------
|
|
|
|
* Local data
|
|
|
|
* ----------
|
|
|
|
*/
|
2010-01-10 15:16:08 +01:00
|
|
|
NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
static struct sockaddr_storage pgStatAddr;
static time_t last_pgstat_start_time;
static bool pgStatRunningInCollector = false;

/*
 * Structures in which backends store per-table info that's waiting to be
 * sent to the collector.
 *
 * NOTE: once allocated, TabStatusArray structures are never moved or deleted
 * for the life of the backend.  Also, we zero out the t_id fields of the
 * contained PgStat_TableStatus structs whenever they are not actively in use.
 * This allows relcache pgstat_info pointers to be treated as long-lived data,
 * avoiding repeated searches in pgstat_initstats() when a relation is
 * repeatedly opened during a transaction.
 */
#define TABSTAT_QUANTUM 100 /* we alloc this many at a time */
|
2007-05-27 05:50:39 +02:00
|
|
|
|
|
|
|
typedef struct TabStatusArray
|
2005-07-29 21:30:09 +02:00
|
|
|
{
|
2007-05-27 05:50:39 +02:00
|
|
|
struct TabStatusArray *tsa_next; /* link to next array, if any */
|
2007-11-15 22:14:46 +01:00
|
|
|
int tsa_used; /* # entries currently used */
|
2007-05-27 05:50:39 +02:00
|
|
|
PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM]; /* per-table data */
|
2007-11-15 23:25:18 +01:00
|
|
|
} TabStatusArray;
|
2007-05-27 05:50:39 +02:00
|
|
|
|
|
|
|
static TabStatusArray *pgStatTabList = NULL;
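The NOTE above can be illustrated with a standalone sketch of the quantum-based allocation pattern (this is not pgstat.c code; all names here are illustrative): entries are handed out from fixed-size array chunks that are never moved or freed, so pointers to individual entries stay valid for the life of the process.

```c
/*
 * Standalone sketch of the TABSTAT_QUANTUM allocation pattern used by
 * TabStatusArray: fixed-size chunks, never moved or freed, linked in a
 * list.  Illustrative only; not the pgstat.c implementation.
 */
#include <stdlib.h>

#define QUANTUM 100                 /* we alloc this many at a time */

typedef struct Entry
{
    unsigned    id;                 /* zero means "not in use" */
    long        counter;
} Entry;

typedef struct Chunk
{
    struct Chunk *next;             /* link to next chunk, if any */
    int         used;               /* # entries currently used */
    Entry       entries[QUANTUM];   /* per-entry data */
} Chunk;

static Chunk *chunk_list = NULL;

/* Hand out the next free entry, allocating a new chunk when needed. */
static Entry *
get_entry(unsigned id)
{
    Chunk      *c = chunk_list;
    Entry      *e;

    /* walk to the last chunk in the list */
    while (c && c->next)
        c = c->next;

    if (c == NULL || c->used >= QUANTUM)
    {
        /* current chunk full (or none yet): append a fresh zeroed chunk */
        Chunk      *n = calloc(1, sizeof(Chunk));

        if (c)
            c->next = n;
        else
            chunk_list = n;
        c = n;
    }

    e = &c->entries[c->used++];
    e->id = id;
    return e;
}
```

Because chunks are appended rather than reallocated, a pointer obtained from `get_entry()` (like a relcache `pgstat_info` pointer) can be cached indefinitely.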
|
2003-08-04 02:43:34 +02:00
|
|
|
|
2017-03-27 17:34:42 +02:00
|
|
|
/*
|
2017-05-15 04:52:41 +02:00
|
|
|
* pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
|
2017-03-27 17:34:42 +02:00
|
|
|
*/
|
|
|
|
typedef struct TabStatHashEntry
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
Oid t_id;
|
|
|
|
PgStat_TableStatus *tsa_entry;
|
2017-03-27 17:34:42 +02:00
|
|
|
} TabStatHashEntry;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Hash table for O(1) t_id -> tsa_entry lookup
|
|
|
|
*/
|
|
|
|
static HTAB *pgStatTabHash = NULL;
|
|
|
|
|
2008-05-15 02:17:41 +02:00
|
|
|
/*
|
|
|
|
* Backends store per-function info that's waiting to be sent to the collector
|
|
|
|
* in this hash table (indexed by function OID).
|
|
|
|
*/
|
|
|
|
static HTAB *pgStatFunctions = NULL;
|
|
|
|
|
2008-11-03 02:17:08 +01:00
|
|
|
/*
|
|
|
|
* Indicates if backend has some function stats that it hasn't yet
|
|
|
|
* sent to the collector.
|
|
|
|
*/
|
|
|
|
static bool have_function_stats = false;
|
|
|
|
|
2007-05-27 05:50:39 +02:00
|
|
|
/*
|
|
|
|
* Tuple insertion/deletion counts for an open transaction can't be propagated
|
|
|
|
* into PgStat_TableStatus counters until we know if it is going to commit
|
|
|
|
* or abort. Hence, we keep these counts in per-subxact structs that live
|
|
|
|
* in TopTransactionContext. This data structure is designed on the assumption
|
|
|
|
* that subxacts won't usually modify very many tables.
|
|
|
|
*/
|
|
|
|
typedef struct PgStat_SubXactStatus
|
|
|
|
{
|
2007-11-15 22:14:46 +01:00
|
|
|
int nest_level; /* subtransaction nest level */
|
2007-05-27 05:50:39 +02:00
|
|
|
struct PgStat_SubXactStatus *prev; /* higher-level subxact if any */
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
PgStat_TableXactStatus *first; /* head of list for this subxact */
|
2007-11-15 23:25:18 +01:00
|
|
|
} PgStat_SubXactStatus;
|
2003-07-22 21:00:12 +02:00
|
|
|
|
2007-05-27 05:50:39 +02:00
|
|
|
static PgStat_SubXactStatus *pgStatXactStack = NULL;
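The per-subxact scheme described in the comment above can be sketched in isolation (this is a simplified model, not pgstat.c itself; all names are illustrative): counts accumulate in a stack of per-subtransaction records and are folded into permanent counters only once the fate of the (sub)transaction is known.

```c
/*
 * Simplified model of deferring tuple insert/delete counts until
 * commit/abort is known.  Illustrative only; not the pgstat.c code.
 */
#include <stdbool.h>
#include <stdlib.h>

typedef struct SubXactCounts
{
    int         nest_level;         /* subtransaction nest level */
    long        inserted;           /* tuples inserted at this level */
    long        deleted;            /* tuples deleted at this level */
    struct SubXactCounts *prev;     /* enclosing (sub)xact, if any */
} SubXactCounts;

static SubXactCounts *xact_stack = NULL;
static long live_tuples = 0;
static long dead_tuples = 0;

static void
push_xact(int nest_level)
{
    SubXactCounts *x = calloc(1, sizeof(SubXactCounts));

    x->nest_level = nest_level;
    x->prev = xact_stack;
    xact_stack = x;
}

static void
pop_xact(bool commit)
{
    SubXactCounts *x = xact_stack;

    xact_stack = x->prev;
    if (commit && x->prev)
    {
        /* subxact commit: counts roll up into the parent level */
        x->prev->inserted += x->inserted;
        x->prev->deleted += x->deleted;
    }
    else if (commit)
    {
        /* top-level commit: inserts become live, deletes become dead */
        live_tuples += x->inserted - x->deleted;
        dead_tuples += x->deleted;
    }
    else
    {
        /* abort: inserted tuples are dead, deletions are undone */
        dead_tuples += x->inserted;
    }
    free(x);
}
```

For example, if a subtransaction inserts three tuples and then aborts, those three go straight to the dead count while its deletions have no effect; counts from a committed subtransaction instead merge into the enclosing level.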
|
2005-07-29 21:30:09 +02:00
|
|
|
|
2001-10-25 07:50:21 +02:00
|
|
|
static int pgStatXactCommit = 0;
|
|
|
|
static int pgStatXactRollback = 0;
|
2012-04-30 00:13:33 +02:00
|
|
|
PgStat_Counter pgStatBlockReadTime = 0;
|
|
|
|
PgStat_Counter pgStatBlockWriteTime = 0;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2007-05-27 05:50:39 +02:00
|
|
|
/* Record that's written to 2PC state file when pgstat state is persisted */
|
|
|
|
typedef struct TwoPhasePgStatRecord
|
|
|
|
{
|
2017-06-21 21:18:54 +02:00
|
|
|
PgStat_Counter tuples_inserted; /* tuples inserted in xact */
|
|
|
|
PgStat_Counter tuples_updated; /* tuples updated in xact */
|
|
|
|
PgStat_Counter tuples_deleted; /* tuples deleted in xact */
|
2015-02-20 16:10:01 +01:00
|
|
|
PgStat_Counter inserted_pre_trunc; /* tuples inserted prior to truncate */
|
|
|
|
PgStat_Counter updated_pre_trunc; /* tuples updated prior to truncate */
|
|
|
|
PgStat_Counter deleted_pre_trunc; /* tuples deleted prior to truncate */
|
2007-11-15 22:14:46 +01:00
|
|
|
Oid t_id; /* table's OID */
|
|
|
|
bool t_shared; /* is it a shared catalog? */
|
2015-02-20 16:10:01 +01:00
|
|
|
bool t_truncated; /* was the relation truncated? */
|
2007-11-15 23:25:18 +01:00
|
|
|
} TwoPhasePgStatRecord;
|
2007-05-27 05:50:39 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Info about current "snapshot" of stats file
|
|
|
|
*/
|
2007-02-08 00:11:30 +01:00
|
|
|
static MemoryContext pgStatLocalContext = NULL;
|
2001-10-25 07:50:21 +02:00
|
|
|
static HTAB *pgStatDBHash = NULL;
|
2017-03-27 04:02:22 +02:00
|
|
|
|
|
|
|
/* Status for backends including auxiliary */
|
2014-02-25 18:34:04 +01:00
|
|
|
static LocalPgBackendStatus *localBackendStatusTable = NULL;
|
2017-03-27 04:02:22 +02:00
|
|
|
|
|
|
|
/* Total number of backends including auxiliary */
|
2006-06-19 03:51:22 +02:00
|
|
|
static int localNumBackends = 0;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2007-03-30 20:34:56 +02:00
|
|
|
/*
|
|
|
|
* Cluster wide statistics, kept in the stats collector.
|
|
|
|
* Contains statistics that are not collected per database
|
|
|
|
* or per table.
|
|
|
|
*/
|
2014-01-28 18:58:22 +01:00
|
|
|
static PgStat_ArchiverStats archiverStats;
|
2007-03-30 20:34:56 +02:00
|
|
|
static PgStat_GlobalStats globalStats;
|
2020-10-02 03:17:11 +02:00
|
|
|
static PgStat_WalStats walStats;
|
2020-04-02 02:11:38 +02:00
|
|
|
static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
|
2020-10-08 05:39:08 +02:00
|
|
|
static PgStat_ReplSlotStats *replSlotStats;
|
|
|
|
static int nReplSlotStats;
|
2007-03-30 20:34:56 +02:00
|
|
|
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
/*
|
|
|
|
* List of OIDs of databases we need to write out. If an entry is InvalidOid,
|
|
|
|
* it means to write only the shared-catalog stats ("DB 0"); otherwise, we
|
|
|
|
* will write both that DB's data and the shared stats.
|
|
|
|
*/
|
|
|
|
static List *pending_write_requests = NIL;
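The request-merging behavior behind this list can be sketched as a minimal standalone model (illustrative only; pgstat.c uses a `List` of OIDs, not a fixed array): a database OID is queued only if no request for it is already pending, so closely-spaced requests collapse into a single write.

```c
/*
 * Minimal model of merging stats-write requests.  Illustrative only;
 * the array and limit here are assumptions, not the pgstat.c structure.
 * InvalidOid (0) stands for "shared stats only".
 */
#include <stdbool.h>

typedef unsigned int Oid;

#define InvalidOid  ((Oid) 0)
#define MAX_PENDING 64

static Oid  pending_writes[MAX_PENDING];
static int  n_pending = 0;

/* Returns true if a new request was queued, false if merged. */
static bool
request_stats_write(Oid dboid)
{
    for (int i = 0; i < n_pending; i++)
    {
        if (pending_writes[i] == dboid)
            return false;           /* already pending: merge */
    }
    pending_writes[n_pending++] = dboid;
    return true;
}
```

This is the property the "Avoid useless closely-spaced writes" commit above restores: two backends asking for the same database's stats in quick succession should result in one write, not two.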
|
2008-11-03 02:17:08 +01:00
|
|
|
|
2008-05-15 02:17:41 +02:00
|
|
|
/*
|
|
|
|
* Total time charged to functions so far in the current backend.
|
|
|
|
* We use this to help separate "self" and "other" time charges.
|
|
|
|
* (We assume this initializes to zero.)
|
|
|
|
*/
|
|
|
|
static instr_time total_func_time;
|
|
|
|
|
2006-01-04 22:06:32 +01:00
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/* ----------
|
|
|
|
* Local function forward declarations
|
|
|
|
* ----------
|
|
|
|
*/
|
2003-12-25 04:52:51 +01:00
|
|
|
#ifdef EXEC_BACKEND
|
2006-06-29 22:00:08 +02:00
|
|
|
static pid_t pgstat_forkexec(void);
|
2003-12-25 04:52:51 +01:00
|
|
|
#endif
|
2004-05-28 07:13:32 +02:00
|
|
|
|
2015-03-26 19:03:19 +01:00
|
|
|
NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
|
2005-04-01 01:20:49 +02:00
|
|
|
static void pgstat_beshutdown_hook(int code, Datum arg);
|
2001-10-25 07:50:21 +02:00
|
|
|
|
2005-07-29 21:30:09 +02:00
|
|
|
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
|
2009-09-05 00:32:33 +02:00
|
|
|
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
|
2019-05-22 19:04:48 +02:00
|
|
|
Oid tableoid, bool create);
|
2013-02-18 21:56:08 +01:00
|
|
|
static void pgstat_write_statsfiles(bool permanent, bool allDbs);
|
|
|
|
static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
|
|
|
|
static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
|
|
|
|
static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
|
2004-07-01 02:52:04 +02:00
|
|
|
static void backend_read_statsfile(void);
|
2006-06-19 03:51:22 +02:00
|
|
|
static void pgstat_read_current_status(void);
|
2007-05-27 05:50:39 +02:00
|
|
|
|
2013-02-18 21:56:08 +01:00
|
|
|
static bool pgstat_write_statsfile_needed(void);
|
|
|
|
static bool pgstat_db_requested(Oid databaseid);
|
|
|
|
|
2020-10-08 05:39:08 +02:00
|
|
|
static int pgstat_replslot_index(const char *name, bool create_it);
|
|
|
|
static void pgstat_reset_replslot(int i, TimestampTz ts);
|
|
|
|
|
2007-05-27 05:50:39 +02:00
|
|
|
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
|
2008-05-15 02:17:41 +02:00
|
|
|
static void pgstat_send_funcstats(void);
|
2020-04-02 02:11:38 +02:00
|
|
|
static void pgstat_send_slru(void);
|
Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring a pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used); only oids assigned later will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to merge this
now. It's painful to maintain externally, too complicated to commit
after the code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
|
|
|
static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
|
2001-10-25 07:50:21 +02:00
|
|
|
|
2007-05-27 05:50:39 +02:00
|
|
|
static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
|
|
|
|
|
2007-02-08 00:11:30 +01:00
|
|
|
static void pgstat_setup_memcxt(void);
|
|
|
|
|
2016-10-04 16:50:13 +02:00
|
|
|
static const char *pgstat_get_wait_activity(WaitEventActivity w);
|
|
|
|
static const char *pgstat_get_wait_client(WaitEventClient w);
|
|
|
|
static const char *pgstat_get_wait_ipc(WaitEventIPC w);
|
|
|
|
static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
static const char *pgstat_get_wait_io(WaitEventIO w);
|
2016-10-04 16:50:13 +02:00
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
|
2001-10-25 07:50:21 +02:00
|
|
|
static void pgstat_send(void *msg, int len);
|
|
|
|
|
2008-11-03 02:17:08 +01:00
|
|
|
static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
|
2001-10-25 07:50:21 +02:00
|
|
|
static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
|
|
|
|
static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
|
|
|
|
static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
|
|
|
|
static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
|
2010-01-19 15:11:32 +01:00
|
|
|
static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
|
2010-01-28 15:25:41 +01:00
|
|
|
static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
|
2020-04-02 02:11:38 +02:00
|
|
|
static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
|
2020-10-08 05:39:08 +02:00
|
|
|
static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
|
2005-07-14 07:13:45 +02:00
|
|
|
static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
|
|
|
|
static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
|
|
|
|
static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
|
2014-01-28 18:58:22 +01:00
|
|
|
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
|
2007-11-15 23:25:18 +01:00
|
|
|
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
|
2020-10-02 03:17:11 +02:00
|
|
|
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
|
2020-04-02 02:11:38 +02:00
|
|
|
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
|
2008-05-15 02:17:41 +02:00
|
|
|
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
|
|
|
|
static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
|
2011-01-03 12:46:03 +01:00
|
|
|
static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
|
2012-01-26 15:58:19 +01:00
|
|
|
static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
|
2019-03-09 19:45:17 +01:00
|
|
|
static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
|
2020-10-08 05:39:08 +02:00
|
|
|
static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
|
2012-01-26 14:41:19 +01:00
|
|
|
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/* ------------------------------------------------------------
|
|
|
|
* Public functions called from postmaster follow
|
|
|
|
* ------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* ----------
|
|
|
|
* pgstat_init() -
|
|
|
|
*
|
|
|
|
* Called from postmaster at startup. Create the resources required
|
2003-04-26 04:57:14 +02:00
|
|
|
* by the statistics collector process. If unable to do so, do not
|
|
|
|
* fail --- better to let the postmaster start with stats collection
|
|
|
|
* disabled.
|
2001-06-22 21:18:36 +02:00
|
|
|
* ----------
|
|
|
|
*/
|
2003-04-26 04:57:14 +02:00
|
|
|
void
|
2001-06-22 21:18:36 +02:00
|
|
|
pgstat_init(void)
|
|
|
|
{
|
2003-08-04 02:43:34 +02:00
|
|
|
ACCEPT_TYPE_ARG3 alen;
|
|
|
|
struct addrinfo *addrs = NULL,
|
|
|
|
*addr,
|
|
|
|
hints;
|
2003-06-12 09:36:51 +02:00
|
|
|
int ret;
|
2004-08-29 07:07:03 +02:00
|
|
|
fd_set rset;
|
2004-03-23 00:55:29 +01:00
|
|
|
struct timeval tv;
|
2004-08-29 07:07:03 +02:00
|
|
|
char test_byte;
|
|
|
|
int sel_res;
|
2006-04-20 12:51:32 +02:00
|
|
|
int tries = 0;
|
2006-10-04 02:30:14 +02:00
|
|
|
|
2004-03-23 00:55:29 +01:00
|
|
|
#define TESTBYTEVAL ((char) 199)
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2014-01-03 03:45:47 +01:00
|
|
|
/*
|
|
|
|
* This static assertion verifies that we didn't mess up the calculations
|
|
|
|
* involved in selecting maximum payload sizes for our UDP messages.
|
|
|
|
* Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
|
|
|
|
* be silent performance loss from fragmentation, it seems worth having a
|
|
|
|
* compile-time cross-check that we didn't.
|
|
|
|
*/
|
|
|
|
StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
"maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
|
2014-01-03 03:45:47 +01:00
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2001-08-05 04:06:50 +02:00
|
|
|
* Create the UDP socket for sending and receiving statistic messages
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2003-06-12 09:36:51 +02:00
|
|
|
hints.ai_flags = AI_PASSIVE;
|
2014-04-16 19:20:54 +02:00
|
|
|
hints.ai_family = AF_UNSPEC;
|
2003-06-12 09:36:51 +02:00
|
|
|
hints.ai_socktype = SOCK_DGRAM;
|
|
|
|
hints.ai_protocol = 0;
|
|
|
|
hints.ai_addrlen = 0;
|
|
|
|
hints.ai_addr = NULL;
|
|
|
|
hints.ai_canonname = NULL;
|
|
|
|
hints.ai_next = NULL;
|
2005-10-17 18:24:20 +02:00
|
|
|
ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
|
2003-07-24 01:30:41 +02:00
|
|
|
if (ret || !addrs)
|
2003-06-12 09:36:51 +02:00
|
|
|
{
|
2003-07-22 21:00:12 +02:00
|
|
|
ereport(LOG,
|
2003-07-24 01:30:41 +02:00
|
|
|
(errmsg("could not resolve \"localhost\": %s",
|
2003-07-22 21:00:12 +02:00
|
|
|
gai_strerror(ret))));
|
2003-06-12 09:36:51 +02:00
|
|
|
goto startup_failed;
|
|
|
|
}
|
2003-08-04 02:43:34 +02:00
|
|
|
|
2003-11-15 18:24:07 +01:00
|
|
|
/*
|
2005-10-17 18:24:20 +02:00
|
|
|
* On some platforms, pg_getaddrinfo_all() may return multiple addresses
|
|
|
|
* only one of which will actually work (eg, both IPv6 and IPv4 addresses
|
|
|
|
* when kernel will reject IPv6). Worse, the failure may occur at the
|
2014-05-06 18:12:18 +02:00
|
|
|
* bind() or perhaps even connect() stage. So we must loop through the
|
2005-10-17 18:24:20 +02:00
|
|
|
* results till we find a working combination. We will generate LOG
|
|
|
|
* messages, but no error, for bogus combinations.
|
2003-11-15 18:24:07 +01:00
|
|
|
*/
|
2003-07-24 01:30:41 +02:00
|
|
|
for (addr = addrs; addr; addr = addr->ai_next)
|
|
|
|
{
|
|
|
|
#ifdef HAVE_UNIX_SOCKETS
|
|
|
|
/* Ignore AF_UNIX sockets, if any are returned. */
|
|
|
|
if (addr->ai_family == AF_UNIX)
|
|
|
|
continue;
|
|
|
|
#endif
|
2004-08-29 07:07:03 +02:00
|
|
|
|
2006-04-20 12:51:32 +02:00
|
|
|
if (++tries > 1)
|
|
|
|
ereport(LOG,
|
2017-06-21 21:35:54 +02:00
|
|
|
(errmsg("trying another address for the statistics collector")));
|
2006-10-04 02:30:14 +02:00
|
|
|
|
2003-11-15 18:24:07 +01:00
|
|
|
/*
|
|
|
|
* Create the socket.
|
|
|
|
*/
|
2010-01-31 18:39:34 +01:00
|
|
|
if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
|
2003-11-15 18:24:07 +01:00
|
|
|
{
|
|
|
|
ereport(LOG,
|
|
|
|
(errcode_for_socket_access(),
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
errmsg("could not create socket for statistics collector: %m")));
|
2003-11-15 18:24:07 +01:00
|
|
|
continue;
|
|
|
|
}

        /*
         * Bind it to a kernel assigned port on localhost and get the assigned
         * port via getsockname().
         */
        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
        {
            ereport(LOG,
                    (errcode_for_socket_access(),
                     errmsg("could not bind socket for statistics collector: %m")));
            closesocket(pgStatSock);
            pgStatSock = PGINVALID_SOCKET;
            continue;
        }

        alen = sizeof(pgStatAddr);
        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
        {
            ereport(LOG,
                    (errcode_for_socket_access(),
                     errmsg("could not get address of socket for statistics collector: %m")));
            closesocket(pgStatSock);
            pgStatSock = PGINVALID_SOCKET;
            continue;
        }

        /*
         * Connect the socket to its own address.  This saves a few cycles by
         * not having to respecify the target address on every send.  This
         * also provides a kernel-level check that only packets from this same
         * address will be received.
         */
        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
        {
            ereport(LOG,
                    (errcode_for_socket_access(),
                     errmsg("could not connect socket for statistics collector: %m")));
            closesocket(pgStatSock);
            pgStatSock = PGINVALID_SOCKET;
            continue;
        }

        /*
         * Try to send and receive a one-byte test message on the socket.
         * This is to catch situations where the socket can be created but
         * will not actually pass data (for instance, because kernel packet
         * filtering rules prevent it).
         */
        test_byte = TESTBYTEVAL;

retry1:
        if (send(pgStatSock, &test_byte, 1, 0) != 1)
        {
            if (errno == EINTR)
                goto retry1;    /* if interrupted, just retry */
            ereport(LOG,
                    (errcode_for_socket_access(),
                     errmsg("could not send test message on socket for statistics collector: %m")));
            closesocket(pgStatSock);
            pgStatSock = PGINVALID_SOCKET;
            continue;
        }

        /*
         * There could possibly be a little delay before the message can be
         * received.  We arbitrarily allow up to half a second before deciding
         * it's broken.
         */
        for (;;)                /* need a loop to handle EINTR */
        {
            FD_ZERO(&rset);
            FD_SET(pgStatSock, &rset);

            tv.tv_sec = 0;
            tv.tv_usec = 500000;
            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
            if (sel_res >= 0 || errno != EINTR)
                break;
        }
        if (sel_res < 0)
        {
            ereport(LOG,
                    (errcode_for_socket_access(),
                     errmsg("select() failed in statistics collector: %m")));
            closesocket(pgStatSock);
            pgStatSock = PGINVALID_SOCKET;
            continue;
        }
        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
        {
            /*
             * This is the case we actually think is likely, so take pains to
             * give a specific message for it.
             *
             * errno will not be set meaningfully here, so don't use it.
             */
            ereport(LOG,
                    (errcode(ERRCODE_CONNECTION_FAILURE),
                     errmsg("test message did not get through on socket for statistics collector")));
            closesocket(pgStatSock);
            pgStatSock = PGINVALID_SOCKET;
            continue;
        }

        test_byte++;            /* just make sure variable is changed */

retry2:
        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
        {
            if (errno == EINTR)
                goto retry2;    /* if interrupted, just retry */
            ereport(LOG,
                    (errcode_for_socket_access(),
                     errmsg("could not receive test message on socket for statistics collector: %m")));
            closesocket(pgStatSock);
            pgStatSock = PGINVALID_SOCKET;
            continue;
        }

        if (test_byte != TESTBYTEVAL)   /* strictly paranoia ... */
        {
            ereport(LOG,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("incorrect test message transmission on socket for statistics collector")));
            closesocket(pgStatSock);
            pgStatSock = PGINVALID_SOCKET;
            continue;
        }

        /* If we get here, we have a working socket */
        break;
    }

    /* Did we find a working address? */
    if (!addr || pgStatSock == PGINVALID_SOCKET)
        goto startup_failed;

    /*
     * Set the socket to non-blocking IO.  This ensures that if the collector
     * falls behind, statistics messages will be discarded; backends won't
     * block waiting to send messages to the collector.
     */
    if (!pg_set_noblock(pgStatSock))
    {
        ereport(LOG,
                (errcode_for_socket_access(),
                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
        goto startup_failed;
    }

    /*
     * Try to ensure that the socket's receive buffer is at least
     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
     * data.  Use of UDP protocol means that we are willing to lose data under
     * heavy load, but we don't want it to happen just because of ridiculously
     * small default buffer sizes (such as 8KB on older Windows versions).
     */
    {
        int         old_rcvbuf;
        int         new_rcvbuf;
        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);

        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
        {
            ereport(LOG,
                    (errmsg("getsockopt(%s) failed: %m", "SO_RCVBUF")));
            /* if we can't get existing size, always try to set it */
            old_rcvbuf = 0;
        }

        new_rcvbuf = PGSTAT_MIN_RCVBUF;
        if (old_rcvbuf < new_rcvbuf)
        {
            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
                ereport(LOG,
                        (errmsg("setsockopt(%s) failed: %m", "SO_RCVBUF")));
        }
    }

    pg_freeaddrinfo_all(hints.ai_family, addrs);

    /* Now that we have a long-lived socket, tell fd.c about it. */
    ReserveExternalFD();

    return;

startup_failed:
    ereport(LOG,
            (errmsg("disabling statistics collector for lack of working socket")));

    if (addrs)
        pg_freeaddrinfo_all(hints.ai_family, addrs);

    if (pgStatSock != PGINVALID_SOCKET)
        closesocket(pgStatSock);
    pgStatSock = PGINVALID_SOCKET;

    /*
     * Adjust GUC variables to suppress useless activity, and for debugging
     * purposes (seeing track_counts off is a clue that we failed here).  We
     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
     * on from postgresql.conf without a restart.
     */
    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
}


/*
 * subroutine for pgstat_reset_all
 */
static void
pgstat_reset_remove_files(const char *directory)
{
    DIR        *dir;
    struct dirent *entry;
    char        fname[MAXPGPATH * 2];

    dir = AllocateDir(directory);
    while ((entry = ReadDir(dir, directory)) != NULL)
    {
        int         nchars;
        Oid         tmp_oid;

        /*
         * Skip directory entries that don't match the file names we write.
         * See get_dbstat_filename for the database-specific pattern.
         */
        if (strncmp(entry->d_name, "global.", 7) == 0)
            nchars = 7;
        else
        {
            nchars = 0;
            (void) sscanf(entry->d_name, "db_%u.%n",
                          &tmp_oid, &nchars);
            if (nchars <= 0)
                continue;
            /* %u allows leading whitespace, so reject that */
            if (strchr("0123456789", entry->d_name[3]) == NULL)
                continue;
        }

        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
            strcmp(entry->d_name + nchars, "stat") != 0)
            continue;

        snprintf(fname, sizeof(fname), "%s/%s", directory,
                 entry->d_name);
        unlink(fname);
    }
    FreeDir(dir);
}

/*
 * pgstat_reset_all() -
 *
 * Remove the stats files.  This is currently used only if WAL
 * recovery is needed after a crash.
 */
void
pgstat_reset_all(void)
{
    pgstat_reset_remove_files(pgstat_stat_directory);
    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
}

#ifdef EXEC_BACKEND

/*
 * pgstat_forkexec() -
 *
 * Format up the arglist for, then fork and exec, statistics collector process
 */
static pid_t
pgstat_forkexec(void)
{
    char       *av[10];
    int         ac = 0;

    av[ac++] = "postgres";
    av[ac++] = "--forkcol";
    av[ac++] = NULL;            /* filled in by postmaster_forkexec */

    av[ac] = NULL;
    Assert(ac < lengthof(av));

    return postmaster_forkexec(ac, av);
}
#endif                          /* EXEC_BACKEND */


/*
 * pgstat_start() -
 *
 * Called from postmaster at startup or after an existing collector
 * died.  Attempt to fire up a fresh statistics collector.
 *
 * Returns PID of child process, or 0 if fail.
 *
 * Note: if fail, we will be called again from the postmaster main loop.
 */
int
pgstat_start(void)
{
    time_t      curtime;
    pid_t       pgStatPid;

    /*
     * Check that the socket is there, else pgstat_init failed and we can do
     * nothing useful.
     */
    if (pgStatSock == PGINVALID_SOCKET)
        return 0;
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Do nothing if too soon since last collector start. This is a safety
|
|
|
|
* valve to protect against continuous respawn attempts if the collector
|
2014-05-06 18:12:18 +02:00
|
|
|
* is dying immediately at launch. Note that since we will be re-called
|
2005-10-15 04:49:52 +02:00
|
|
|
* from the postmaster main loop, we will get another chance later.
|
2003-04-26 04:57:14 +02:00
|
|
|
*/
|
|
|
|
curtime = time(NULL);
|
|
|
|
if ((unsigned int) (curtime - last_pgstat_start_time) <
|
|
|
|
(unsigned int) PGSTAT_RESTART_INTERVAL)
|
2004-06-14 20:08:19 +02:00
|
|
|
return 0;
|
2003-04-26 04:57:14 +02:00
|
|
|
last_pgstat_start_time = curtime;
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2004-06-14 20:08:19 +02:00
|
|
|
* Okay, fork off the collector.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2004-01-07 00:15:22 +01:00
|
|
|
#ifdef EXEC_BACKEND
|
2006-06-29 22:00:08 +02:00
|
|
|
switch ((pgStatPid = pgstat_forkexec()))
|
2004-01-07 00:15:22 +01:00
|
|
|
#else
|
2005-04-08 02:55:07 +02:00
|
|
|
switch ((pgStatPid = fork_process()))
|
2004-01-07 00:15:22 +01:00
|
|
|
#endif
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
|
|
|
case -1:
|
2003-07-22 21:00:12 +02:00
|
|
|
ereport(LOG,
|
2006-06-29 22:00:08 +02:00
|
|
|
(errmsg("could not fork statistics collector: %m")));
|
2004-06-14 20:08:19 +02:00
|
|
|
return 0;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2004-01-07 00:15:22 +01:00
|
|
|
#ifndef EXEC_BACKEND
|
2001-06-22 21:18:36 +02:00
|
|
|
case 0:
|
2004-01-07 00:15:22 +01:00
|
|
|
/* in postmaster child ... */
|
2015-01-13 13:12:37 +01:00
|
|
|
InitPostmasterChild();
|
|
|
|
|
2004-05-30 00:48:23 +02:00
|
|
|
/* Close the postmaster's sockets */
|
2004-08-06 01:32:13 +02:00
|
|
|
ClosePostmasterPorts(false);
|
2004-01-07 00:15:22 +01:00
|
|
|
|
|
|
|
/* Drop our connection to postmaster's shared memory, as well */
|
2014-03-18 12:58:53 +01:00
|
|
|
dsm_detach_all();
|
2004-01-07 00:15:22 +01:00
|
|
|
PGSharedMemoryDetach();
|
|
|
|
|
2006-06-29 22:00:08 +02:00
|
|
|
PgstatCollectorMain(0, NULL);
|
2001-06-22 21:18:36 +02:00
|
|
|
break;
|
2004-01-07 00:15:22 +01:00
|
|
|
#endif
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
default:
|
2004-06-14 20:08:19 +02:00
|
|
|
return (int) pgStatPid;
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
2004-06-14 20:08:19 +02:00
|
|
|
/* shouldn't get here */
|
|
|
|
return 0;
|
2001-06-22 21:18:36 +02:00
|
|
|
}
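The restart-throttling logic above can be illustrated as a standalone sketch. This is a minimal stand-alone version of the same unsigned-subtraction interval check; the names `restart_allowed`, `RESTART_INTERVAL`, and `last_start_time` are hypothetical stand-ins for `pgstat_start`'s real state, not PostgreSQL identifiers.

```c
#include <time.h>

#define RESTART_INTERVAL 60		/* stand-in for PGSTAT_RESTART_INTERVAL (seconds) */

static time_t last_start_time = 0;	/* stand-in for last_pgstat_start_time */

/*
 * Return 1 if a (re)start attempt is permitted now, 0 if it is too soon
 * since the previous successful attempt.  The unsigned casts keep the
 * comparison sane even if the clock steps backwards.
 */
int
restart_allowed(time_t now)
{
	if ((unsigned int) (now - last_start_time) <
		(unsigned int) RESTART_INTERVAL)
		return 0;
	last_start_time = now;
	return 1;
}
```

Called with times 1000, 1030, and 1061, this permits the first and third attempts and suppresses the second, mirroring how the postmaster's main loop retries the collector no more than once per interval.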

void
allow_immediate_pgstat_restart(void)
{
	last_pgstat_start_time = 0;
}

/* ------------------------------------------------------------
 * Public functions used by backends follow
 *------------------------------------------------------------
 */

/* ----------
 * pgstat_report_stat() -
 *
 *	Must be called by processes that perform DML: tcop/postgres.c, logical
 *	receiver processes, SPI worker, etc., to send the so-far-collected
 *	per-table and function usage statistics to the collector.  Note that this
 *	is called only when not within a transaction, so it is fair to use
 *	transaction stop time as an approximation of current time.
 * ----------
 */
void
pgstat_report_stat(bool force)
{
	/* we assume this inits to all zeroes: */
	static const PgStat_TableCounts all_zeroes;
	static TimestampTz last_report = 0;

	TimestampTz now;
	PgStat_MsgTabstat regular_msg;
	PgStat_MsgTabstat shared_msg;
	TabStatusArray *tsa;
	int			i;

	/* Don't expend a clock check if nothing to do */
	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
		!have_function_stats)
		return;

	/*
	 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
	 * msec since we last sent one, or the caller wants to force stats out.
	 */
	now = GetCurrentTransactionStopTimestamp();
	if (!force &&
		!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
		return;
	last_report = now;

	/*
	 * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
	 * entries it points to.  (Should we fail partway through the loop below,
	 * it's okay to have removed the hashtable already --- the only
	 * consequence is we'd get multiple entries for the same table in the
	 * pgStatTabList, and that's safe.)
	 */
	if (pgStatTabHash)
		hash_destroy(pgStatTabHash);
	pgStatTabHash = NULL;

	/*
	 * Scan through the TabStatusArray struct(s) to find tables that actually
	 * have counts, and build messages to send.  We have to separate shared
	 * relations from regular ones because the databaseid field in the message
	 * header has to depend on that.
	 */
	regular_msg.m_databaseid = MyDatabaseId;
	shared_msg.m_databaseid = InvalidOid;
	regular_msg.m_nentries = 0;
	shared_msg.m_nentries = 0;

	for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
	{
		for (i = 0; i < tsa->tsa_used; i++)
		{
			PgStat_TableStatus *entry = &tsa->tsa_entries[i];
			PgStat_MsgTabstat *this_msg;
			PgStat_TableEntry *this_ent;

			/* Shouldn't have any pending transaction-dependent counts */
			Assert(entry->trans == NULL);

			/*
			 * Ignore entries that didn't accumulate any actual counts, such
			 * as indexes that were opened by the planner but not used.
			 */
			if (memcmp(&entry->t_counts, &all_zeroes,
					   sizeof(PgStat_TableCounts)) == 0)
				continue;

			/*
			 * OK, insert data into the appropriate message, and send if full.
			 */
			this_msg = entry->t_shared ? &shared_msg : &regular_msg;
			this_ent = &this_msg->m_entry[this_msg->m_nentries];
			this_ent->t_id = entry->t_id;
			memcpy(&this_ent->t_counts, &entry->t_counts,
				   sizeof(PgStat_TableCounts));
			if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
			{
				pgstat_send_tabstat(this_msg);
				this_msg->m_nentries = 0;
			}
		}
		/* zero out PgStat_TableStatus structs after use */
		MemSet(tsa->tsa_entries, 0,
			   tsa->tsa_used * sizeof(PgStat_TableStatus));
		tsa->tsa_used = 0;
	}

	/*
	 * Send partial messages.  Make sure that any pending xact commit/abort
	 * gets counted, even if there are no table stats to send.
	 */
	if (regular_msg.m_nentries > 0 ||
		pgStatXactCommit > 0 || pgStatXactRollback > 0)
		pgstat_send_tabstat(&regular_msg);
	if (shared_msg.m_nentries > 0)
		pgstat_send_tabstat(&shared_msg);

	/* Now, send function statistics */
	pgstat_send_funcstats();

	/* Send WAL statistics */
	pgstat_send_wal();

	/* Finally send SLRU statistics */
	pgstat_send_slru();
}
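The fill-and-flush pattern used for tabstat messages above — append entries, send whenever a message reaches `PGSTAT_NUM_TABENTRIES`, then send any partial remainder — can be illustrated in isolation. The names below (`report`, `send_msg`, `NUM_ENTRIES`, the counters) are hypothetical stand-ins for this sketch only, not pgstat identifiers.

```c
#define NUM_ENTRIES 3			/* tiny stand-in for PGSTAT_NUM_TABENTRIES */

static int	flush_count = 0;	/* messages "sent", including the partial one */
static int	sent_entries = 0;	/* total entries carried by those messages */

struct msg
{
	int			nentries;
	int			entry[NUM_ENTRIES];
};

/* Pretend to transmit the message, then reset it to empty. */
static void
send_msg(struct msg *m)
{
	flush_count++;
	sent_entries += m->nentries;
	m->nentries = 0;
}

/*
 * Add n items one at a time, flushing whenever the buffer fills, then
 * send any partial remainder -- the same shape as the tabstat loop.
 */
void
report(int n)
{
	struct msg	m = {0};
	int			i;

	for (i = 0; i < n; i++)
	{
		m.entry[m.nentries] = i;
		if (++m.nentries >= NUM_ENTRIES)
			send_msg(&m);
	}
	if (m.nentries > 0)
		send_msg(&m);
}
```

With a capacity of 3, reporting 7 items produces two full messages and one partial one; batching this way keeps each datagram bounded while still delivering every entry.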

/*
 * Subroutine for pgstat_report_stat: finish and send a tabstat message
 */
static void
pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
{
	int			n;
	int			len;

	/* It's unlikely we'd get here with no socket, but maybe not impossible */
	if (pgStatSock == PGINVALID_SOCKET)
		return;

	/*
	 * Report and reset accumulated xact commit/rollback and I/O timings
	 * whenever we send a normal tabstat message
	 */
	if (OidIsValid(tsmsg->m_databaseid))
	{
		tsmsg->m_xact_commit = pgStatXactCommit;
		tsmsg->m_xact_rollback = pgStatXactRollback;
		tsmsg->m_block_read_time = pgStatBlockReadTime;
		tsmsg->m_block_write_time = pgStatBlockWriteTime;
		pgStatXactCommit = 0;
		pgStatXactRollback = 0;
		pgStatBlockReadTime = 0;
		pgStatBlockWriteTime = 0;
	}
	else
	{
		tsmsg->m_xact_commit = 0;
		tsmsg->m_xact_rollback = 0;
		tsmsg->m_block_read_time = 0;
		tsmsg->m_block_write_time = 0;
	}

	n = tsmsg->m_nentries;
	len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
		n * sizeof(PgStat_TableEntry);

	pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
	pgstat_send(tsmsg, len);
}

/*
 * Subroutine for pgstat_report_stat: populate and send a function stat message
 */
static void
pgstat_send_funcstats(void)
{
	/* we assume this inits to all zeroes: */
	static const PgStat_FunctionCounts all_zeroes;

	PgStat_MsgFuncstat msg;
	PgStat_BackendFunctionEntry *entry;
	HASH_SEQ_STATUS fstat;

	if (pgStatFunctions == NULL)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
	msg.m_databaseid = MyDatabaseId;
	msg.m_nentries = 0;

	hash_seq_init(&fstat, pgStatFunctions);
	while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
	{
		PgStat_FunctionEntry *m_ent;

		/* Skip it if no counts accumulated since last time */
		if (memcmp(&entry->f_counts, &all_zeroes,
				   sizeof(PgStat_FunctionCounts)) == 0)
			continue;

		/* need to convert format of time accumulators */
		m_ent = &msg.m_entry[msg.m_nentries];
		m_ent->f_id = entry->f_id;
		m_ent->f_numcalls = entry->f_counts.f_numcalls;
		m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
		m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);

		if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
		{
			pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
						msg.m_nentries * sizeof(PgStat_FunctionEntry));
			msg.m_nentries = 0;
		}

		/* reset the entry's counts */
		MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
	}

	if (msg.m_nentries > 0)
		pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
					msg.m_nentries * sizeof(PgStat_FunctionEntry));

	have_function_stats = false;
}
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/* ----------
|
2008-05-15 02:17:41 +02:00
|
|
|
* pgstat_vacuum_stat() -
|
2001-06-22 21:18:36 +02:00
|
|
|
*
|
|
|
|
* Will tell the collector about objects he can get rid of.
|
|
|
|
* ----------
|
|
|
|
*/
|
2006-01-18 21:35:06 +01:00
|
|
|
void
|
2008-05-15 02:17:41 +02:00
|
|
|
pgstat_vacuum_stat(void)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2007-01-12 00:06:03 +01:00
|
|
|
HTAB *htab;
|
2006-01-18 21:35:06 +01:00
|
|
|
PgStat_MsgTabpurge msg;
|
2008-05-15 02:17:41 +02:00
|
|
|
PgStat_MsgFuncpurge f_msg;
|
2001-10-25 07:50:21 +02:00
|
|
|
HASH_SEQ_STATUS hstat;
|
|
|
|
PgStat_StatDBEntry *dbentry;
|
|
|
|
PgStat_StatTabEntry *tabentry;
|
2008-05-15 02:17:41 +02:00
|
|
|
PgStat_StatFuncEntry *funcentry;
|
2001-10-25 07:50:21 +02:00
|
|
|
int len;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2010-01-31 18:39:34 +01:00
|
|
|
if (pgStatSock == PGINVALID_SOCKET)
|
2006-01-18 21:35:06 +01:00
|
|
|
return;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* If not done for this transaction, read the statistics collector stats
|
|
|
|
* file into some hash tables.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2004-07-01 02:52:04 +02:00
|
|
|
backend_read_statsfile();
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/*
|
2006-01-18 21:35:06 +01:00
|
|
|
* Read pg_database and make a list of OIDs of all existing databases
|
|
|
|
*/
|
Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
|
|
|
htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
|
2006-01-18 21:35:06 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Search the database hash table for dead databases and tell the
|
|
|
|
* collector to drop them.
|
|
|
|
*/
|
|
|
|
hash_seq_init(&hstat, pgStatDBHash);
|
|
|
|
while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
|
|
|
|
{
|
|
|
|
Oid dbid = dbentry->databaseid;
|
|
|
|
|
2007-01-12 00:06:03 +01:00
|
|
|
CHECK_FOR_INTERRUPTS();
|
|
|
|
|
2007-06-07 20:53:17 +02:00
|
|
|
/* the DB entry for shared tables (with InvalidOid) is never dropped */
|
|
|
|
if (OidIsValid(dbid) &&
|
|
|
|
hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
|
2006-01-18 21:35:06 +01:00
|
|
|
pgstat_drop_database(dbid);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Clean up */
|
2007-01-12 00:06:03 +01:00
|
|
|
hash_destroy(htab);
|
2006-01-18 21:35:06 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Lookup our own database entry; if not found, nothing more to do.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2001-10-25 07:50:21 +02:00
|
|
|
dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
|
|
|
|
(void *) &MyDatabaseId,
|
|
|
|
HASH_FIND, NULL);
|
2006-01-18 21:35:06 +01:00
|
|
|
if (dbentry == NULL || dbentry->tables == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Similarly to above, make a list of all known relations in this DB.
|
|
|
|
*/
|
Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
|
|
|
htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize our messages table counter to zero
|
|
|
|
*/
|
|
|
|
msg.m_nentries = 0;
|
|
|
|
|
|
|
|
/*
|
2005-07-29 21:30:09 +02:00
|
|
|
* Check for all tables listed in stats hashtable if they still exist.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2001-10-25 07:50:21 +02:00
|
|
|
hash_seq_init(&hstat, dbentry->tables);
|
2001-10-05 19:28:13 +02:00
|
|
|
while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2007-01-12 00:06:03 +01:00
|
|
|
Oid tabid = tabentry->tableid;
|
|
|
|
|
|
|
|
CHECK_FOR_INTERRUPTS();
|
|
|
|
|
|
|
|
if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
|
2001-06-22 21:18:36 +02:00
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
2006-01-18 21:35:06 +01:00
|
|
|
* Not there, so add this table's Oid to the message
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2007-01-12 00:06:03 +01:00
|
|
|
msg.m_tableid[msg.m_nentries++] = tabid;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/*
|
2006-01-18 21:35:06 +01:00
|
|
|
* If the message is full, send it out and reinitialize to empty
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
|
|
|
if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
|
|
|
|
{
|
2001-11-26 23:31:08 +01:00
|
|
|
len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
|
2017-06-21 20:39:04 +02:00
|
|
|
+ msg.m_nentries * sizeof(Oid);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
|
2005-07-29 21:30:09 +02:00
|
|
|
msg.m_databaseid = MyDatabaseId;
|
2001-06-22 21:18:36 +02:00
|
|
|
pgstat_send(&msg, len);
|
|
|
|
|
|
|
|
msg.m_nentries = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Send the rest
|
|
|
|
*/
|
|
|
|
if (msg.m_nentries > 0)
|
|
|
|
{
|
2001-11-26 23:31:08 +01:00
|
|
|
len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
|
2017-06-21 20:39:04 +02:00
|
|
|
+ msg.m_nentries * sizeof(Oid);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
|
2005-05-11 03:41:41 +02:00
|
|
|
msg.m_databaseid = MyDatabaseId;
|
2001-06-22 21:18:36 +02:00
|
|
|
pgstat_send(&msg, len);
|
|
|
|
}
|
|
|
|
|
2006-01-18 21:35:06 +01:00
|
|
|
/* Clean up */
|
2007-01-12 00:06:03 +01:00
|
|
|
hash_destroy(htab);
|
2008-05-15 02:17:41 +02:00
|
|
|
|
|
|
|
/*
|
2008-12-08 16:44:54 +01:00
|
|
|
* Now repeat the above steps for functions. However, we needn't bother
|
|
|
|
* in the common case where no function stats are being collected.
|
2008-05-15 02:17:41 +02:00
|
|
|
*/
|
2008-12-08 16:44:54 +01:00
|
|
|
if (dbentry->functions != NULL &&
|
|
|
|
hash_get_num_entries(dbentry->functions) > 0)
|
|
|
|
{
|
Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
|
|
|
htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
|
2008-05-15 02:17:41 +02:00
|
|
|
|
2008-12-08 16:44:54 +01:00
|
|
|
pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
|
|
|
|
f_msg.m_databaseid = MyDatabaseId;
|
|
|
|
f_msg.m_nentries = 0;
|
2008-05-15 02:17:41 +02:00
|
|
|
|
2008-12-08 16:44:54 +01:00
|
|
|
hash_seq_init(&hstat, dbentry->functions);
|
|
|
|
while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
|
|
|
|
{
|
|
|
|
Oid funcid = funcentry->functionid;
|
2008-05-15 02:17:41 +02:00
|
|
|
|
2008-12-08 16:44:54 +01:00
|
|
|
CHECK_FOR_INTERRUPTS();
|
2008-05-15 02:17:41 +02:00
|
|
|
|
2008-12-08 16:44:54 +01:00
|
|
|
if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
|
|
|
|
continue;
|
2008-05-15 02:17:41 +02:00
|
|
|
|
2008-12-08 16:44:54 +01:00
|
|
|
/*
|
|
|
|
* Not there, so add this function's Oid to the message
|
|
|
|
*/
|
|
|
|
f_msg.m_functionid[f_msg.m_nentries++] = funcid;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the message is full, send it out and reinitialize to empty
|
|
|
|
*/
|
|
|
|
if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
|
|
|
|
{
|
|
|
|
len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
|
2017-06-21 20:39:04 +02:00
|
|
|
+ f_msg.m_nentries * sizeof(Oid);
|
2008-12-08 16:44:54 +01:00
|
|
|
|
|
|
|
pgstat_send(&f_msg, len);
|
|
|
|
|
|
|
|
f_msg.m_nentries = 0;
|
|
|
|
}
|
|
|
|
}
|
2008-05-15 02:17:41 +02:00
|
|
|
|
|
|
|
/*
|
2008-12-08 16:44:54 +01:00
|
|
|
* Send the rest
|
2008-05-15 02:17:41 +02:00
|
|
|
*/
|
2008-12-08 16:44:54 +01:00
|
|
|
if (f_msg.m_nentries > 0)
|
2008-05-15 02:17:41 +02:00
|
|
|
{
|
|
|
|
len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
|
2017-06-21 20:39:04 +02:00
|
|
|
+ f_msg.m_nentries * sizeof(Oid);
|
2008-05-15 02:17:41 +02:00
|
|
|
|
|
|
|
pgstat_send(&f_msg, len);
|
|
|
|
}
|
|
|
|
|
2008-12-08 16:44:54 +01:00
|
|
|
hash_destroy(htab);
|
2008-05-15 02:17:41 +02:00
|
|
|
}
|
2007-01-12 00:06:03 +01:00
|
|
|
}
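The purge loop above follows a common batching pattern: accumulate Oids into a fixed-size message, send whenever it fills up, then send whatever remains. A minimal, self-contained sketch of that pattern (plain C, not PostgreSQL code; `MSG_CAPACITY`, `send_msg()`, and `purge_ids()` are invented stand-ins for `PGSTAT_NUM_FUNCPURGE` and `pgstat_send()`):

```c
#include <assert.h>

/* Minimal sketch of the purge-message batching pattern:
 * fill a fixed-size message, send it when full, then send
 * the remainder.  MSG_CAPACITY stands in for
 * PGSTAT_NUM_FUNCPURGE; send_msg() for pgstat_send(). */
#define MSG_CAPACITY 4

typedef struct PurgeMsg
{
    int         n_entries;
    unsigned    ids[MSG_CAPACITY];
} PurgeMsg;

static int  sends;              /* how many messages went out */

static void
send_msg(PurgeMsg *msg)
{
    sends++;                    /* stand-in for the real send */
    msg->n_entries = 0;         /* reinitialize to empty */
}

/* Batch nids ids into messages; returns the number of sends. */
static int
purge_ids(const unsigned *ids, int nids)
{
    PurgeMsg    msg = {0};

    sends = 0;
    for (int i = 0; i < nids; i++)
    {
        msg.ids[msg.n_entries++] = ids[i];
        if (msg.n_entries >= MSG_CAPACITY)
            send_msg(&msg);     /* message full: send and reset */
    }
    if (msg.n_entries > 0)
        send_msg(&msg);         /* send the rest */
    return sends;
}
```

With a capacity of 4, ten ids go out as two full messages plus one partial one, mirroring the "if full, send and reinitialize" and "send the rest" steps in the real code.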
|
|
|
|
|
|
|
|
|
|
|
|
/* ----------
|
|
|
|
* pgstat_collect_oids() -
|
|
|
|
*
|
2008-05-15 02:17:41 +02:00
|
|
|
* Collect the OIDs of all objects listed in the specified system catalog
|
|
|
|
* into a temporary hash table. Caller should hash_destroy the result
|
2014-05-06 18:12:18 +02:00
|
|
|
* when done with it. (However, we make the table in CurrentMemoryContext
|
2009-12-27 20:40:07 +01:00
|
|
|
 * so that it will be freed properly in the event of an error.)
|
2007-01-12 00:06:03 +01:00
|
|
|
* ----------
|
|
|
|
*/
|
|
|
|
static HTAB *
|
Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
was already painful for the existing code, but the upcoming work aiming to
make table storage pluggable would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring a pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load a binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot of code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore; all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for,
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd technically not be hard to hide the oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to merge this
now. It's painful to maintain externally, too complicated to commit
after the code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
|
|
|
pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
|
2007-01-12 00:06:03 +01:00
|
|
|
{
|
|
|
|
HTAB *htab;
|
|
|
|
HASHCTL hash_ctl;
|
|
|
|
Relation rel;
|
tableam: Add and use scan APIs.
To allow table accesses to not be directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends needs
to be able to do so without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
initialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AM's
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
also would have been needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will do so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
|
|
|
TableScanDesc scan;
|
2007-01-12 00:06:03 +01:00
|
|
|
HeapTuple tup;
|
Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue that has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
2013-07-02 15:47:01 +02:00
|
|
|
Snapshot snapshot;
|
2007-01-12 00:06:03 +01:00
|
|
|
|
|
|
|
hash_ctl.keysize = sizeof(Oid);
|
|
|
|
hash_ctl.entrysize = sizeof(Oid);
|
2009-12-27 20:40:07 +01:00
|
|
|
hash_ctl.hcxt = CurrentMemoryContext;
|
2007-01-12 00:06:03 +01:00
|
|
|
htab = hash_create("Temporary table of OIDs",
|
|
|
|
PGSTAT_TAB_HASH_SIZE,
|
|
|
|
&hash_ctl,
|
Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
|
|
|
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
|
2007-01-12 00:06:03 +01:00
|
|
|
|
2019-01-21 19:32:19 +01:00
|
|
|
rel = table_open(catalogid, AccessShareLock);
|
Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue that has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
2013-07-02 15:47:01 +02:00
|
|
|
snapshot = RegisterSnapshot(GetLatestSnapshot());
|
tableam: Add and use scan APIs.
To allow table accesses to not be directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends needs
to be able to do so without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
initialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AM's
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
also would have been needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will do so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
|
|
|
scan = table_beginscan(rel, snapshot, 0, NULL);
|
2007-01-12 00:06:03 +01:00
|
|
|
while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
|
|
|
|
{
|
Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
was already painful for the existing code, but the upcoming work aiming to
make table storage pluggable would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring a pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load a binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot of code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore; all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for,
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd technically not be hard to hide the oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to merge this
now. It's painful to maintain externally, too complicated to commit
after the code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
|
|
|
Oid thisoid;
|
|
|
|
bool isnull;
|
|
|
|
|
|
|
|
thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
|
|
|
|
Assert(!isnull);
|
2007-01-12 00:06:03 +01:00
|
|
|
|
|
|
|
CHECK_FOR_INTERRUPTS();
|
|
|
|
|
|
|
|
(void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
|
|
|
|
}
|
tableam: Add and use scan APIs.
To allow table accesses to not be directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends needs
to be able to do so without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
initialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AM's
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
also would have been needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will do so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
|
|
|
table_endscan(scan);
|
Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue that has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
2013-07-02 15:47:01 +02:00
|
|
|
UnregisterSnapshot(snapshot);
|
2019-01-21 19:32:19 +01:00
|
|
|
table_close(rel, AccessShareLock);
|
2007-01-12 00:06:03 +01:00
|
|
|
|
|
|
|
return htab;
|
2001-06-22 21:18:36 +02:00
|
|
|
}
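pgstat_collect_oids() uses its dynahash table purely as a set of Oids: entries go in with HASH_ENTER and callers probe membership with HASH_FIND (as in the funcpurge loop above). As a rough illustration of that set semantics, here is a tiny open-addressing Oid set; this is a hypothetical stand-in, not dynahash, and `SET_SIZE` and all names are invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a dynahash table used as an Oid set.
 * Slot value 0 (InvalidOid) marks an empty slot. */
typedef unsigned int Oid;

#define SET_SIZE 64             /* power of two, well above demo load */

typedef struct OidSet
{
    Oid         slots[SET_SIZE];
} OidSet;

static void
oidset_init(OidSet *set)
{
    memset(set->slots, 0, sizeof(set->slots));
}

/* Insert, like hash_search(..., HASH_ENTER, ...); duplicates are no-ops. */
static void
oidset_enter(OidSet *set, Oid oid)
{
    unsigned int i = oid % SET_SIZE;

    while (set->slots[i] != 0 && set->slots[i] != oid)
        i = (i + 1) % SET_SIZE; /* linear probing past collisions */
    set->slots[i] = oid;
}

/* Probe, like hash_search(..., HASH_FIND, ...). */
static bool
oidset_find(const OidSet *set, Oid oid)
{
    unsigned int i = oid % SET_SIZE;

    while (set->slots[i] != 0)
    {
        if (set->slots[i] == oid)
            return true;
        i = (i + 1) % SET_SIZE;
    }
    return false;
}
```

The real code gets all of this (plus memory-context placement via HASH_CONTEXT and built-in binary-key hashing via HASH_BLOBS) from hash_create().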
|
|
|
|
|
|
|
|
|
|
|
|
/* ----------
|
|
|
|
* pgstat_drop_database() -
|
|
|
|
*
|
|
|
|
* Tell the collector that we just dropped a database.
|
2006-01-18 21:35:06 +01:00
|
|
|
* (If the message gets lost, we will still clean the dead DB eventually
|
2008-05-15 02:17:41 +02:00
|
|
|
* via future invocations of pgstat_vacuum_stat().)
|
2001-06-22 21:18:36 +02:00
|
|
|
* ----------
|
|
|
|
*/
|
2007-02-09 17:12:19 +01:00
|
|
|
void
|
2001-06-22 21:18:36 +02:00
|
|
|
pgstat_drop_database(Oid databaseid)
|
|
|
|
{
|
2001-10-25 07:50:21 +02:00
|
|
|
PgStat_MsgDropdb msg;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2010-01-31 18:39:34 +01:00
|
|
|
if (pgStatSock == PGINVALID_SOCKET)
|
2001-06-22 21:18:36 +02:00
|
|
|
return;
|
|
|
|
|
|
|
|
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
|
2005-05-11 03:41:41 +02:00
|
|
|
msg.m_databaseid = databaseid;
|
2001-06-22 21:18:36 +02:00
|
|
|
pgstat_send(&msg, sizeof(msg));
|
|
|
|
}
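The comment above notes that a lost DROPDB message is harmless because pgstat_vacuum_stat() eventually removes stats for databases that no longer exist. A toy model of that "lossy message plus idempotent sweep" design (all names invented, not PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a drop message may be lost, but a later
 * pgstat_vacuum_stat()-style sweep removes any stats entry whose
 * database is gone, so the end state is the same either way. */
#define MAX_DBS 8

typedef struct Model
{
    bool        db_exists[MAX_DBS];     /* the "catalog" */
    bool        stats_entry[MAX_DBS];   /* the collector's state */
} Model;

static void
drop_database(Model *m, int db, bool msg_lost)
{
    m->db_exists[db] = false;
    if (!msg_lost)              /* message delivered: drop entry now */
        m->stats_entry[db] = false;
}

/* Like pgstat_vacuum_stat(): sweep stats for databases that are gone. */
static void
vacuum_stat(Model *m)
{
    for (int db = 0; db < MAX_DBS; db++)
        if (!m->db_exists[db])
            m->stats_entry[db] = false;
}
```

Because the sweep only consults the catalog, it converges to the correct state no matter how many drop messages were dropped on the floor.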
|
|
|
|
|
|
|
|
|
2006-01-18 21:35:06 +01:00
|
|
|
/* ----------
 * pgstat_drop_relation() -
 *
 *	Tell the collector that we just dropped a relation.
 *	(If the message gets lost, we will still clean the dead entry eventually
 *	via future invocations of pgstat_vacuum_stat().)
 *
 *	Currently not used for lack of any good place to call it; we rely
 *	entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
 * ----------
 */
#ifdef NOT_USED
void
pgstat_drop_relation(Oid relid)
{
	PgStat_MsgTabpurge msg;
	int			len;

	if (pgStatSock == PGINVALID_SOCKET)
		return;

	msg.m_tableid[0] = relid;
	msg.m_nentries = 1;

	len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
	msg.m_databaseid = MyDatabaseId;
	pgstat_send(&msg, len);
}
#endif							/* NOT_USED */

/* ----------
 * pgstat_reset_counters() -
 *
 *	Tell the statistics collector to reset counters for our database.
 *
 *	Permission checking for this function is managed through the normal
 *	GRANT system.
 * ----------
 */
void
pgstat_reset_counters(void)
{
	PgStat_MsgResetcounter msg;

	if (pgStatSock == PGINVALID_SOCKET)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
	msg.m_databaseid = MyDatabaseId;
	pgstat_send(&msg, sizeof(msg));
}

/* ----------
 * pgstat_reset_shared_counters() -
 *
 *	Tell the statistics collector to reset cluster-wide shared counters.
 *
 *	Permission checking for this function is managed through the normal
 *	GRANT system.
 * ----------
 */
void
pgstat_reset_shared_counters(const char *target)
{
	PgStat_MsgResetsharedcounter msg;

	if (pgStatSock == PGINVALID_SOCKET)
		return;

	if (strcmp(target, "archiver") == 0)
		msg.m_resettarget = RESET_ARCHIVER;
	else if (strcmp(target, "bgwriter") == 0)
		msg.m_resettarget = RESET_BGWRITER;
	else if (strcmp(target, "wal") == 0)
		msg.m_resettarget = RESET_WAL;
	else
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("unrecognized reset target: \"%s\"", target),
				 errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
	pgstat_send(&msg, sizeof(msg));
}

/* ----------
 * pgstat_reset_single_counter() -
 *
 *	Tell the statistics collector to reset a single counter.
 *
 *	Permission checking for this function is managed through the normal
 *	GRANT system.
 * ----------
 */
void
pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
{
	PgStat_MsgResetsinglecounter msg;

	if (pgStatSock == PGINVALID_SOCKET)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
	msg.m_databaseid = MyDatabaseId;
	msg.m_resettype = type;
	msg.m_objectid = objoid;

	pgstat_send(&msg, sizeof(msg));
}

/* ----------
 * pgstat_reset_slru_counter() -
 *
 *	Tell the statistics collector to reset a single SLRU counter, or all
 *	SLRU counters (when name is null).
 *
 *	Permission checking for this function is managed through the normal
 *	GRANT system.
 * ----------
 */
void
pgstat_reset_slru_counter(const char *name)
{
	PgStat_MsgResetslrucounter msg;

	if (pgStatSock == PGINVALID_SOCKET)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
	msg.m_index = (name) ? pgstat_slru_index(name) : -1;

	pgstat_send(&msg, sizeof(msg));
}

/* ----------
 * pgstat_reset_replslot_counter() -
 *
 *	Tell the statistics collector to reset a single replication slot
 *	counter, or all replication slot counters (when name is null).
 *
 *	Permission checking for this function is managed through the normal
 *	GRANT system.
 * ----------
 */
void
pgstat_reset_replslot_counter(const char *name)
{
	PgStat_MsgResetreplslotcounter msg;

	if (pgStatSock == PGINVALID_SOCKET)
		return;

	if (name)
	{
		ReplicationSlot *slot;

		/*
		 * Check if a slot exists with the given name. It is possible that by
		 * the time this message is executed the slot has been dropped, but
		 * at least this check ensures that the given name is for a valid
		 * slot.
		 */
		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
		slot = SearchNamedReplicationSlot(name);
		LWLockRelease(ReplicationSlotControlLock);

		if (!slot)
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
					 errmsg("replication slot \"%s\" does not exist",
							name)));

		/*
		 * Nothing to do for physical slots as we collect stats only for
		 * logical slots.
		 */
		if (SlotIsPhysical(slot))
			return;

		strlcpy(msg.m_slotname, name, NAMEDATALEN);
		msg.clearall = false;
	}
	else
		msg.clearall = true;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);

	pgstat_send(&msg, sizeof(msg));
}

/* ----------
 * pgstat_report_autovac() -
 *
 *	Called from autovacuum.c to report startup of an autovacuum process.
 *	We are called before InitPostgres is done, so can't rely on MyDatabaseId;
 *	the db OID must be passed in, instead.
 * ----------
 */
void
pgstat_report_autovac(Oid dboid)
{
	PgStat_MsgAutovacStart msg;

	if (pgStatSock == PGINVALID_SOCKET)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
	msg.m_databaseid = dboid;
	msg.m_start_time = GetCurrentTimestamp();

	pgstat_send(&msg, sizeof(msg));
}

/* ---------
 * pgstat_report_vacuum() -
 *
 *	Tell the collector about the table we just vacuumed.
 * ---------
 */
void
pgstat_report_vacuum(Oid tableoid, bool shared,
					 PgStat_Counter livetuples, PgStat_Counter deadtuples)
{
	PgStat_MsgVacuum msg;

	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
	msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
	msg.m_tableoid = tableoid;
	msg.m_autovacuum = IsAutoVacuumWorkerProcess();
	msg.m_vacuumtime = GetCurrentTimestamp();
	msg.m_live_tuples = livetuples;
	msg.m_dead_tuples = deadtuples;
	pgstat_send(&msg, sizeof(msg));
}

/* --------
 * pgstat_report_analyze() -
 *
 *	Tell the collector about the table we just analyzed.
 *
 *	Caller must provide new live- and dead-tuples estimates, as well as a
 *	flag indicating whether to reset the changes_since_analyze counter.
 * --------
 */
void
pgstat_report_analyze(Relation rel,
					  PgStat_Counter livetuples, PgStat_Counter deadtuples,
					  bool resetcounter)
{
	PgStat_MsgAnalyze msg;

	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
		return;

	/*
	 * Unlike VACUUM, ANALYZE might be running inside a transaction that has
	 * already inserted and/or deleted rows in the target table. ANALYZE will
	 * have counted such rows as live or dead respectively. Because we will
	 * report our counts of such rows at transaction end, we should subtract
	 * off these counts from what we send to the collector now, else they'll
	 * be double-counted after commit.  (This approach also ensures that the
	 * collector ends up with the right numbers if we abort instead of
	 * committing.)
	 */
	if (rel->pgstat_info != NULL)
	{
		PgStat_TableXactStatus *trans;

		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
		{
			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
		}
		/* count stuff inserted by already-aborted subxacts, too */
		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
		/* Since ANALYZE's counts are estimates, we could have underflowed */
		livetuples = Max(livetuples, 0);
		deadtuples = Max(deadtuples, 0);
	}

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
	msg.m_tableoid = RelationGetRelid(rel);
	msg.m_autovacuum = IsAutoVacuumWorkerProcess();
	msg.m_resetcounter = resetcounter;
	msg.m_analyzetime = GetCurrentTimestamp();
	msg.m_live_tuples = livetuples;
	msg.m_dead_tuples = deadtuples;
	pgstat_send(&msg, sizeof(msg));
}

/* --------
 * pgstat_report_recovery_conflict() -
 *
 *	Tell the collector about a Hot Standby recovery conflict.
 * --------
 */
void
pgstat_report_recovery_conflict(int reason)
{
	PgStat_MsgRecoveryConflict msg;

	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
	msg.m_databaseid = MyDatabaseId;
	msg.m_reason = reason;
	pgstat_send(&msg, sizeof(msg));
}

/* --------
 * pgstat_report_deadlock() -
 *
 *	Tell the collector about a deadlock detected.
 * --------
 */
void
pgstat_report_deadlock(void)
{
	PgStat_MsgDeadlock msg;

	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
	msg.m_databaseid = MyDatabaseId;
	pgstat_send(&msg, sizeof(msg));
}

/* --------
 * pgstat_report_checksum_failures_in_db() -
 *
 *	Tell the collector about one or more checksum failures.
 * --------
 */
void
pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
{
	PgStat_MsgChecksumFailure msg;

	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
	msg.m_databaseid = dboid;
	msg.m_failurecount = failurecount;
	msg.m_failure_time = GetCurrentTimestamp();

	pgstat_send(&msg, sizeof(msg));
}

/* --------
 * pgstat_report_checksum_failure() -
 *
 *	Tell the collector about a checksum failure.
 * --------
 */
void
pgstat_report_checksum_failure(void)
{
	pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
}

/* --------
 * pgstat_report_tempfile() -
 *
 *	Tell the collector about a temporary file.
 * --------
 */
void
pgstat_report_tempfile(size_t filesize)
{
	PgStat_MsgTempFile msg;

	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
	msg.m_databaseid = MyDatabaseId;
	msg.m_filesize = filesize;
	pgstat_send(&msg, sizeof(msg));
}

/* ----------
 * pgstat_report_replslot() -
 *
 *	Tell the collector about replication slot statistics.
 * ----------
 */
void
pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
					   int spillbytes, int streamtxns, int streamcount,
					   int streambytes)
{
	PgStat_MsgReplSlot msg;

	/*
	 * Prepare and send the message
	 */
	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
	strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
	msg.m_drop = false;
	msg.m_spill_txns = spilltxns;
	msg.m_spill_count = spillcount;
	msg.m_spill_bytes = spillbytes;
	msg.m_stream_txns = streamtxns;
	msg.m_stream_count = streamcount;
	msg.m_stream_bytes = streambytes;
	pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
}

/* ----------
 * pgstat_report_replslot_drop() -
 *
 *	Tell the collector about dropping the replication slot.
 * ----------
 */
void
pgstat_report_replslot_drop(const char *slotname)
{
	PgStat_MsgReplSlot msg;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
	strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
	msg.m_drop = true;
	pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
}

/* ----------
 * pgstat_ping() -
 *
 *	Send some junk data to the collector to increase traffic.
 * ----------
 */
void
pgstat_ping(void)
{
	PgStat_MsgDummy msg;

	if (pgStatSock == PGINVALID_SOCKET)
		return;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
	pgstat_send(&msg, sizeof(msg));
}

/* ----------
 * pgstat_send_inquiry() -
 *
 *	Notify collector that we need fresh data.
 * ----------
 */
static void
pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
	PgStat_MsgInquiry msg;

	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
	msg.clock_time = clock_time;
	msg.cutoff_time = cutoff_time;
	msg.databaseid = databaseid;
	pgstat_send(&msg, sizeof(msg));
}

/*
 * Initialize function call usage data.
 * Called by the executor before invoking a function.
 */
void
pgstat_init_function_usage(FunctionCallInfo fcinfo,
						   PgStat_FunctionCallUsage *fcu)
{
	PgStat_BackendFunctionEntry *htabent;
	bool		found;

	if (pgstat_track_functions <= fcinfo->flinfo->fn_stats)
	{
		/* stats not wanted */
		fcu->fs = NULL;
		return;
	}

	if (!pgStatFunctions)
	{
		/* First time through - initialize function stat table */
		HASHCTL		hash_ctl;

		hash_ctl.keysize = sizeof(Oid);
		hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
		pgStatFunctions = hash_create("Function stat entries",
									  PGSTAT_FUNCTION_HASH_SIZE,
									  &hash_ctl,
									  HASH_ELEM | HASH_BLOBS);
	}

	/* Get the stats entry for this function, create if necessary */
	htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
						  HASH_ENTER, &found);
	if (!found)
		MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));

	fcu->fs = &htabent->f_counts;

	/* save stats for this function, later used to compensate for recursion */
	fcu->save_f_total_time = htabent->f_counts.f_total_time;

	/* save current backend-wide total time */
	fcu->save_total = total_func_time;

	/* get clock time as of function start */
|
|
|
|
INSTR_TIME_SET_CURRENT(fcu->f_start);
|
|
|
|
}
|
|
|
|
|
2010-08-08 18:27:06 +02:00
|
|
|
/*
|
|
|
|
* find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
|
|
|
|
* for specified function
|
|
|
|
*
|
|
|
|
* If no entry, return NULL; don't create a new one
|
|
|
|
*/
|
|
|
|
PgStat_BackendFunctionEntry *
|
|
|
|
find_funcstat_entry(Oid func_id)
|
|
|
|
{
|
|
|
|
if (pgStatFunctions == NULL)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
|
|
|
|
(void *) &func_id,
|
|
|
|
HASH_FIND, NULL);
|
|
|
|
}
|
|
|
|
|
2008-05-15 02:17:41 +02:00
|
|
|
/*
|
|
|
|
* Calculate function call usage and update stat counters.
|
|
|
|
* Called by the executor after invoking a function.
|
|
|
|
*
|
|
|
|
* In the case of a set-returning function that runs in value-per-call mode,
|
|
|
|
* we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
|
|
|
|
* calls for what the user considers a single call of the function. The
|
|
|
|
* finalize flag should be TRUE on the last call.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
|
|
|
|
{
|
|
|
|
PgStat_FunctionCounts *fs = fcu->fs;
|
|
|
|
instr_time f_total;
|
|
|
|
instr_time f_others;
|
|
|
|
instr_time f_self;
|
|
|
|
|
|
|
|
/* stats not wanted? */
|
|
|
|
if (fs == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* total elapsed time in this function call */
|
|
|
|
INSTR_TIME_SET_CURRENT(f_total);
|
|
|
|
INSTR_TIME_SUBTRACT(f_total, fcu->f_start);
|
|
|
|
|
|
|
|
/* self usage: elapsed minus anything already charged to other calls */
|
|
|
|
f_others = total_func_time;
|
|
|
|
INSTR_TIME_SUBTRACT(f_others, fcu->save_total);
|
|
|
|
f_self = f_total;
|
|
|
|
INSTR_TIME_SUBTRACT(f_self, f_others);
|
|
|
|
|
|
|
|
/* update backend-wide total time */
|
|
|
|
INSTR_TIME_ADD(total_func_time, f_self);
|
|
|
|
|
|
|
|
/*
|
2012-04-30 20:02:47 +02:00
|
|
|
* Compute the new f_total_time as the total elapsed time added to the
|
2014-05-06 18:12:18 +02:00
|
|
|
* pre-call value of f_total_time. This is necessary to avoid
|
2012-04-30 20:02:47 +02:00
|
|
|
* double-counting any time taken by recursive calls of myself. (We do
|
|
|
|
* not need any similar kluge for self time, since that already excludes
|
|
|
|
* any recursive calls.)
|
2008-05-15 02:17:41 +02:00
|
|
|
*/
|
2012-04-30 20:02:47 +02:00
|
|
|
INSTR_TIME_ADD(f_total, fcu->save_f_total_time);
|
2008-05-15 02:17:41 +02:00
|
|
|
|
|
|
|
/* update counters in function stats table */
|
|
|
|
if (finalize)
|
|
|
|
fs->f_numcalls++;
|
2012-04-30 20:02:47 +02:00
|
|
|
fs->f_total_time = f_total;
|
|
|
|
INSTR_TIME_ADD(fs->f_self_time, f_self);
|
2008-11-03 02:17:08 +01:00
|
|
|
|
|
|
|
/* indicate that we have something to send */
|
|
|
|
have_function_stats = true;
|
2008-05-15 02:17:41 +02:00
|
|
|
}
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/* ----------
|
|
|
|
* pgstat_initstats() -
|
|
|
|
*
|
2007-05-27 05:50:39 +02:00
|
|
|
* Initialize a relcache entry to count access statistics.
|
|
|
|
* Called whenever a relation is opened.
|
2007-04-21 06:10:53 +02:00
|
|
|
*
|
|
|
|
* We assume that a relcache entry's pgstat_info field is zeroed by
|
|
|
|
* relcache.c when the relcache entry is made; thereafter it is long-lived
|
2007-05-27 05:50:39 +02:00
|
|
|
* data. We can avoid repeated searches of the TabStatus arrays when the
|
2007-04-21 06:10:53 +02:00
|
|
|
* same relation is touched repeatedly within a transaction.
|
2001-06-22 21:18:36 +02:00
|
|
|
* ----------
|
|
|
|
*/
|
|
|
|
void
|
2007-05-27 05:50:39 +02:00
|
|
|
pgstat_initstats(Relation rel)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2001-10-25 07:50:21 +02:00
|
|
|
Oid rel_id = rel->rd_id;
|
2007-05-27 05:50:39 +02:00
|
|
|
char relkind = rel->rd_rel->relkind;
|
|
|
|
|
|
|
|
/* We only count stats for things that have storage */
|
2020-06-12 08:51:16 +02:00
|
|
|
if (!RELKIND_HAS_STORAGE(relkind))
|
2007-05-27 05:50:39 +02:00
|
|
|
{
|
|
|
|
rel->pgstat_info = NULL;
|
|
|
|
return;
|
|
|
|
}
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2010-01-31 18:39:34 +01:00
|
|
|
if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
|
2007-04-21 06:10:53 +02:00
|
|
|
{
|
2007-05-27 05:50:39 +02:00
|
|
|
/* We're not counting at all */
|
|
|
|
rel->pgstat_info = NULL;
|
2001-06-22 21:18:36 +02:00
|
|
|
return;
|
2007-04-21 06:10:53 +02:00
|
|
|
}
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2007-04-21 06:10:53 +02:00
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* If we already set up this relation in the current transaction, nothing
|
|
|
|
* to do.
|
2007-04-21 06:10:53 +02:00
|
|
|
*/
|
2007-05-27 05:50:39 +02:00
|
|
|
if (rel->pgstat_info != NULL &&
|
|
|
|
rel->pgstat_info->t_id == rel_id)
|
2007-04-21 06:10:53 +02:00
|
|
|
return;
|
2007-05-27 05:50:39 +02:00
|
|
|
|
|
|
|
/* Else find or make the PgStat_TableStatus entry, and update link */
|
|
|
|
rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
|
|
|
|
*/
|
|
|
|
static PgStat_TableStatus *
|
|
|
|
get_tabstat_entry(Oid rel_id, bool isshared)
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
TabStatHashEntry *hash_entry;
|
2007-05-27 05:50:39 +02:00
|
|
|
PgStat_TableStatus *entry;
|
|
|
|
TabStatusArray *tsa;
|
2017-05-17 22:31:56 +02:00
|
|
|
bool found;
|
2017-03-27 17:34:42 +02:00
|
|
|
|
2017-05-15 04:52:41 +02:00
|
|
|
/*
|
|
|
|
* Create hash table if we don't have it already.
|
|
|
|
*/
|
|
|
|
if (pgStatTabHash == NULL)
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
HASHCTL ctl;
|
2017-05-15 04:52:41 +02:00
|
|
|
|
|
|
|
ctl.keysize = sizeof(Oid);
|
|
|
|
ctl.entrysize = sizeof(TabStatHashEntry);
|
|
|
|
|
|
|
|
pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
|
|
|
|
TABSTAT_QUANTUM,
|
|
|
|
&ctl,
|
|
|
|
HASH_ELEM | HASH_BLOBS);
|
|
|
|
}
|
2017-03-27 17:34:42 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Find an entry or create a new one.
|
|
|
|
*/
|
|
|
|
hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
|
2017-05-15 04:52:41 +02:00
|
|
|
if (!found)
|
|
|
|
{
|
|
|
|
/* initialize new entry with null pointer */
|
|
|
|
hash_entry->tsa_entry = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If entry is already valid, we're done.
|
|
|
|
*/
|
|
|
|
if (hash_entry->tsa_entry)
|
2017-03-27 17:34:42 +02:00
|
|
|
return hash_entry->tsa_entry;
|
2005-07-29 21:30:09 +02:00
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2017-05-15 04:52:41 +02:00
|
|
|
* Locate the first pgStatTabList entry with free space, making a new list
|
|
|
|
* entry if needed. Note that we could get an OOM failure here, but if so
|
|
|
|
* we have left the hashtable and the list in a consistent state.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2017-05-15 04:52:41 +02:00
|
|
|
if (pgStatTabList == NULL)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2017-05-15 04:52:41 +02:00
|
|
|
/* Set up first pgStatTabList entry */
|
|
|
|
pgStatTabList = (TabStatusArray *)
|
|
|
|
MemoryContextAllocZero(TopMemoryContext,
|
|
|
|
sizeof(TabStatusArray));
|
|
|
|
}
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2017-05-15 04:52:41 +02:00
|
|
|
tsa = pgStatTabList;
|
|
|
|
while (tsa->tsa_used >= TABSTAT_QUANTUM)
|
|
|
|
{
|
|
|
|
if (tsa->tsa_next == NULL)
|
|
|
|
tsa->tsa_next = (TabStatusArray *)
|
|
|
|
MemoryContextAllocZero(TopMemoryContext,
|
|
|
|
sizeof(TabStatusArray));
|
2017-03-27 17:34:42 +02:00
|
|
|
tsa = tsa->tsa_next;
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2017-05-15 04:52:41 +02:00
|
|
|
* Allocate a PgStat_TableStatus entry within this list entry. We assume
|
|
|
|
* the entry was already zeroed, either at creation or after last use.
|
2007-05-27 05:50:39 +02:00
|
|
|
*/
|
|
|
|
entry = &tsa->tsa_entries[tsa->tsa_used++];
|
|
|
|
entry->t_id = rel_id;
|
|
|
|
entry->t_shared = isshared;
|
2017-03-27 17:34:42 +02:00
|
|
|
|
|
|
|
/*
|
2017-05-15 04:52:41 +02:00
|
|
|
* Now we can fill the entry in pgStatTabHash.
|
2017-03-27 17:34:42 +02:00
|
|
|
*/
|
|
|
|
hash_entry->tsa_entry = entry;
|
2017-05-15 04:52:41 +02:00
|
|
|
|
2007-05-27 05:50:39 +02:00
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
|
2010-08-08 18:27:06 +02:00
|
|
|
/*
|
|
|
|
* find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
|
|
|
|
*
|
|
|
|
* If no entry, return NULL; don't create a new one
|
2017-05-15 04:52:41 +02:00
|
|
|
*
|
|
|
|
* Note: if we got an error in the most recent execution of pgstat_report_stat,
|
|
|
|
* it's possible that an entry exists but there's no hashtable entry for it.
|
|
|
|
* That's okay, we'll treat this case as "doesn't exist".
|
2010-08-08 18:27:06 +02:00
|
|
|
*/
|
|
|
|
PgStat_TableStatus *
|
|
|
|
find_tabstat_entry(Oid rel_id)
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
TabStatHashEntry *hash_entry;
|
2010-08-08 18:27:06 +02:00
|
|
|
|
2017-05-15 04:52:41 +02:00
|
|
|
/* If hashtable doesn't exist, there are no entries at all */
|
2017-05-17 22:31:56 +02:00
|
|
|
if (!pgStatTabHash)
|
2017-03-27 17:34:42 +02:00
|
|
|
return NULL;
|
|
|
|
|
|
|
|
hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
|
2017-05-17 22:31:56 +02:00
|
|
|
if (!hash_entry)
|
2017-03-27 17:34:42 +02:00
|
|
|
return NULL;
|
2010-08-08 18:27:06 +02:00
|
|
|
|
2017-05-15 04:52:41 +02:00
|
|
|
/* Note that this step could also return NULL, but that's correct */
|
2017-03-27 17:34:42 +02:00
|
|
|
return hash_entry->tsa_entry;
|
2010-08-08 18:27:06 +02:00
|
|
|
}
|
|
|
|
|
2007-05-27 05:50:39 +02:00
|
|
|
/*
|
|
|
|
* get_tabstat_stack_level - add a new (sub)transaction stack entry if needed
|
|
|
|
*/
|
|
|
|
static PgStat_SubXactStatus *
|
|
|
|
get_tabstat_stack_level(int nest_level)
|
|
|
|
{
|
|
|
|
PgStat_SubXactStatus *xact_state;
|
|
|
|
|
|
|
|
xact_state = pgStatXactStack;
|
|
|
|
if (xact_state == NULL || xact_state->nest_level != nest_level)
|
|
|
|
{
|
|
|
|
xact_state = (PgStat_SubXactStatus *)
|
|
|
|
MemoryContextAlloc(TopTransactionContext,
|
|
|
|
sizeof(PgStat_SubXactStatus));
|
|
|
|
xact_state->nest_level = nest_level;
|
|
|
|
xact_state->prev = pgStatXactStack;
|
|
|
|
xact_state->first = NULL;
|
|
|
|
pgStatXactStack = xact_state;
|
|
|
|
}
|
|
|
|
return xact_state;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* add_tabstat_xact_level - add a new (sub)transaction state record
|
|
|
|
*/
|
|
|
|
static void
|
2007-11-15 23:25:18 +01:00
|
|
|
add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
|
2007-05-27 05:50:39 +02:00
|
|
|
{
|
|
|
|
PgStat_SubXactStatus *xact_state;
|
|
|
|
PgStat_TableXactStatus *trans;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* If this is the first rel to be modified at the current nest level, we
|
|
|
|
* first have to push a transaction stack entry.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2007-05-27 05:50:39 +02:00
|
|
|
xact_state = get_tabstat_stack_level(nest_level);
|
|
|
|
|
|
|
|
/* Now make a per-table stack entry */
|
|
|
|
trans = (PgStat_TableXactStatus *)
|
|
|
|
MemoryContextAllocZero(TopTransactionContext,
|
|
|
|
sizeof(PgStat_TableXactStatus));
|
|
|
|
trans->nest_level = nest_level;
|
|
|
|
trans->upper = pgstat_info->trans;
|
|
|
|
trans->parent = pgstat_info;
|
|
|
|
trans->next = xact_state->first;
|
|
|
|
xact_state->first = trans;
|
|
|
|
pgstat_info->trans = trans;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2011-11-09 09:54:41 +01:00
|
|
|
* pgstat_count_heap_insert - count a tuple insertion of n tuples
|
2007-05-27 05:50:39 +02:00
|
|
|
*/
|
|
|
|
void
|
2017-03-18 22:49:06 +01:00
|
|
|
pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
|
2007-05-27 05:50:39 +02:00
|
|
|
{
|
|
|
|
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
|
|
|
|
|
2010-10-12 20:44:25 +02:00
|
|
|
if (pgstat_info != NULL)
|
2007-05-27 05:50:39 +02:00
|
|
|
{
|
Revise pgstat's tracking of tuple changes to improve the reliability of
decisions about when to auto-analyze.
The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
where all three of these numbers could be bad estimates from ANALYZE itself.
Even worse, in the presence of a steady flow of HOT updates and matching
HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
three numbers are exactly right, because n_dead_tuples could hold steady.
To fix, replace last_anl_tuples with an accurately tracked count of the total
number of committed tuple inserts + updates + deletes since the last ANALYZE
on the table. This can still be compared to the same threshold as before, but
it's much more trustworthy than the old computation. Tracking this requires
one more intra-transaction counter per modified table within backends, but no
additional memory space in the stats collector. There probably isn't any
measurable speed difference; if anything it might be a bit faster than before,
since I was able to eliminate some per-tuple arithmetic operations in favor of
adding sums once per (sub)transaction.
Also, simplify the logic around pgstat vacuum and analyze reporting messages
by not trying to fold VACUUM ANALYZE into a single pgstat message.
The original thought behind this patch was to allow scheduling of analyzes
on parent tables by artificially inflating their changes_since_analyze count.
I've left that for a separate patch since this change seems to stand on its
own merit.
2009-12-30 21:32:14 +01:00
|
|
|
/* We have to log the effect at the proper transactional level */
|
2007-11-15 22:14:46 +01:00
|
|
|
int nest_level = GetCurrentTransactionNestLevel();
|
2007-05-27 05:50:39 +02:00
|
|
|
|
|
|
|
if (pgstat_info->trans == NULL ||
|
|
|
|
pgstat_info->trans->nest_level != nest_level)
|
|
|
|
add_tabstat_xact_level(pgstat_info, nest_level);
|
|
|
|
|
2011-11-09 09:54:41 +01:00
|
|
|
pgstat_info->trans->tuples_inserted += n;
|
2007-05-27 05:50:39 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pgstat_count_heap_update - count a tuple update
|
|
|
|
*/
|
|
|
|
void
|
2007-09-20 19:56:33 +02:00
|
|
|
pgstat_count_heap_update(Relation rel, bool hot)
|
2007-05-27 05:50:39 +02:00
|
|
|
{
|
|
|
|
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
|
|
|
|
|
2010-10-12 20:44:25 +02:00
|
|
|
if (pgstat_info != NULL)
|
2007-05-27 05:50:39 +02:00
|
|
|
{
|
2009-12-30 21:32:14 +01:00
|
|
|
/* We have to log the effect at the proper transactional level */
|
2007-11-15 22:14:46 +01:00
|
|
|
int nest_level = GetCurrentTransactionNestLevel();
|
2007-05-27 05:50:39 +02:00
|
|
|
|
|
|
|
if (pgstat_info->trans == NULL ||
|
|
|
|
pgstat_info->trans->nest_level != nest_level)
|
|
|
|
add_tabstat_xact_level(pgstat_info, nest_level);
|
|
|
|
|
2009-12-30 21:32:14 +01:00
|
|
|
pgstat_info->trans->tuples_updated++;
|
|
|
|
|
|
|
|
/* t_tuples_hot_updated is nontransactional, so just advance it */
|
|
|
|
if (hot)
|
|
|
|
pgstat_info->t_counts.t_tuples_hot_updated++;
|
2007-05-27 05:50:39 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pgstat_count_heap_delete - count a tuple deletion
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
pgstat_count_heap_delete(Relation rel)
|
|
|
|
{
|
|
|
|
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
|
|
|
|
|
2010-10-12 20:44:25 +02:00
|
|
|
if (pgstat_info != NULL)
|
2007-05-27 05:50:39 +02:00
|
|
|
{
|
2009-12-30 21:32:14 +01:00
|
|
|
/* We have to log the effect at the proper transactional level */
|
2007-11-15 22:14:46 +01:00
|
|
|
int nest_level = GetCurrentTransactionNestLevel();
|
2007-05-27 05:50:39 +02:00
|
|
|
|
|
|
|
if (pgstat_info->trans == NULL ||
|
|
|
|
pgstat_info->trans->nest_level != nest_level)
|
|
|
|
add_tabstat_xact_level(pgstat_info, nest_level);
|
|
|
|
|
|
|
|
pgstat_info->trans->tuples_deleted++;
|
|
|
|
}
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
2015-02-20 16:10:01 +01:00
|
|
|
/*
|
|
|
|
* pgstat_truncate_save_counters
|
|
|
|
*
|
|
|
|
* Whenever a table is truncated, we save its i/u/d counters so that they can
|
|
|
|
* be cleared, and if the (sub)xact that executed the truncate later aborts,
|
|
|
|
* the counters can be restored to the saved (pre-truncate) values. Note we do
|
|
|
|
* this on the first truncate in any particular subxact level only.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
pgstat_truncate_save_counters(PgStat_TableXactStatus *trans)
|
|
|
|
{
|
|
|
|
if (!trans->truncated)
|
|
|
|
{
|
|
|
|
trans->inserted_pre_trunc = trans->tuples_inserted;
|
|
|
|
trans->updated_pre_trunc = trans->tuples_updated;
|
|
|
|
trans->deleted_pre_trunc = trans->tuples_deleted;
|
|
|
|
trans->truncated = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pgstat_truncate_restore_counters - restore counters when a truncate aborts
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
pgstat_truncate_restore_counters(PgStat_TableXactStatus *trans)
|
|
|
|
{
|
|
|
|
if (trans->truncated)
|
|
|
|
{
|
|
|
|
trans->tuples_inserted = trans->inserted_pre_trunc;
|
|
|
|
trans->tuples_updated = trans->updated_pre_trunc;
|
|
|
|
trans->tuples_deleted = trans->deleted_pre_trunc;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pgstat_count_truncate - update tuple counters due to truncate
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
pgstat_count_truncate(Relation rel)
|
|
|
|
{
|
|
|
|
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
|
|
|
|
|
|
|
|
if (pgstat_info != NULL)
|
|
|
|
{
|
|
|
|
/* We have to log the effect at the proper transactional level */
|
|
|
|
int nest_level = GetCurrentTransactionNestLevel();
|
|
|
|
|
|
|
|
if (pgstat_info->trans == NULL ||
|
|
|
|
pgstat_info->trans->nest_level != nest_level)
|
|
|
|
add_tabstat_xact_level(pgstat_info, nest_level);
|
|
|
|
|
|
|
|
pgstat_truncate_save_counters(pgstat_info->trans);
|
|
|
|
pgstat_info->trans->tuples_inserted = 0;
|
|
|
|
pgstat_info->trans->tuples_updated = 0;
|
|
|
|
pgstat_info->trans->tuples_deleted = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2007-09-20 19:56:33 +02:00
|
|
|
/*
|
|
|
|
* pgstat_update_heap_dead_tuples - update dead-tuples count
|
|
|
|
*
|
|
|
|
* The semantics of this are that we are reporting the nontransactional
|
2009-12-30 21:32:14 +01:00
|
|
|
* recovery of "delta" dead tuples; so t_delta_dead_tuples decreases
|
2007-09-20 19:56:33 +02:00
|
|
|
* rather than increasing, and the change goes straight into the per-table
|
|
|
|
* counter, not into transactional state.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
pgstat_update_heap_dead_tuples(Relation rel, int delta)
|
|
|
|
{
|
|
|
|
PgStat_TableStatus *pgstat_info = rel->pgstat_info;
|
|
|
|
|
2010-10-12 20:44:25 +02:00
|
|
|
if (pgstat_info != NULL)
|
2009-12-30 21:32:14 +01:00
|
|
|
pgstat_info->t_counts.t_delta_dead_tuples -= delta;
|
2007-09-20 19:56:33 +02:00
|
|
|
}
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/* ----------
|
2007-05-27 05:50:39 +02:00
|
|
|
* AtEOXact_PgStat
|
2001-06-22 21:18:36 +02:00
|
|
|
*
|
2007-05-27 05:50:39 +02:00
|
|
|
* Called from access/transam/xact.c at top-level transaction commit/abort.
|
2001-06-22 21:18:36 +02:00
|
|
|
* ----------
|
|
|
|
*/
|
|
|
|
void
|
2019-04-10 04:54:15 +02:00
|
|
|
AtEOXact_PgStat(bool isCommit, bool parallel)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2007-05-27 05:50:39 +02:00
|
|
|
PgStat_SubXactStatus *xact_state;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2019-04-10 04:54:15 +02:00
|
|
|
/* Don't count parallel worker transaction stats */
|
|
|
|
if (!parallel)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Count transaction commit or abort. (We use counters, not just
|
|
|
|
* bools, in case the reporting message isn't sent right away.)
|
|
|
|
*/
|
|
|
|
if (isCommit)
|
|
|
|
pgStatXactCommit++;
|
|
|
|
else
|
|
|
|
pgStatXactRollback++;
|
|
|
|
}
|
2004-10-28 03:38:41 +02:00
|
|
|
|
2007-05-27 05:50:39 +02:00
|
|
|
/*
|
|
|
|
* Transfer transactional insert/update counts into the base tabstat
|
2007-11-15 22:14:46 +01:00
|
|
|
* entries. We don't bother to free any of the transactional state, since
|
|
|
|
* it's all in TopTransactionContext and will go away anyway.
|
2007-05-27 05:50:39 +02:00
|
|
|
*/
|
|
|
|
xact_state = pgStatXactStack;
|
|
|
|
if (xact_state != NULL)
|
2003-08-12 18:21:18 +02:00
|
|
|
{
|
2007-05-27 05:50:39 +02:00
|
|
|
PgStat_TableXactStatus *trans;
|
|
|
|
|
|
|
|
Assert(xact_state->nest_level == 1);
|
|
|
|
Assert(xact_state->prev == NULL);
|
|
|
|
for (trans = xact_state->first; trans != NULL; trans = trans->next)
|
|
|
|
{
|
|
|
|
PgStat_TableStatus *tabstat;
|
|
|
|
|
|
|
|
Assert(trans->nest_level == 1);
|
|
|
|
Assert(trans->upper == NULL);
|
|
|
|
tabstat = trans->parent;
|
|
|
|
Assert(tabstat->trans == trans);
|
2015-02-20 16:10:01 +01:00
|
|
|
/* restore pre-truncate stats (if any) in case of aborted xact */
|
|
|
|
if (!isCommit)
|
|
|
|
pgstat_truncate_restore_counters(trans);
|
2009-12-30 21:32:14 +01:00
			/* count attempted actions regardless of commit/abort */
			tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
			tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
			tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
			if (isCommit)
			{
				tabstat->t_counts.t_truncated = trans->truncated;
				if (trans->truncated)
				{
					/* forget live/dead stats seen by backend thus far */
					tabstat->t_counts.t_delta_live_tuples = 0;
					tabstat->t_counts.t_delta_dead_tuples = 0;
				}
				/* insert adds a live tuple, delete removes one */
				tabstat->t_counts.t_delta_live_tuples +=
					trans->tuples_inserted - trans->tuples_deleted;
				/* update and delete each create a dead tuple */
				tabstat->t_counts.t_delta_dead_tuples +=
					trans->tuples_updated + trans->tuples_deleted;

				/* insert, update, delete each count as one change event */
				tabstat->t_counts.t_changed_tuples +=
					trans->tuples_inserted + trans->tuples_updated +
					trans->tuples_deleted;
			}
			else
			{
				/* inserted tuples are dead, deleted tuples are unaffected */
				tabstat->t_counts.t_delta_dead_tuples +=
					trans->tuples_inserted + trans->tuples_updated;
				/* an aborted xact generates no changed_tuple events */
			}
			tabstat->trans = NULL;
		}
	}
	pgStatXactStack = NULL;

	/* Make sure any stats snapshot is thrown away */
	pgstat_clear_snapshot();
}

/* ----------
 * AtEOSubXact_PgStat
 *
 *	Called from access/transam/xact.c at subtransaction commit/abort.
 * ----------
 */
void
AtEOSubXact_PgStat(bool isCommit, int nestDepth)
{
	PgStat_SubXactStatus *xact_state;

	/*
	 * Transfer transactional insert/update counts into the next higher
	 * subtransaction state.
	 */
	xact_state = pgStatXactStack;
	if (xact_state != NULL &&
		xact_state->nest_level >= nestDepth)
	{
		PgStat_TableXactStatus *trans;
		PgStat_TableXactStatus *next_trans;

		/* delink xact_state from stack immediately to simplify reuse case */
		pgStatXactStack = xact_state->prev;

		for (trans = xact_state->first; trans != NULL; trans = next_trans)
		{
			PgStat_TableStatus *tabstat;

			next_trans = trans->next;
			Assert(trans->nest_level == nestDepth);
			tabstat = trans->parent;
			Assert(tabstat->trans == trans);
			if (isCommit)
			{
				if (trans->upper && trans->upper->nest_level == nestDepth - 1)
				{
					if (trans->truncated)
					{
						/* propagate the truncate status one level up */
						pgstat_truncate_save_counters(trans->upper);
						/* replace upper xact stats with ours */
						trans->upper->tuples_inserted = trans->tuples_inserted;
						trans->upper->tuples_updated = trans->tuples_updated;
						trans->upper->tuples_deleted = trans->tuples_deleted;
					}
					else
					{
						trans->upper->tuples_inserted += trans->tuples_inserted;
						trans->upper->tuples_updated += trans->tuples_updated;
						trans->upper->tuples_deleted += trans->tuples_deleted;
					}
					tabstat->trans = trans->upper;
					pfree(trans);
				}
				else
				{
					/*
					 * When there isn't an immediate parent state, we can
					 * just reuse the record instead of going through a
					 * palloc/pfree pushup (this works since it's all in
					 * TopTransactionContext anyway).  We have to re-link it
					 * into the parent level, though, and that might mean
					 * pushing a new entry into the pgStatXactStack.
					 */
					PgStat_SubXactStatus *upper_xact_state;

					upper_xact_state = get_tabstat_stack_level(nestDepth - 1);
					trans->next = upper_xact_state->first;
					upper_xact_state->first = trans;
					trans->nest_level = nestDepth - 1;
				}
			}
			else
			{
				/*
				 * On abort, update top-level tabstat counts, then forget the
				 * subtransaction
				 */
				/* first restore values obliterated by truncate */
				pgstat_truncate_restore_counters(trans);
				/* count attempted actions regardless of commit/abort */
				tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
				tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
				tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;

				/* inserted tuples are dead, deleted tuples are unaffected */
				tabstat->t_counts.t_delta_dead_tuples +=
					trans->tuples_inserted + trans->tuples_updated;
				tabstat->trans = trans->upper;
				pfree(trans);
			}
		}
		pfree(xact_state);
	}
}


/*
 * AtPrepare_PgStat
 *		Save the transactional stats state at 2PC transaction prepare.
 *
 * In this phase we just generate 2PC records for all the pending
 * transaction-dependent stats work.
 */
void
AtPrepare_PgStat(void)
{
	PgStat_SubXactStatus *xact_state;

	xact_state = pgStatXactStack;
	if (xact_state != NULL)
	{
		PgStat_TableXactStatus *trans;

		Assert(xact_state->nest_level == 1);
		Assert(xact_state->prev == NULL);
		for (trans = xact_state->first; trans != NULL; trans = trans->next)
		{
			PgStat_TableStatus *tabstat;
			TwoPhasePgStatRecord record;

			Assert(trans->nest_level == 1);
			Assert(trans->upper == NULL);
			tabstat = trans->parent;
			Assert(tabstat->trans == trans);

			record.tuples_inserted = trans->tuples_inserted;
			record.tuples_updated = trans->tuples_updated;
			record.tuples_deleted = trans->tuples_deleted;
			record.inserted_pre_trunc = trans->inserted_pre_trunc;
			record.updated_pre_trunc = trans->updated_pre_trunc;
			record.deleted_pre_trunc = trans->deleted_pre_trunc;
			record.t_id = tabstat->t_id;
			record.t_shared = tabstat->t_shared;
			record.t_truncated = trans->truncated;

			RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
								   &record, sizeof(TwoPhasePgStatRecord));
		}
	}
}

/*
 * PostPrepare_PgStat
 *		Clean up after successful PREPARE.
 *
 * All we need do here is unlink the transaction stats state from the
 * nontransactional state.  The nontransactional action counts will be
 * reported to the stats collector immediately, while the effects on live
 * and dead tuple counts are preserved in the 2PC state file.
 *
 * Note: AtEOXact_PgStat is not called during PREPARE.
 */
void
PostPrepare_PgStat(void)
{
	PgStat_SubXactStatus *xact_state;

	/*
	 * We don't bother to free any of the transactional state, since it's
	 * all in TopTransactionContext and will go away anyway.
	 */
	xact_state = pgStatXactStack;
	if (xact_state != NULL)
	{
		PgStat_TableXactStatus *trans;

		for (trans = xact_state->first; trans != NULL; trans = trans->next)
		{
			PgStat_TableStatus *tabstat;

			tabstat = trans->parent;
			tabstat->trans = NULL;
		}
	}
	pgStatXactStack = NULL;

	/* Make sure any stats snapshot is thrown away */
	pgstat_clear_snapshot();
}


/*
 * 2PC processing routine for COMMIT PREPARED case.
 *
 * Load the saved counts into our local pgstats state.
 */
void
pgstat_twophase_postcommit(TransactionId xid, uint16 info,
						   void *recdata, uint32 len)
{
	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
	PgStat_TableStatus *pgstat_info;

	/* Find or create a tabstat entry for the rel */
	pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);

	/* Same math as in AtEOXact_PgStat, commit case */
	pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
	pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
	pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
	pgstat_info->t_counts.t_truncated = rec->t_truncated;
	if (rec->t_truncated)
	{
		/* forget live/dead stats seen by backend thus far */
		pgstat_info->t_counts.t_delta_live_tuples = 0;
		pgstat_info->t_counts.t_delta_dead_tuples = 0;
	}
	pgstat_info->t_counts.t_delta_live_tuples +=
		rec->tuples_inserted - rec->tuples_deleted;
	pgstat_info->t_counts.t_delta_dead_tuples +=
		rec->tuples_updated + rec->tuples_deleted;
	pgstat_info->t_counts.t_changed_tuples +=
		rec->tuples_inserted + rec->tuples_updated +
		rec->tuples_deleted;
}


/*
 * 2PC processing routine for ROLLBACK PREPARED case.
 *
 * Load the saved counts into our local pgstats state, but treat them
 * as aborted.
 */
void
pgstat_twophase_postabort(TransactionId xid, uint16 info,
						  void *recdata, uint32 len)
{
	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
	PgStat_TableStatus *pgstat_info;

	/* Find or create a tabstat entry for the rel */
	pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);

	/* Same math as in AtEOXact_PgStat, abort case */
	if (rec->t_truncated)
	{
		rec->tuples_inserted = rec->inserted_pre_trunc;
		rec->tuples_updated = rec->updated_pre_trunc;
		rec->tuples_deleted = rec->deleted_pre_trunc;
	}
	pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
	pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
	pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
	pgstat_info->t_counts.t_delta_dead_tuples +=
		rec->tuples_inserted + rec->tuples_updated;
}

/* ----------
 * pgstat_fetch_stat_dbentry() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	the collected statistics for one database or NULL.  NULL doesn't mean
 *	that the database doesn't exist, it is just not yet known by the
 *	collector, so the caller is better off to report ZERO instead.
 * ----------
 */
PgStat_StatDBEntry *
pgstat_fetch_stat_dbentry(Oid dbid)
{
	/*
	 * If not done for this transaction, read the statistics collector stats
	 * file into some hash tables.
	 */
	backend_read_statsfile();

	/*
	 * Lookup the requested database; return NULL if not found
	 */
	return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
											  (void *) &dbid,
											  HASH_FIND, NULL);
}


/* ----------
 * pgstat_fetch_stat_tabentry() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	the collected statistics for one table or NULL.  NULL doesn't mean
 *	that the table doesn't exist, it is just not yet known by the
 *	collector, so the caller is better off to report ZERO instead.
 * ----------
 */
PgStat_StatTabEntry *
pgstat_fetch_stat_tabentry(Oid relid)
{
	Oid			dbid;
	PgStat_StatDBEntry *dbentry;
	PgStat_StatTabEntry *tabentry;

	/*
	 * If not done for this transaction, read the statistics collector stats
	 * file into some hash tables.
	 */
	backend_read_statsfile();

	/*
	 * Lookup our database, then look in its table hash table.
	 */
	dbid = MyDatabaseId;
	dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
												 (void *) &dbid,
												 HASH_FIND, NULL);
	if (dbentry != NULL && dbentry->tables != NULL)
	{
		tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
													   (void *) &relid,
													   HASH_FIND, NULL);
		if (tabentry)
			return tabentry;
	}

	/*
	 * If we didn't find it, maybe it's a shared table.
	 */
	dbid = InvalidOid;
	dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
												 (void *) &dbid,
												 HASH_FIND, NULL);
	if (dbentry != NULL && dbentry->tables != NULL)
	{
		tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
													   (void *) &relid,
													   HASH_FIND, NULL);
		if (tabentry)
			return tabentry;
	}

	return NULL;
}

/* ----------
 * pgstat_fetch_stat_funcentry() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	the collected statistics for one function or NULL.
 * ----------
 */
PgStat_StatFuncEntry *
pgstat_fetch_stat_funcentry(Oid func_id)
{
	PgStat_StatDBEntry *dbentry;
	PgStat_StatFuncEntry *funcentry = NULL;

	/* load the stats file if needed */
	backend_read_statsfile();

	/* Lookup our database, then find the requested function. */
	dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
	if (dbentry != NULL && dbentry->functions != NULL)
	{
		funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
														 (void *) &func_id,
														 HASH_FIND, NULL);
	}

	return funcentry;
}

/* ----------
 * pgstat_fetch_stat_beentry() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	our local copy of the current-activity entry for one backend.
 *
 *	NB: caller is responsible for a check if the user is permitted to see
 *	this info (especially the querystring).
 * ----------
 */
PgBackendStatus *
pgstat_fetch_stat_beentry(int beid)
{
	pgstat_read_current_status();

	if (beid < 1 || beid > localNumBackends)
		return NULL;

	return &localBackendStatusTable[beid - 1].backendStatus;
}

/* ----------
 * pgstat_fetch_stat_local_beentry() -
 *
 *	Like pgstat_fetch_stat_beentry() but with locally computed additions
 *	(like xid and xmin values of the backend)
 *
 *	NB: caller is responsible for a check if the user is permitted to see
 *	this info (especially the querystring).
 * ----------
 */
LocalPgBackendStatus *
pgstat_fetch_stat_local_beentry(int beid)
{
	pgstat_read_current_status();

	if (beid < 1 || beid > localNumBackends)
		return NULL;

	return &localBackendStatusTable[beid - 1];
}

/* ----------
 * pgstat_fetch_stat_numbackends() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	the maximum current backend id.
 * ----------
 */
int
pgstat_fetch_stat_numbackends(void)
{
	pgstat_read_current_status();

	return localNumBackends;
}

/*
 * ---------
 * pgstat_fetch_stat_archiver() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	a pointer to the archiver statistics struct.
 * ---------
 */
PgStat_ArchiverStats *
pgstat_fetch_stat_archiver(void)
{
	backend_read_statsfile();

	return &archiverStats;
}

/*
 * ---------
 * pgstat_fetch_global() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	a pointer to the global statistics struct.
 * ---------
 */
PgStat_GlobalStats *
pgstat_fetch_global(void)
{
	backend_read_statsfile();

	return &globalStats;
}

/*
 * ---------
 * pgstat_fetch_stat_wal() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	a pointer to the WAL statistics struct.
 * ---------
 */
PgStat_WalStats *
pgstat_fetch_stat_wal(void)
{
	backend_read_statsfile();

	return &walStats;
}

/*
 * ---------
 * pgstat_fetch_slru() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	a pointer to the slru statistics struct.
 * ---------
 */
PgStat_SLRUStats *
pgstat_fetch_slru(void)
{
	backend_read_statsfile();

	return slruStats;
}

/*
 * ---------
 * pgstat_fetch_replslot() -
 *
 *	Support function for the SQL-callable pgstat* functions.  Returns
 *	a pointer to the replication slot statistics struct and sets the
 *	number of entries in nslots_p.
 * ---------
 */
PgStat_ReplSlotStats *
pgstat_fetch_replslot(int *nslots_p)
{
	backend_read_statsfile();

	*nslots_p = nReplSlotStats;
	return replSlotStats;
}


/* ------------------------------------------------------------
 * Functions for management of the shared-memory PgBackendStatus array
 * ------------------------------------------------------------
 */

static PgBackendStatus *BackendStatusArray = NULL;
static PgBackendStatus *MyBEEntry = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
static Size BackendActivityBufferSize = 0;
#ifdef USE_SSL
static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
#endif
#ifdef ENABLE_GSS
static PgBackendGSSStatus *BackendGssStatusBuffer = NULL;
#endif

/*
 * Report shared-memory space needed by CreateSharedBackendStatus.
 */
Size
BackendStatusShmemSize(void)
{
	Size		size;

	/* BackendStatusArray: */
	size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
	/* BackendAppnameBuffer: */
	size = add_size(size,
					mul_size(NAMEDATALEN, NumBackendStatSlots));
	/* BackendClientHostnameBuffer: */
	size = add_size(size,
					mul_size(NAMEDATALEN, NumBackendStatSlots));
	/* BackendActivityBuffer: */
	size = add_size(size,
					mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
#ifdef USE_SSL
	/* BackendSslStatusBuffer: */
	size = add_size(size,
					mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
#endif
#ifdef ENABLE_GSS
	/* BackendGssStatusBuffer: */
	size = add_size(size,
					mul_size(sizeof(PgBackendGSSStatus), NumBackendStatSlots));
#endif
	return size;
}

/*
 * Initialize the shared status array and several string buffers
 * during postmaster startup.
 */
void
CreateSharedBackendStatus(void)
{
	Size		size;
	bool		found;
	int			i;
	char	   *buffer;

	/* Create or attach to the shared array */
	size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
	BackendStatusArray = (PgBackendStatus *)
		ShmemInitStruct("Backend Status Array", size, &found);

	if (!found)
	{
		/*
		 * We're the first - initialize.
		 */
		MemSet(BackendStatusArray, 0, size);
	}

	/* Create or attach to the shared appname buffer */
	size = mul_size(NAMEDATALEN, NumBackendStatSlots);
	BackendAppnameBuffer = (char *)
		ShmemInitStruct("Backend Application Name Buffer", size, &found);

	if (!found)
	{
		MemSet(BackendAppnameBuffer, 0, size);

		/* Initialize st_appname pointers. */
		buffer = BackendAppnameBuffer;
		for (i = 0; i < NumBackendStatSlots; i++)
		{
			BackendStatusArray[i].st_appname = buffer;
			buffer += NAMEDATALEN;
		}
	}

	/* Create or attach to the shared client hostname buffer */
	size = mul_size(NAMEDATALEN, NumBackendStatSlots);
	BackendClientHostnameBuffer = (char *)
		ShmemInitStruct("Backend Client Host Name Buffer", size, &found);

	if (!found)
	{
		MemSet(BackendClientHostnameBuffer, 0, size);

		/* Initialize st_clienthostname pointers. */
		buffer = BackendClientHostnameBuffer;
		for (i = 0; i < NumBackendStatSlots; i++)
		{
			BackendStatusArray[i].st_clienthostname = buffer;
			buffer += NAMEDATALEN;
		}
	}

	/* Create or attach to the shared activity buffer */
	BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
										 NumBackendStatSlots);
	BackendActivityBuffer = (char *)
		ShmemInitStruct("Backend Activity Buffer",
						BackendActivityBufferSize,
						&found);

	if (!found)
	{
		MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);

		/* Initialize st_activity pointers. */
		buffer = BackendActivityBuffer;
		for (i = 0; i < NumBackendStatSlots; i++)
		{
			BackendStatusArray[i].st_activity_raw = buffer;
			buffer += pgstat_track_activity_query_size;
		}
	}

#ifdef USE_SSL
	/* Create or attach to the shared SSL status buffer */
	size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
	BackendSslStatusBuffer = (PgBackendSSLStatus *)
		ShmemInitStruct("Backend SSL Status Buffer", size, &found);

	if (!found)
	{
		PgBackendSSLStatus *ptr;

		MemSet(BackendSslStatusBuffer, 0, size);

		/* Initialize st_sslstatus pointers. */
		ptr = BackendSslStatusBuffer;
		for (i = 0; i < NumBackendStatSlots; i++)
		{
			BackendStatusArray[i].st_sslstatus = ptr;
			ptr++;
		}
	}
#endif

#ifdef ENABLE_GSS
	/* Create or attach to the shared GSSAPI status buffer */
	size = mul_size(sizeof(PgBackendGSSStatus), NumBackendStatSlots);
	BackendGssStatusBuffer = (PgBackendGSSStatus *)
		ShmemInitStruct("Backend GSS Status Buffer", size, &found);

	if (!found)
	{
		PgBackendGSSStatus *ptr;

		MemSet(BackendGssStatusBuffer, 0, size);

		/* Initialize st_gssstatus pointers. */
		ptr = BackendGssStatusBuffer;
		for (i = 0; i < NumBackendStatSlots; i++)
		{
			BackendStatusArray[i].st_gssstatus = ptr;
			ptr++;
		}
	}
#endif
}

/* ----------
 * pgstat_initialize() -
 *
 *	Initialize pgstats state, and set up our on-proc-exit hook.
 *	Called from InitPostgres and AuxiliaryProcessMain.  For an auxiliary
 *	process, MyBackendId is invalid.  Otherwise, MyBackendId must be set,
 *	but we must not have started any transaction yet (since the
 *	exit hook must run after the last transaction exit).
 *	NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
 * ----------
 */
void
pgstat_initialize(void)
{
	/* Initialize MyBEEntry */
	if (MyBackendId != InvalidBackendId)
	{
		Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
		MyBEEntry = &BackendStatusArray[MyBackendId - 1];
	}
	else
	{
		/* Must be an auxiliary process */
		Assert(MyAuxProcType != NotAnAuxProcess);

		/*
		 * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
		 * have a BackendId, the slot is statically allocated based on the
		 * auxiliary process type (MyAuxProcType).  Backends use slots indexed
		 * in the range from 1 to MaxBackends (inclusive), so we use
		 * MaxBackends + AuxBackendType + 1 as the index of the slot for an
		 * auxiliary process.
		 */
		MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
	}

	/*
	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
	 * calculate how much the pgWalUsage counters are increased by subtracting
	 * prevWalUsage from pgWalUsage.
	 */
	prevWalUsage = pgWalUsage;

	/* Set up a process-exit hook to clean up */
	on_shmem_exit(pgstat_beshutdown_hook, 0);
}

/* ----------
 * pgstat_bestart() -
 *
 *	Initialize this backend's entry in the PgBackendStatus array.
 *	Called from InitPostgres.
 *
 *	Apart from auxiliary processes, MyBackendId, MyDatabaseId,
 *	session userid, and application_name must be set for a
 *	backend (hence, this cannot be combined with pgstat_initialize).
 *	Note also that we must be inside a transaction if this isn't an aux
 *	process, as we may need to do encoding conversion on some strings.
 * ----------
 */
void
pgstat_bestart(void)
{
	volatile PgBackendStatus *vbeentry = MyBEEntry;
	PgBackendStatus lbeentry;
#ifdef USE_SSL
	PgBackendSSLStatus lsslstatus;
#endif
#ifdef ENABLE_GSS
	PgBackendGSSStatus lgssstatus;
#endif
pgstat_read_current_status to copy st_gssstatus data from shared memory to
local memory. Hence, subsequent use of that data within the transaction
would potentially see changing data that it shouldn't see.
Discussion: https://postgr.es/m/CAPR3Wj5Z17=+eeyrn_ZDG3NQGYgMEOY6JV6Y-WRRhGgwc16U3Q@mail.gmail.com
2019-05-12 03:27:13 +02:00
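The commit message above describes a seqlock-style protocol: writers bump st_changecount to an odd value, update the entry, and bump it back to even, while readers retry until they see a stable even count; the fix moves all fallible work into a local struct so the odd-count window is a plain copy. Below is a minimal single-threaded sketch of that idea. The names (DemoStatus, demo_publish, demo_read) are illustrative only, not the real PgBackendStatus API; the actual writer-side code uses the renamed PGSTAT_BEGIN_WRITE_ACTIVITY / PGSTAT_END_WRITE_ACTIVITY macros.

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-in for a shared status entry (not PgBackendStatus). */
typedef struct DemoStatus
{
	int			changecount;	/* even = stable, odd = update in progress */
	int			pid;
	char		activity[32];
} DemoStatus;

/*
 * Writer side: the entry is fully computed in a local variable first, so
 * the window with an odd changecount (the "critical section") contains
 * nothing but straight-line copying that cannot throw an error.
 */
static void
demo_publish(DemoStatus *shared, const DemoStatus *local)
{
	int			saved = shared->changecount;

	shared->changecount = saved + 1;	/* odd: readers will retry */
	shared->pid = local->pid;
	memcpy(shared->activity, local->activity, sizeof(shared->activity));
	shared->changecount = saved + 2;	/* even again: entry is consistent */
}

/* Reader side: retry until the count is even and unchanged across the copy. */
static DemoStatus
demo_read(const DemoStatus *shared)
{
	DemoStatus	copy;
	int			before;

	do
	{
		before = shared->changecount;
		copy = *shared;
	} while (before % 2 != 0 || before != shared->changecount);

	return copy;
}
```

If an error were thrown between the two increments in demo_publish, the count would stay odd and demo_read would spin forever, which is exactly the lockup the adjusted macros now convert into a PANIC-and-restart instead.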
	/* pgstats state must be initialized from pgstat_initialize() */
	Assert(vbeentry != NULL);
	/*
	 * To minimize the time spent modifying the PgBackendStatus entry, and
	 * avoid risk of errors inside the critical section, we first copy the
	 * shared-memory struct to a local variable, then modify the data in the
	 * local variable, then copy the local variable back to shared memory.
	 * Only the last step has to be inside the critical section.
	 *
	 * Most of the data we copy from shared memory is just going to be
	 * overwritten, but the struct's not so large that it's worth the
	 * maintenance hassle to copy only the needful fields.
	 */
	memcpy(&lbeentry,
		   unvolatize(PgBackendStatus *, vbeentry),
		   sizeof(PgBackendStatus));
	/* These structs can just start from zeroes each time, though */
#ifdef USE_SSL
	memset(&lsslstatus, 0, sizeof(lsslstatus));
#endif
#ifdef ENABLE_GSS
	memset(&lgssstatus, 0, sizeof(lgssstatus));
#endif
	/*
	 * Now fill in all the fields of lbeentry, except for strings that are
	 * out-of-line data.  Those have to be handled separately, below.
	 */
	lbeentry.st_procpid = MyProcPid;
	lbeentry.st_backendType = MyBackendType;
	lbeentry.st_proc_start_timestamp = MyStartTimestamp;
	lbeentry.st_activity_start_timestamp = 0;
	lbeentry.st_state_start_timestamp = 0;
	lbeentry.st_xact_start_timestamp = 0;
	lbeentry.st_databaseid = MyDatabaseId;

	/* We have userid for client-backends, wal-sender and bgworker processes */
	if (lbeentry.st_backendType == B_BACKEND
		|| lbeentry.st_backendType == B_WAL_SENDER
		|| lbeentry.st_backendType == B_BG_WORKER)
		lbeentry.st_userid = GetSessionUserId();
	else
		lbeentry.st_userid = InvalidOid;
	/*
	 * We may not have a MyProcPort (eg, if this is the autovacuum process).
	 * If so, use all-zeroes client address, which is dealt with specially in
	 * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
	 */
	if (MyProcPort)
		memcpy(&lbeentry.st_clientaddr, &MyProcPort->raddr,
			   sizeof(lbeentry.st_clientaddr));
	else
Rearrange pgstat_bestart() to avoid failures within its critical section.
We long ago decided to design the shared PgBackendStatus data structure to
minimize the cost of writing status updates, which means that writers just
have to increment the st_changecount field twice. That isn't hooked into
any sort of resource management mechanism, which means that if something
were to throw error between the two increments, the st_changecount field
would be left odd indefinitely. That would cause readers to lock up.
Now, since it's also a bad idea to leave the field odd for longer than
absolutely necessary (because readers will spin while we have it set),
the expectation was that we'd treat these segments like spinlock critical
sections, with only short, more or less straight-line, code in them.
That was fine as originally designed, but commit 9029f4b37 broke it
by inserting a significant amount of non-straight-line code into
pgstat_bestart(), code that is very capable of throwing errors, not to
mention taking a significant amount of time during which readers will spin.
We have a report from Neeraj Kumar of readers actually locking up, which
I suspect was due to an encoding conversion error in X509_NAME_to_cstring,
though conceivably it was just a garden-variety OOM failure.
Subsequent commits have loaded even more dubious code into pgstat_bestart's
critical section (and commit fc70a4b0d deserves some kind of booby prize
for managing to miss the critical section entirely, although the negative
consequences seem minimal given that the PgBackendStatus entry should be
seen by readers as inactive at that point).
The right way to fix this mess seems to be to compute all these values
into a local copy of the process' PgBackendStatus struct, and then just
copy the data back within the critical section proper. This plan can't
be implemented completely cleanly because of the struct's heavy reliance
on out-of-line strings, which we must initialize separately within the
critical section. But still, the critical section is far smaller and
safer than it was before.
In hopes of forestalling future errors of the same ilk, rename the
macros for st_changecount management to make it more apparent that
the writer-side macros create a critical section. And to prevent
the worst consequences if we nonetheless manage to mess it up anyway,
adjust those macros so that they really are a critical section, ie
they now bump CritSectionCount. That doesn't add much overhead, and
it guarantees that if we do somehow throw an error while the counter
is odd, it will lead to PANIC and a database restart to reset shared
memory.
Back-patch to 9.5 where the problem was introduced.
In HEAD, also fix an oversight in commit b0b39f72b: it failed to teach
pgstat_read_current_status to copy st_gssstatus data from shared memory to
local memory. Hence, subsequent use of that data within the transaction
would potentially see changing data that it shouldn't see.
Discussion: https://postgr.es/m/CAPR3Wj5Z17=+eeyrn_ZDG3NQGYgMEOY6JV6Y-WRRhGgwc16U3Q@mail.gmail.com
2019-05-12 03:27:13 +02:00
	MemSet(&lbeentry.st_clientaddr, 0, sizeof(lbeentry.st_clientaddr));

#ifdef USE_SSL
	if (MyProcPort && MyProcPort->ssl_in_use)
	{
		lbeentry.st_ssl = true;
		lsslstatus.ssl_bits = be_tls_get_cipher_bits(MyProcPort);
		lsslstatus.ssl_compression = be_tls_get_compression(MyProcPort);
		strlcpy(lsslstatus.ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
		strlcpy(lsslstatus.ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
		be_tls_get_peer_subject_name(MyProcPort, lsslstatus.ssl_client_dn, NAMEDATALEN);
		be_tls_get_peer_serial(MyProcPort, lsslstatus.ssl_client_serial, NAMEDATALEN);
		be_tls_get_peer_issuer_name(MyProcPort, lsslstatus.ssl_issuer_dn, NAMEDATALEN);
	}
	else
	{
		lbeentry.st_ssl = false;
	}
#else
	lbeentry.st_ssl = false;
#endif

GSSAPI encryption support
On both the frontend and backend, prepare for GSSAPI encryption
support by moving common code for error handling into a separate file.
Fix a TODO for handling multiple status messages in the process.
Eliminate the OIDs, which have not been needed for some time.
Add frontend and backend encryption support functions. Keep the
context initiation for authentication-only separate on both the
frontend and backend in order to avoid concerns about changing the
requested flags to include encryption support.
In postmaster, pull GSSAPI authorization checking into a shared
function. Also share the initiator name between the encryption and
non-encryption codepaths.
For HBA, add "hostgssenc" and "hostnogssenc" entries that behave
similarly to their SSL counterparts. "hostgssenc" requires either
"gss", "trust", or "reject" for its authentication.
Similarly, add a "gssencmode" parameter to libpq. Supported values are
"disable", "require", and "prefer". Notably, negotiation will only be
attempted if credentials can be acquired. Move credential acquisition
into its own function to support this behavior.
Add a simple pg_stat_gssapi view similar to pg_stat_ssl, for monitoring
if GSSAPI authentication was used, what principal was used, and if
encryption is being used on the connection.
Finally, add documentation for everything new, and update existing
documentation on connection security.
Thanks to Michael Paquier for the Windows fixes.
Author: Robbie Harwood, with changes to the read/write functions by me.
Reviewed in various forms and at different times by: Michael Paquier,
Andres Freund, David Steele.
Discussion: https://www.postgresql.org/message-id/flat/jlg1tgq1ktm.fsf@thriss.redhat.com
2019-04-03 21:02:33 +02:00
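For reference, the HBA entry types introduced by the commit above could be used along these lines (addresses and host names are illustrative placeholders; per the commit, "hostgssenc" requires the "gss", "trust", or "reject" method):

```
# pg_hba.conf sketch
# TYPE         DATABASE  USER  ADDRESS      METHOD
hostgssenc     all       all   10.0.0.0/8   gss
hostnogssenc   all       all   10.0.0.0/8   reject
```

and a client might request encryption via the new libpq parameter, e.g. `psql "host=db.example.com gssencmode=require"`.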
#ifdef ENABLE_GSS
	if (MyProcPort && MyProcPort->gss != NULL)
	{
		const char *princ = be_gssapi_get_princ(MyProcPort);

		lbeentry.st_gss = true;
		lgssstatus.gss_auth = be_gssapi_get_auth(MyProcPort);
		lgssstatus.gss_enc = be_gssapi_get_enc(MyProcPort);
		if (princ)
			strlcpy(lgssstatus.gss_princ, princ, NAMEDATALEN);
	}
	else
	{
		lbeentry.st_gss = false;
	}
#else
	lbeentry.st_gss = false;
#endif

	lbeentry.st_state = STATE_UNDEFINED;
	lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
	lbeentry.st_progress_command_target = InvalidOid;

Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.
As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else. A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting. But this gets the basic framework in place.
Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 18:08:58 +01:00
|
|
|
/*
|
|
|
|
* we don't zero st_progress_param here to save cycles; nobody should
|
|
|
|
* examine it until st_progress_command has been set to something other
|
|
|
|
* than PROGRESS_COMMAND_INVALID
|
|
|
|
*/
|
2006-06-19 03:51:22 +02:00
|
|
|
|
Rearrange pgstat_bestart() to avoid failures within its critical section.
We long ago decided to design the shared PgBackendStatus data structure to
minimize the cost of writing status updates, which means that writers just
have to increment the st_changecount field twice. That isn't hooked into
any sort of resource management mechanism, which means that if something
were to throw error between the two increments, the st_changecount field
would be left odd indefinitely. That would cause readers to lock up.
Now, since it's also a bad idea to leave the field odd for longer than
absolutely necessary (because readers will spin while we have it set),
the expectation was that we'd treat these segments like spinlock critical
sections, with only short, more or less straight-line, code in them.
That was fine as originally designed, but commit 9029f4b37 broke it
by inserting a significant amount of non-straight-line code into
pgstat_bestart(), code that is very capable of throwing errors, not to
mention taking a significant amount of time during which readers will spin.
We have a report from Neeraj Kumar of readers actually locking up, which
I suspect was due to an encoding conversion error in X509_NAME_to_cstring,
though conceivably it was just a garden-variety OOM failure.
Subsequent commits have loaded even more dubious code into pgstat_bestart's
critical section (and commit fc70a4b0d deserves some kind of booby prize
for managing to miss the critical section entirely, although the negative
consequences seem minimal given that the PgBackendStatus entry should be
seen by readers as inactive at that point).
The right way to fix this mess seems to be to compute all these values
into a local copy of the process' PgBackendStatus struct, and then just
copy the data back within the critical section proper. This plan can't
be implemented completely cleanly because of the struct's heavy reliance
on out-of-line strings, which we must initialize separately within the
critical section. But still, the critical section is far smaller and
safer than it was before.
In hopes of forestalling future errors of the same ilk, rename the
macros for st_changecount management to make it more apparent that
the writer-side macros create a critical section. And to prevent
the worst consequences if we nonetheless manage to mess it up anyway,
adjust those macros so that they really are a critical section, ie
they now bump CritSectionCount. That doesn't add much overhead, and
it guarantees that if we do somehow throw an error while the counter
is odd, it will lead to PANIC and a database restart to reset shared
memory.
Back-patch to 9.5 where the problem was introduced.
In HEAD, also fix an oversight in commit b0b39f72b: it failed to teach
pgstat_read_current_status to copy st_gssstatus data from shared memory to
local memory. Hence, subsequent use of that data within the transaction
would potentially see changing data that it shouldn't see.
Discussion: https://postgr.es/m/CAPR3Wj5Z17=+eeyrn_ZDG3NQGYgMEOY6JV6Y-WRRhGgwc16U3Q@mail.gmail.com
2019-05-12 03:27:13 +02:00
|
|
|
/*
|
|
|
|
* We're ready to enter the critical section that fills the shared-memory
|
|
|
|
* status entry. We follow the protocol of bumping st_changecount before
|
|
|
|
* and after; and make sure it's even afterwards. We use a volatile
|
|
|
|
* pointer here to ensure the compiler doesn't try to get cute.
|
|
|
|
*/
|
|
|
|
PGSTAT_BEGIN_WRITE_ACTIVITY(vbeentry);
|
|
|
|
|
|
|
|
/* make sure we'll memcpy the same st_changecount back */
|
|
|
|
lbeentry.st_changecount = vbeentry->st_changecount;
|
|
|
|
|
|
|
|
memcpy(unvolatize(PgBackendStatus *, vbeentry),
|
|
|
|
&lbeentry,
|
|
|
|
sizeof(PgBackendStatus));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We can write the out-of-line strings and structs using the pointers
|
|
|
|
* that are in lbeentry; this saves some de-volatilizing messiness.
|
|
|
|
*/
|
|
|
|
lbeentry.st_appname[0] = '\0';
|
|
|
|
if (MyProcPort && MyProcPort->remote_hostname)
|
|
|
|
strlcpy(lbeentry.st_clienthostname, MyProcPort->remote_hostname,
|
|
|
|
NAMEDATALEN);
|
|
|
|
else
|
|
|
|
lbeentry.st_clienthostname[0] = '\0';
|
|
|
|
lbeentry.st_activity_raw[0] = '\0';
|
|
|
|
/* Also make sure the last byte in each string area is always 0 */
|
|
|
|
lbeentry.st_appname[NAMEDATALEN - 1] = '\0';
|
|
|
|
lbeentry.st_clienthostname[NAMEDATALEN - 1] = '\0';
|
|
|
|
lbeentry.st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
|
|
|
|
|
|
|
|
#ifdef USE_SSL
|
|
|
|
memcpy(lbeentry.st_sslstatus, &lsslstatus, sizeof(PgBackendSSLStatus));
|
|
|
|
#endif
|
|
|
|
#ifdef ENABLE_GSS
|
|
|
|
memcpy(lbeentry.st_gssstatus, &lgssstatus, sizeof(PgBackendGSSStatus));
|
|
|
|
#endif
|
|
|
|
|
|
|
|
PGSTAT_END_WRITE_ACTIVITY(vbeentry);
|
2009-11-29 00:38:08 +01:00
|
|
|
|
|
|
|
/* Update app name to current GUC setting */
|
|
|
|
if (application_name)
|
|
|
|
pgstat_report_appname(application_name);
|
2006-06-19 03:51:22 +02:00
|
|
|
}
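The protocol described in the commit message above is essentially a seqlock: the writer bumps st_changecount to an odd value before touching the entry and back to an even value afterwards, and a reader that observes an odd or changed counter retries. A minimal stand-alone sketch (illustrative names, not the real `PGSTAT_BEGIN_WRITE_ACTIVITY`/`PGSTAT_END_WRITE_ACTIVITY` macros, and without the CritSectionCount bookkeeping they also do) might look like:

```c
#include <assert.h>
#include <string.h>

/* A toy single-process model of the st_changecount protocol.  Real pgstat
 * code also needs memory barriers between the counter bumps and the data
 * accesses; they are omitted here for brevity. */
typedef struct
{
    volatile int changecount;
    char         activity[64];
} DemoEntry;

static void
demo_begin_write(DemoEntry *e)
{
    e->changecount++;           /* odd: a write is in progress */
}

static void
demo_end_write(DemoEntry *e)
{
    e->changecount++;           /* even again: entry is consistent */
}

/* Reader side: copy the entry, then check the counter was even and stable;
 * otherwise a writer was active and the snapshot may be torn, so retry. */
static int
demo_read(DemoEntry *e, DemoEntry *out)
{
    for (;;)
    {
        int         before = e->changecount;

        memcpy(out, e, sizeof(DemoEntry));
        if (before == e->changecount && (before & 1) == 0)
            return 1;           /* consistent snapshot */
    }
}
```

This also makes the failure mode concrete: if an error escapes between the two bumps, the counter stays odd forever and every reader spins in the retry loop, which is exactly why the writer-side macros now run as a real critical section.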
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Shut down a single backend's statistics reporting at process exit.
|
|
|
|
*
|
|
|
|
* Flush any remaining statistics counts out to the collector.
|
|
|
|
* Without this, operations triggered during backend exit (such as
|
|
|
|
* temp table deletions) won't be counted.
|
|
|
|
*
|
|
|
|
* Lastly, clear out our entry in the PgBackendStatus array.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
pgstat_beshutdown_hook(int code, Datum arg)
|
|
|
|
{
|
2006-08-28 21:38:09 +02:00
|
|
|
volatile PgBackendStatus *beentry = MyBEEntry;
|
2006-06-19 03:51:22 +02:00
|
|
|
|
2009-08-12 22:53:31 +02:00
|
|
|
/*
|
2010-02-26 03:01:40 +01:00
|
|
|
* If we got as far as discovering our own database ID, we can report what
|
|
|
|
* we did to the collector. Otherwise, we'd be sending an invalid
|
2009-08-12 22:53:31 +02:00
|
|
|
* database ID, so forget it. (This means that accesses to pg_database
|
|
|
|
* during failed backend starts might never get counted.)
|
|
|
|
*/
|
|
|
|
if (OidIsValid(MyDatabaseId))
|
|
|
|
pgstat_report_stat(true);
|
2006-06-19 03:51:22 +02:00
|
|
|
|
|
|
|
/*
|
2006-10-04 02:30:14 +02:00
|
|
|
* Clear my status entry, following the protocol of bumping st_changecount
|
|
|
|
* before and after. We use a volatile pointer here to ensure the
|
|
|
|
* compiler doesn't try to get cute.
|
2006-06-19 03:51:22 +02:00
|
|
|
*/
|
2019-05-12 03:27:13 +02:00
|
|
|
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
|
2006-06-19 03:51:22 +02:00
|
|
|
|
|
|
|
beentry->st_procpid = 0; /* mark invalid */
|
|
|
|
|
2019-05-12 03:27:13 +02:00
|
|
|
PGSTAT_END_WRITE_ACTIVITY(beentry);
|
2006-06-19 03:51:22 +02:00
|
|
|
}
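The fix the commit message describes for pgstat_bestart() is a staging pattern: do every fallible or slow step on a local copy of the struct, so that the changecount-protected section shrinks to a straight-line memcpy. A hedged sketch with made-up names (not the real pgstat symbols) of that shape:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: stage all work locally, then copy inside the
 * "critical section" (modelled here as just the two counter bumps). */
typedef struct
{
    int  pid;
    char appname[16];
} DemoStatus;

static DemoStatus   shared_entry;
static volatile int demo_changecount;

static void
demo_bestart(int pid, const char *app)
{
    DemoStatus  local;

    /* Step 1: formatting, lookups, and anything else that could throw or
     * take time happens here, outside the critical section. */
    memset(&local, 0, sizeof(local));
    local.pid = pid;
    snprintf(local.appname, sizeof(local.appname), "%s", app);

    /* Step 2: the critical section is only the bumps and a memcpy, so a
     * reader never spins for long and no error can fire in between. */
    demo_changecount++;         /* odd: write in progress */
    memcpy(&shared_entry, &local, sizeof(local));
    demo_changecount++;         /* even: consistent */
}
```

The out-of-line strings in the real PgBackendStatus are the wrinkle this sketch glosses over: they live outside the struct, so pgstat_bestart() still has to fill them separately within the critical section.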
|
|
|
|
|
|
|
|
|
|
|
|
/* ----------
|
|
|
|
* pgstat_report_activity() -
|
|
|
|
*
|
|
|
|
* Called from tcop/postgres.c to report what the backend is actually doing
|
2013-04-03 20:13:28 +02:00
|
|
|
* (but note cmd_str can be NULL for certain cases).
|
2012-01-19 14:19:20 +01:00
|
|
|
*
|
|
|
|
* All updates of the status entry follow the protocol of bumping
|
|
|
|
* st_changecount before and after. We use a volatile pointer here to
|
|
|
|
* ensure the compiler doesn't try to get cute.
|
2006-06-19 03:51:22 +02:00
|
|
|
* ----------
|
|
|
|
*/
|
|
|
|
void
|
2012-01-19 14:19:20 +01:00
|
|
|
pgstat_report_activity(BackendState state, const char *cmd_str)
|
2006-06-19 03:51:22 +02:00
|
|
|
{
|
2006-08-28 21:38:09 +02:00
|
|
|
volatile PgBackendStatus *beentry = MyBEEntry;
|
2006-06-19 03:51:22 +02:00
|
|
|
TimestampTz start_timestamp;
|
2012-01-19 14:19:20 +01:00
|
|
|
TimestampTz current_timestamp;
|
2012-01-24 19:40:26 +01:00
|
|
|
int len = 0;
|
2006-06-19 03:51:22 +02:00
|
|
|
|
2008-08-01 15:16:09 +02:00
|
|
|
TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
|
|
|
|
|
2012-01-19 14:19:20 +01:00
|
|
|
if (!beentry)
|
2006-06-19 03:51:22 +02:00
|
|
|
return;
|
|
|
|
|
2013-04-03 20:13:28 +02:00
|
|
|
if (!pgstat_track_activities)
|
2012-01-19 14:19:20 +01:00
|
|
|
{
|
2013-04-03 20:13:28 +02:00
|
|
|
if (beentry->st_state != STATE_DISABLED)
|
|
|
|
{
|
2016-03-10 18:44:09 +01:00
|
|
|
volatile PGPROC *proc = MyProc;
|
|
|
|
|
2013-04-03 20:13:28 +02:00
|
|
|
/*
|
|
|
|
* track_activities is disabled, but we last reported a
|
2014-05-06 18:12:18 +02:00
|
|
|
* non-disabled state. As our final update, change the state and
|
2013-04-03 20:13:28 +02:00
|
|
|
* clear fields we will not be updating anymore.
|
|
|
|
*/
|
2019-05-12 03:27:13 +02:00
|
|
|
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
|
2013-04-03 20:13:28 +02:00
|
|
|
beentry->st_state = STATE_DISABLED;
|
|
|
|
beentry->st_state_start_timestamp = 0;
|
2017-09-19 20:46:07 +02:00
|
|
|
beentry->st_activity_raw[0] = '\0';
|
2013-04-03 20:13:28 +02:00
|
|
|
beentry->st_activity_start_timestamp = 0;
|
2016-03-10 18:44:09 +01:00
|
|
|
/* st_xact_start_timestamp and wait_event_info are also disabled */
|
2013-04-03 20:13:28 +02:00
|
|
|
beentry->st_xact_start_timestamp = 0;
|
2016-03-10 18:44:09 +01:00
|
|
|
proc->wait_event_info = 0;
|
2019-05-12 03:27:13 +02:00
|
|
|
PGSTAT_END_WRITE_ACTIVITY(beentry);
|
2013-04-03 20:13:28 +02:00
|
|
|
}
|
2012-01-19 14:19:20 +01:00
|
|
|
return;
|
|
|
|
}
|
2006-06-19 03:51:22 +02:00
|
|
|
|
|
|
|
/*
|
2019-05-12 03:27:13 +02:00
|
|
|
* To minimize the time spent modifying the entry, and avoid risk of
|
|
|
|
* errors inside the critical section, fetch all the needed data first.
|
2012-01-19 14:19:20 +01:00
|
|
|
*/
|
|
|
|
start_timestamp = GetCurrentStatementStartTimestamp();
|
|
|
|
if (cmd_str != NULL)
|
|
|
|
{
|
2017-09-19 20:46:07 +02:00
|
|
|
/*
|
|
|
|
* Compute length of to-be-stored string unaware of multi-byte
|
|
|
|
* characters. For speed reasons that'll get corrected on read, rather
|
|
|
|
* than computed every write.
|
|
|
|
*/
|
|
|
|
len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
|
2012-01-19 14:19:20 +01:00
|
|
|
}
|
2013-04-03 20:13:28 +02:00
|
|
|
current_timestamp = GetCurrentTimestamp();
|
2012-01-19 14:19:20 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Now update the status entry
|
2006-06-19 03:51:22 +02:00
|
|
|
*/
|
2019-05-12 03:27:13 +02:00
|
|
|
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
|
2006-06-19 03:51:22 +02:00
|
|
|
|
2012-01-19 14:19:20 +01:00
|
|
|
beentry->st_state = state;
|
|
|
|
beentry->st_state_start_timestamp = current_timestamp;
|
|
|
|
|
|
|
|
if (cmd_str != NULL)
|
|
|
|
{
|
2017-09-19 20:46:07 +02:00
|
|
|
memcpy((char *) beentry->st_activity_raw, cmd_str, len);
|
|
|
|
beentry->st_activity_raw[len] = '\0';
|
2012-01-19 14:19:20 +01:00
|
|
|
beentry->st_activity_start_timestamp = start_timestamp;
|
|
|
|
}
|
2006-06-19 03:51:22 +02:00
|
|
|
|
2019-05-12 03:27:13 +02:00
	PGSTAT_END_WRITE_ACTIVITY(beentry);
}
Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.
As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else. A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting. But this gets the basic framework in place.
Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 18:08:58 +01:00

/*-----------
 * pgstat_progress_start_command() -
 *
 * Set st_progress_command (and st_progress_command_target) in own backend
 * entry.  Also, zero-initialize st_progress_param array.
 *-----------
 */
void
pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
{
	volatile PgBackendStatus *beentry = MyBEEntry;

	if (!beentry || !pgstat_track_activities)
		return;

	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);

	beentry->st_progress_command = cmdtype;
	beentry->st_progress_command_target = relid;
	MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));

	PGSTAT_END_WRITE_ACTIVITY(beentry);
}

/*-----------
 * pgstat_progress_update_param() -
 *
 * Update index'th member in st_progress_param[] of own backend entry.
 *-----------
 */
void
pgstat_progress_update_param(int index, int64 val)
{
	volatile PgBackendStatus *beentry = MyBEEntry;

	Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);

	if (!beentry || !pgstat_track_activities)
		return;

	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);

	beentry->st_progress_param[index] = val;

	PGSTAT_END_WRITE_ACTIVITY(beentry);
}

/*-----------
 * pgstat_progress_update_multi_param() -
 *
 * Update multiple members in st_progress_param[] of own backend entry.
 * This is atomic; readers won't see intermediate states.
 *-----------
 */
void
pgstat_progress_update_multi_param(int nparam, const int *index,
								   const int64 *val)
{
	volatile PgBackendStatus *beentry = MyBEEntry;
	int			i;

	if (!beentry || !pgstat_track_activities || nparam == 0)
		return;

	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);

	for (i = 0; i < nparam; ++i)
	{
		Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);

		beentry->st_progress_param[index[i]] = val[i];
	}

	PGSTAT_END_WRITE_ACTIVITY(beentry);
}
|
|
|
|
|
|
|
/*-----------
|
|
|
|
* pgstat_progress_end_command() -
|
|
|
|
*
|
2016-03-10 12:07:57 +01:00
|
|
|
* Reset st_progress_command (and st_progress_command_target) in own backend
|
|
|
|
* entry. This signals the end of the command.
|
Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.
As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else. A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting. But this gets the basic framework in place.
Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 18:08:58 +01:00
|
|
|
*-----------
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
pgstat_progress_end_command(void)
|
|
|
|
{
|
|
|
|
volatile PgBackendStatus *beentry = MyBEEntry;
|
|
|
|
|
2019-09-04 08:46:37 +02:00
|
|
|
if (!beentry || !pgstat_track_activities)
|
Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.
As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else. A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting. But this gets the basic framework in place.
Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 18:08:58 +01:00
|
|
|
return;
|
2019-09-04 08:46:37 +02:00
|
|
|
|
|
|
|
if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
|
Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.
As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else. A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting. But this gets the basic framework in place.
Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 18:08:58 +01:00
|
|
|
return;
|
|
|
|
|
Rearrange pgstat_bestart() to avoid failures within its critical section.
We long ago decided to design the shared PgBackendStatus data structure to
minimize the cost of writing status updates, which means that writers just
have to increment the st_changecount field twice. That isn't hooked into
any sort of resource management mechanism, which means that if something
were to throw error between the two increments, the st_changecount field
would be left odd indefinitely. That would cause readers to lock up.
Now, since it's also a bad idea to leave the field odd for longer than
absolutely necessary (because readers will spin while we have it set),
the expectation was that we'd treat these segments like spinlock critical
sections, with only short, more or less straight-line, code in them.
That was fine as originally designed, but commit 9029f4b37 broke it
by inserting a significant amount of non-straight-line code into
pgstat_bestart(), code that is very capable of throwing errors, not to
mention taking a significant amount of time during which readers will spin.
We have a report from Neeraj Kumar of readers actually locking up, which
I suspect was due to an encoding conversion error in X509_NAME_to_cstring,
though conceivably it was just a garden-variety OOM failure.
Subsequent commits have loaded even more dubious code into pgstat_bestart's
critical section (and commit fc70a4b0d deserves some kind of booby prize
for managing to miss the critical section entirely, although the negative
consequences seem minimal given that the PgBackendStatus entry should be
seen by readers as inactive at that point).
The right way to fix this mess seems to be to compute all these values
into a local copy of the process' PgBackendStatus struct, and then just
copy the data back within the critical section proper. This plan can't
be implemented completely cleanly because of the struct's heavy reliance
on out-of-line strings, which we must initialize separately within the
critical section. But still, the critical section is far smaller and
safer than it was before.
In hopes of forestalling future errors of the same ilk, rename the
macros for st_changecount management to make it more apparent that
the writer-side macros create a critical section. And to prevent
the worst consequences if we nonetheless manage to mess it up anyway,
adjust those macros so that they really are a critical section, ie
they now bump CritSectionCount. That doesn't add much overhead, and
it guarantees that if we do somehow throw an error while the counter
is odd, it will lead to PANIC and a database restart to reset shared
memory.
Back-patch to 9.5 where the problem was introduced.
In HEAD, also fix an oversight in commit b0b39f72b: it failed to teach
pgstat_read_current_status to copy st_gssstatus data from shared memory to
local memory. Hence, subsequent use of that data within the transaction
would potentially see changing data that it shouldn't see.
Discussion: https://postgr.es/m/CAPR3Wj5Z17=+eeyrn_ZDG3NQGYgMEOY6JV6Y-WRRhGgwc16U3Q@mail.gmail.com
2019-05-12 03:27:13 +02:00
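The st_changecount protocol this commit message describes can be sketched outside PostgreSQL as a seqlock-style counter. The names below (`DemoStatus`, `demo_begin_write`, etc.) are illustrative stand-ins, not the real `PGSTAT_*` macros or `PgBackendStatus`:

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-in for PgBackendStatus: an even st_changecount
 * means the entry is consistent; odd means a write is in progress. */
typedef struct DemoStatus
{
	int			st_changecount;
	char		st_appname[64];
} DemoStatus;

/* Writer side: bump the counter to odd, mutate the fields, bump back
 * to even.  Anything that can throw an error must happen *before*
 * this window, which is the point of the commit above. */
static void
demo_begin_write(volatile DemoStatus *e)
{
	e->st_changecount++;		/* now odd: readers will retry */
	assert((e->st_changecount & 1) == 1);
}

static void
demo_end_write(volatile DemoStatus *e)
{
	e->st_changecount++;		/* back to even: entry consistent */
	assert((e->st_changecount & 1) == 0);
}

/* Reader side: retry until the counter is even and unchanged across
 * the copy, guaranteeing a torn-free snapshot. */
static void
demo_read(volatile DemoStatus *e, DemoStatus *out)
{
	for (;;)
	{
		int			before = e->st_changecount;

		memcpy(out, (DemoStatus *) e, sizeof(*out));
		if (before == e->st_changecount && (before & 1) == 0)
			break;				/* consistent copy obtained */
	}
}
```

In a single-threaded demonstration the reader succeeds on the first pass; the retry loop only matters when a concurrent writer holds the counter odd.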
	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.
As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else. A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting. But this gets the basic framework in place.
Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 18:08:58 +01:00
	beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
	beentry->st_progress_command_target = InvalidOid;
	PGSTAT_END_WRITE_ACTIVITY(beentry);
}

/* ----------
 * pgstat_report_appname() -
 *
 *	Called to update our application name.
 * ----------
 */
void
pgstat_report_appname(const char *appname)
{
	volatile PgBackendStatus *beentry = MyBEEntry;
	int			len;

	if (!beentry)
		return;

	/* This should be unnecessary if GUC did its job, but be safe */
	len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);

	/*
	 * Update my status entry, following the protocol of bumping
	 * st_changecount before and after.  We use a volatile pointer here to
	 * ensure the compiler doesn't try to get cute.
	 */
	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);

	memcpy((char *) beentry->st_appname, appname, len);
	beentry->st_appname[len] = '\0';
	PGSTAT_END_WRITE_ACTIVITY(beentry);
}

/*
 * Report current transaction start timestamp as the specified value.
 * Zero means there is no active transaction.
 */
void
pgstat_report_xact_timestamp(TimestampTz tstamp)
{
	volatile PgBackendStatus *beentry = MyBEEntry;

	if (!pgstat_track_activities || !beentry)
		return;

	/*
	 * Update my status entry, following the protocol of bumping
	 * st_changecount before and after.  We use a volatile pointer here to
	 * ensure the compiler doesn't try to get cute.
	 */
	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);

	beentry->st_xact_start_timestamp = tstamp;
	PGSTAT_END_WRITE_ACTIVITY(beentry);
}

/* ----------
 * pgstat_read_current_status() -
 *
 *	Copy the current contents of the PgBackendStatus array to local memory,
 *	if not already done in this transaction.
 * ----------
 */
static void
pgstat_read_current_status(void)
{
	volatile PgBackendStatus *beentry;
	LocalPgBackendStatus *localtable;
	LocalPgBackendStatus *localentry;
	char	   *localappname,
			   *localclienthostname,
			   *localactivity;
#ifdef USE_SSL
	PgBackendSSLStatus *localsslstatus;
#endif
#ifdef ENABLE_GSS
	PgBackendGSSStatus *localgssstatus;
#endif
	int			i;

	Assert(!pgStatRunningInCollector);
	if (localBackendStatusTable)
		return;					/* already done */

	pgstat_setup_memcxt();

	/*
	 * Allocate storage for local copy of state data.  We can presume that
	 * none of these requests overflow size_t, because we already calculated
	 * the same values using mul_size during shmem setup.  However, with
	 * probably-silly values of pgstat_track_activity_query_size and
	 * max_connections, the localactivity buffer could exceed 1GB, so use
	 * "huge" allocation for that one.
	 */
	localtable = (LocalPgBackendStatus *)
		MemoryContextAlloc(pgStatLocalContext,
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
						   sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
	localappname = (char *)
		MemoryContextAlloc(pgStatLocalContext,
						   NAMEDATALEN * NumBackendStatSlots);
	localclienthostname = (char *)
		MemoryContextAlloc(pgStatLocalContext,
						   NAMEDATALEN * NumBackendStatSlots);
	localactivity = (char *)
		MemoryContextAllocHuge(pgStatLocalContext,
							   pgstat_track_activity_query_size * NumBackendStatSlots);
#ifdef USE_SSL
	localsslstatus = (PgBackendSSLStatus *)
		MemoryContextAlloc(pgStatLocalContext,
						   sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
#endif
#ifdef ENABLE_GSS
	localgssstatus = (PgBackendGSSStatus *)
		MemoryContextAlloc(pgStatLocalContext,
						   sizeof(PgBackendGSSStatus) * NumBackendStatSlots);
#endif

	localNumBackends = 0;

	beentry = BackendStatusArray;
	localentry = localtable;
	for (i = 1; i <= NumBackendStatSlots; i++)
	{
		/*
		 * Follow the protocol of retrying if st_changecount changes while we
		 * copy the entry, or if it's odd.  (The check for odd is needed to
		 * cover the case where we are able to completely copy the entry while
		 * the source backend is between increment steps.)  We use a volatile
		 * pointer here to ensure the compiler doesn't try to get cute.
		 */
		for (;;)
		{
			int			before_changecount;
			int			after_changecount;
2019-05-12 03:27:13 +02:00
			pgstat_begin_read_activity(beentry, before_changecount);

			localentry->backendStatus.st_procpid = beentry->st_procpid;
			/* Skip all the data-copying work if entry is not in use */
			if (localentry->backendStatus.st_procpid > 0)
			{
				memcpy(&localentry->backendStatus, unvolatize(PgBackendStatus *, beentry), sizeof(PgBackendStatus));

				/*
				 * For each PgBackendStatus field that is a pointer, copy the
				 * pointed-to data, then adjust the local copy of the pointer
				 * field to point at the local copy of the data.
				 *
				 * strcpy is safe even if the string is modified concurrently,
				 * because there's always a \0 at the end of the buffer.
				 */
				strcpy(localappname, (char *) beentry->st_appname);
				localentry->backendStatus.st_appname = localappname;
				strcpy(localclienthostname, (char *) beentry->st_clienthostname);
				localentry->backendStatus.st_clienthostname = localclienthostname;
				strcpy(localactivity, (char *) beentry->st_activity_raw);
				localentry->backendStatus.st_activity_raw = localactivity;

#ifdef USE_SSL
				if (beentry->st_ssl)
				{
					memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
					localentry->backendStatus.st_sslstatus = localsslstatus;
				}
#endif
#ifdef ENABLE_GSS
				if (beentry->st_gss)
				{
					memcpy(localgssstatus, beentry->st_gssstatus, sizeof(PgBackendGSSStatus));
					localentry->backendStatus.st_gssstatus = localgssstatus;
				}
#endif
			}

			pgstat_end_read_activity(beentry, after_changecount);

			if (pgstat_read_activity_complete(before_changecount,
											  after_changecount))
				break;

			/* Make sure we can break out of loop if stuck... */
			CHECK_FOR_INTERRUPTS();
		}

		beentry++;
		/* Only valid entries get included into the local array */
		if (localentry->backendStatus.st_procpid > 0)
		{
			BackendIdGetTransactionIds(i,
									   &localentry->backend_xid,
									   &localentry->backend_xmin);

			localentry++;
			localappname += NAMEDATALEN;
			localclienthostname += NAMEDATALEN;
			localactivity += pgstat_track_activity_query_size;
#ifdef USE_SSL
			localsslstatus++;
#endif
#ifdef ENABLE_GSS
			localgssstatus++;
#endif
			localNumBackends++;
		}
	}

	/* Set the pointer only after completion of a valid table */
	localBackendStatusTable = localtable;
}

/* ----------
 * pgstat_get_wait_event_type() -
 *
 *	Return a string representing the type of wait event the backend
 *	is currently waiting on.
 */
const char *
pgstat_get_wait_event_type(uint32 wait_event_info)
{
	uint32		classId;
	const char *event_type;

	/* report process as not waiting. */
	if (wait_event_info == 0)
		return NULL;

	classId = wait_event_info & 0xFF000000;

	switch (classId)
	{
		case PG_WAIT_LWLOCK:
			event_type = "LWLock";
			break;
		case PG_WAIT_LOCK:
			event_type = "Lock";
			break;
		case PG_WAIT_BUFFER_PIN:
			event_type = "BufferPin";
			break;
		case PG_WAIT_ACTIVITY:
			event_type = "Activity";
			break;
		case PG_WAIT_CLIENT:
			event_type = "Client";
			break;
		case PG_WAIT_EXTENSION:
			event_type = "Extension";
			break;
		case PG_WAIT_IPC:
			event_type = "IPC";
			break;
		case PG_WAIT_TIMEOUT:
			event_type = "Timeout";
			break;
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
		case PG_WAIT_IO:
			event_type = "IO";
			break;
		default:
			event_type = "???";
			break;
	}

	return event_type;
}

/* ----------
 * pgstat_get_wait_event() -
 *
 *	Return a string representing the wait event the backend
 *	is currently waiting on.
 */
const char *
pgstat_get_wait_event(uint32 wait_event_info)
{
	uint32		classId;
	uint16		eventId;
	const char *event_name;

	/* report process as not waiting. */
	if (wait_event_info == 0)
		return NULL;

	classId = wait_event_info & 0xFF000000;
	eventId = wait_event_info & 0x0000FFFF;

	switch (classId)
	{
		case PG_WAIT_LWLOCK:
			event_name = GetLWLockIdentifier(classId, eventId);
			break;
		case PG_WAIT_LOCK:
			event_name = GetLockNameFromTagType(eventId);
			break;
		case PG_WAIT_BUFFER_PIN:
			event_name = "BufferPin";
			break;
		case PG_WAIT_ACTIVITY:
			{
				WaitEventActivity w = (WaitEventActivity) wait_event_info;

				event_name = pgstat_get_wait_activity(w);
				break;
			}
		case PG_WAIT_CLIENT:
			{
				WaitEventClient w = (WaitEventClient) wait_event_info;

				event_name = pgstat_get_wait_client(w);
				break;
			}
		case PG_WAIT_EXTENSION:
			event_name = "Extension";
			break;
		case PG_WAIT_IPC:
			{
				WaitEventIPC w = (WaitEventIPC) wait_event_info;

				event_name = pgstat_get_wait_ipc(w);
				break;
			}
		case PG_WAIT_TIMEOUT:
			{
				WaitEventTimeout w = (WaitEventTimeout) wait_event_info;

				event_name = pgstat_get_wait_timeout(w);
				break;
			}
		case PG_WAIT_IO:
			{
				WaitEventIO w = (WaitEventIO) wait_event_info;

				event_name = pgstat_get_wait_io(w);
				break;
			}
		default:
			event_name = "unknown wait event";
			break;
	}

	return event_name;
}

/* ----------
 * pgstat_get_wait_activity() -
 *
 * Convert WaitEventActivity to string.
 * ----------
 */
static const char *
pgstat_get_wait_activity(WaitEventActivity w)
{
	const char *event_name = "unknown wait event";

	switch (w)
	{
		case WAIT_EVENT_ARCHIVER_MAIN:
			event_name = "ArchiverMain";
			break;
		case WAIT_EVENT_AUTOVACUUM_MAIN:
			event_name = "AutoVacuumMain";
			break;
		case WAIT_EVENT_BGWRITER_HIBERNATE:
			event_name = "BgWriterHibernate";
			break;
		case WAIT_EVENT_BGWRITER_MAIN:
			event_name = "BgWriterMain";
			break;
		case WAIT_EVENT_CHECKPOINTER_MAIN:
			event_name = "CheckpointerMain";
			break;
		case WAIT_EVENT_LOGICAL_APPLY_MAIN:
			event_name = "LogicalApplyMain";
			break;
		case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
			event_name = "LogicalLauncherMain";
			break;
		case WAIT_EVENT_PGSTAT_MAIN:
			event_name = "PgStatMain";
			break;
		case WAIT_EVENT_RECOVERY_WAL_STREAM:
			event_name = "RecoveryWalStream";
			break;
		case WAIT_EVENT_SYSLOGGER_MAIN:
			event_name = "SysLoggerMain";
			break;
		case WAIT_EVENT_WAL_RECEIVER_MAIN:
			event_name = "WalReceiverMain";
			break;
		case WAIT_EVENT_WAL_SENDER_MAIN:
			event_name = "WalSenderMain";
			break;
		case WAIT_EVENT_WAL_WRITER_MAIN:
			event_name = "WalWriterMain";
			break;
			/* no default case, so that compiler will warn */
	}

	return event_name;
}

/* ----------
 * pgstat_get_wait_client() -
 *
 * Convert WaitEventClient to string.
 * ----------
 */
static const char *
pgstat_get_wait_client(WaitEventClient w)
{
	const char *event_name = "unknown wait event";

	switch (w)
	{
		case WAIT_EVENT_CLIENT_READ:
			event_name = "ClientRead";
			break;
		case WAIT_EVENT_CLIENT_WRITE:
			event_name = "ClientWrite";
			break;
		case WAIT_EVENT_GSS_OPEN_SERVER:
			event_name = "GSSOpenServer";
			break;
		case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
			event_name = "LibPQWalReceiverConnect";
			break;
		case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
			event_name = "LibPQWalReceiverReceive";
			break;
		case WAIT_EVENT_SSL_OPEN_SERVER:
			event_name = "SSLOpenServer";
			break;
		case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
			event_name = "WalReceiverWaitStart";
			break;
		case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
			event_name = "WalSenderWaitForWAL";
			break;
		case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
			event_name = "WalSenderWriteData";
			break;
			/* no default case, so that compiler will warn */
	}

	return event_name;
}

/* ----------
 * pgstat_get_wait_ipc() -
 *
 * Convert WaitEventIPC to string.
 * ----------
 */
static const char *
pgstat_get_wait_ipc(WaitEventIPC w)
{
	const char *event_name = "unknown wait event";

	switch (w)
	{
		case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
			event_name = "BackupWaitWalArchive";
			break;
		case WAIT_EVENT_BGWORKER_SHUTDOWN:
			event_name = "BgWorkerShutdown";
			break;
		case WAIT_EVENT_BGWORKER_STARTUP:
			event_name = "BgWorkerStartup";
			break;
		case WAIT_EVENT_BTREE_PAGE:
			event_name = "BtreePage";
			break;
		case WAIT_EVENT_CHECKPOINT_DONE:
			event_name = "CheckpointDone";
			break;
		case WAIT_EVENT_CHECKPOINT_START:
			event_name = "CheckpointStart";
			break;
		case WAIT_EVENT_EXECUTE_GATHER:
			event_name = "ExecuteGather";
			break;
		case WAIT_EVENT_HASH_BATCH_ALLOCATE:
			event_name = "HashBatchAllocate";
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
break;
|
2020-05-17 03:00:05 +02:00
|
|
|
case WAIT_EVENT_HASH_BATCH_ELECT:
|
|
|
|
event_name = "HashBatchElect";
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
break;
|
2020-05-17 03:00:05 +02:00
|
|
|
case WAIT_EVENT_HASH_BATCH_LOAD:
|
|
|
|
event_name = "HashBatchLoad";
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
break;
|
2020-05-17 03:00:05 +02:00
|
|
|
case WAIT_EVENT_HASH_BUILD_ALLOCATE:
|
|
|
|
event_name = "HashBuildAllocate";
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
break;
|
2020-05-17 03:00:05 +02:00
|
|
|
case WAIT_EVENT_HASH_BUILD_ELECT:
|
|
|
|
event_name = "HashBuildElect";
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
break;
|
2020-05-17 03:00:05 +02:00
|
|
|
case WAIT_EVENT_HASH_BUILD_HASH_INNER:
|
|
|
|
event_name = "HashBuildHashInner";
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
break;
|
2020-05-17 03:00:05 +02:00
|
|
|
case WAIT_EVENT_HASH_BUILD_HASH_OUTER:
|
|
|
|
event_name = "HashBuildHashOuter";
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
break;
		case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATE:
			event_name = "HashGrowBatchesAllocate";
			break;
		case WAIT_EVENT_HASH_GROW_BATCHES_DECIDE:
			event_name = "HashGrowBatchesDecide";
			break;
		case WAIT_EVENT_HASH_GROW_BATCHES_ELECT:
			event_name = "HashGrowBatchesElect";
			break;
		case WAIT_EVENT_HASH_GROW_BATCHES_FINISH:
			event_name = "HashGrowBatchesFinish";
			break;
		case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITION:
			event_name = "HashGrowBatchesRepartition";
			break;
		case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATE:
			event_name = "HashGrowBucketsAllocate";
			break;
		case WAIT_EVENT_HASH_GROW_BUCKETS_ELECT:
			event_name = "HashGrowBucketsElect";
			break;
		case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERT:
			event_name = "HashGrowBucketsReinsert";
			break;
2017-08-08 21:37:44 +02:00
|
|
|
case WAIT_EVENT_LOGICAL_SYNC_DATA:
|
|
|
|
event_name = "LogicalSyncData";
|
|
|
|
break;
|
|
|
|
case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
|
|
|
|
event_name = "LogicalSyncStateChange";
|
|
|
|
break;
|
2016-10-04 16:50:13 +02:00
|
|
|
case WAIT_EVENT_MQ_INTERNAL:
|
|
|
|
event_name = "MessageQueueInternal";
|
|
|
|
break;
|
|
|
|
case WAIT_EVENT_MQ_PUT_MESSAGE:
|
|
|
|
event_name = "MessageQueuePutMessage";
|
|
|
|
break;
|
|
|
|
case WAIT_EVENT_MQ_RECEIVE:
|
|
|
|
event_name = "MessageQueueReceive";
|
|
|
|
break;
|
|
|
|
case WAIT_EVENT_MQ_SEND:
|
|
|
|
event_name = "MessageQueueSend";
|
|
|
|
break;
|
Support parallel bitmap heap scans.
The index is scanned by a single process, but then all cooperating
processes can iterate jointly over the resulting set of heap blocks.
In the future, we might also want to support using a parallel bitmap
index scan to set up for a parallel bitmap heap scan, but that's a
job for another day.
Dilip Kumar, with some corrections and cosmetic changes by me. The
larger patch set of which this is a part has been reviewed and tested
by (at least) Andres Freund, Amit Khandekar, Tushar Ahuja, Rafia
Sabih, Haribabu Kommi, Thomas Munro, and me.
Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 18:05:43 +01:00
|
|
|
case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
|
|
|
|
event_name = "ParallelBitmapScan";
|
|
|
|
break;
|
Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
2018-02-02 19:25:55 +01:00
|
|
|
case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
|
|
|
|
event_name = "ParallelCreateIndexScan";
|
|
|
|
break;
|
2018-10-24 10:02:37 +02:00
|
|
|
case WAIT_EVENT_PARALLEL_FINISH:
|
|
|
|
event_name = "ParallelFinish";
|
|
|
|
break;
|
2017-04-07 19:41:47 +02:00
|
|
|
case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
|
|
|
|
event_name = "ProcArrayGroupUpdate";
|
|
|
|
break;
|
2020-05-17 03:00:05 +02:00
|
|
|
case WAIT_EVENT_PROC_SIGNAL_BARRIER:
|
|
|
|
event_name = "ProcSignalBarrier";
|
|
|
|
break;
|
2018-10-25 02:46:00 +02:00
|
|
|
case WAIT_EVENT_PROMOTE:
|
|
|
|
event_name = "Promote";
|
|
|
|
break;
|
2020-04-03 05:15:56 +02:00
|
|
|
case WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT:
|
|
|
|
event_name = "RecoveryConflictSnapshot";
|
|
|
|
break;
|
|
|
|
case WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE:
|
|
|
|
event_name = "RecoveryConflictTablespace";
|
|
|
|
break;
|
2020-03-24 03:12:21 +01:00
|
|
|
case WAIT_EVENT_RECOVERY_PAUSE:
|
|
|
|
event_name = "RecoveryPause";
|
|
|
|
break;
|
2017-08-08 22:07:46 +02:00
|
|
|
case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
|
|
|
|
event_name = "ReplicationOriginDrop";
|
|
|
|
break;
|
2017-08-08 21:37:44 +02:00
|
|
|
case WAIT_EVENT_REPLICATION_SLOT_DROP:
|
|
|
|
event_name = "ReplicationSlotDrop";
|
|
|
|
break;
|
2016-10-04 16:50:13 +02:00
|
|
|
case WAIT_EVENT_SAFE_SNAPSHOT:
|
|
|
|
event_name = "SafeSnapshot";
|
|
|
|
break;
|
|
|
|
case WAIT_EVENT_SYNC_REP:
|
|
|
|
event_name = "SyncRep";
|
|
|
|
break;
|
2020-05-17 03:00:05 +02:00
|
|
|
case WAIT_EVENT_XACT_GROUP_UPDATE:
|
|
|
|
event_name = "XactGroupUpdate";
|
|
|
|
break;
|
2017-05-17 22:31:56 +02:00
|
|
|
/* no default case, so that compiler will warn */
|
2016-10-04 16:50:13 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
return event_name;
|
|
|
|
}

/* ----------
 * pgstat_get_wait_timeout() -
 *
 * Convert WaitEventTimeout to string.
 * ----------
 */
static const char *
pgstat_get_wait_timeout(WaitEventTimeout w)
{
	const char *event_name = "unknown wait event";

	switch (w)
	{
		case WAIT_EVENT_BASE_BACKUP_THROTTLE:
			event_name = "BaseBackupThrottle";
			break;
		case WAIT_EVENT_PG_SLEEP:
			event_name = "PgSleep";
			break;
		case WAIT_EVENT_RECOVERY_APPLY_DELAY:
			event_name = "RecoveryApplyDelay";
			break;
		case WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL:
			event_name = "RecoveryRetrieveRetryInterval";
			break;
		case WAIT_EVENT_VACUUM_DELAY:
			event_name = "VacuumDelay";
			break;

			/* no default case, so that compiler will warn */
	}

	return event_name;
}

/* ----------
 * pgstat_get_wait_io() -
 *
 * Convert WaitEventIO to string.
 * ----------
 */
static const char *
pgstat_get_wait_io(WaitEventIO w)
{
	const char *event_name = "unknown wait event";

	switch (w)
	{
		case WAIT_EVENT_BASEBACKUP_READ:
			event_name = "BaseBackupRead";
			break;
		case WAIT_EVENT_BUFFILE_READ:
			event_name = "BufFileRead";
			break;
		case WAIT_EVENT_BUFFILE_WRITE:
			event_name = "BufFileWrite";
			break;
		case WAIT_EVENT_BUFFILE_TRUNCATE:
			event_name = "BufFileTruncate";
			break;
		case WAIT_EVENT_CONTROL_FILE_READ:
			event_name = "ControlFileRead";
			break;
		case WAIT_EVENT_CONTROL_FILE_SYNC:
			event_name = "ControlFileSync";
			break;
		case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
			event_name = "ControlFileSyncUpdate";
			break;
		case WAIT_EVENT_CONTROL_FILE_WRITE:
			event_name = "ControlFileWrite";
			break;
		case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
			event_name = "ControlFileWriteUpdate";
			break;
		case WAIT_EVENT_COPY_FILE_READ:
			event_name = "CopyFileRead";
			break;
		case WAIT_EVENT_COPY_FILE_WRITE:
			event_name = "CopyFileWrite";
			break;
		case WAIT_EVENT_DATA_FILE_EXTEND:
			event_name = "DataFileExtend";
			break;
		case WAIT_EVENT_DATA_FILE_FLUSH:
			event_name = "DataFileFlush";
			break;
		case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
			event_name = "DataFileImmediateSync";
			break;
		case WAIT_EVENT_DATA_FILE_PREFETCH:
			event_name = "DataFilePrefetch";
			break;
		case WAIT_EVENT_DATA_FILE_READ:
			event_name = "DataFileRead";
			break;
		case WAIT_EVENT_DATA_FILE_SYNC:
			event_name = "DataFileSync";
			break;
		case WAIT_EVENT_DATA_FILE_TRUNCATE:
			event_name = "DataFileTruncate";
			break;
		case WAIT_EVENT_DATA_FILE_WRITE:
			event_name = "DataFileWrite";
			break;
		case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
			event_name = "DSMFillZeroWrite";
			break;
		case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
			event_name = "LockFileAddToDataDirRead";
			break;
		case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
			event_name = "LockFileAddToDataDirSync";
			break;
		case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
			event_name = "LockFileAddToDataDirWrite";
			break;
		case WAIT_EVENT_LOCK_FILE_CREATE_READ:
			event_name = "LockFileCreateRead";
			break;
		case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
			event_name = "LockFileCreateSync";
			break;
		case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
			event_name = "LockFileCreateWrite";
			break;
		case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
			event_name = "LockFileReCheckDataDirRead";
			break;
		case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
			event_name = "LogicalRewriteCheckpointSync";
			break;
		case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
			event_name = "LogicalRewriteMappingSync";
			break;
		case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
			event_name = "LogicalRewriteMappingWrite";
			break;
		case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
			event_name = "LogicalRewriteSync";
			break;
		case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
			event_name = "LogicalRewriteTruncate";
			break;
		case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
			event_name = "LogicalRewriteWrite";
			break;
		case WAIT_EVENT_RELATION_MAP_READ:
			event_name = "RelationMapRead";
			break;
		case WAIT_EVENT_RELATION_MAP_SYNC:
			event_name = "RelationMapSync";
			break;
		case WAIT_EVENT_RELATION_MAP_WRITE:
			event_name = "RelationMapWrite";
			break;
		case WAIT_EVENT_REORDER_BUFFER_READ:
			event_name = "ReorderBufferRead";
			break;
		case WAIT_EVENT_REORDER_BUFFER_WRITE:
			event_name = "ReorderBufferWrite";
			break;
		case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
			event_name = "ReorderLogicalMappingRead";
			break;
		case WAIT_EVENT_REPLICATION_SLOT_READ:
			event_name = "ReplicationSlotRead";
			break;
		case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
			event_name = "ReplicationSlotRestoreSync";
			break;
		case WAIT_EVENT_REPLICATION_SLOT_SYNC:
			event_name = "ReplicationSlotSync";
			break;
		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
			event_name = "ReplicationSlotWrite";
			break;
		case WAIT_EVENT_SLRU_FLUSH_SYNC:
			event_name = "SLRUFlushSync";
			break;
		case WAIT_EVENT_SLRU_READ:
			event_name = "SLRURead";
			break;
		case WAIT_EVENT_SLRU_SYNC:
			event_name = "SLRUSync";
			break;
		case WAIT_EVENT_SLRU_WRITE:
			event_name = "SLRUWrite";
			break;
		case WAIT_EVENT_SNAPBUILD_READ:
			event_name = "SnapbuildRead";
			break;
		case WAIT_EVENT_SNAPBUILD_SYNC:
			event_name = "SnapbuildSync";
			break;
		case WAIT_EVENT_SNAPBUILD_WRITE:
			event_name = "SnapbuildWrite";
			break;
		case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
			event_name = "TimelineHistoryFileSync";
			break;
		case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
			event_name = "TimelineHistoryFileWrite";
			break;
		case WAIT_EVENT_TIMELINE_HISTORY_READ:
			event_name = "TimelineHistoryRead";
			break;
		case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
			event_name = "TimelineHistorySync";
			break;
		case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
			event_name = "TimelineHistoryWrite";
			break;
		case WAIT_EVENT_TWOPHASE_FILE_READ:
			event_name = "TwophaseFileRead";
			break;
		case WAIT_EVENT_TWOPHASE_FILE_SYNC:
			event_name = "TwophaseFileSync";
			break;
		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
			event_name = "TwophaseFileWrite";
			break;
		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
			event_name = "WALSenderTimelineHistoryRead";
			break;
		case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
			event_name = "WALBootstrapSync";
			break;
		case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
			event_name = "WALBootstrapWrite";
			break;
		case WAIT_EVENT_WAL_COPY_READ:
			event_name = "WALCopyRead";
			break;
		case WAIT_EVENT_WAL_COPY_SYNC:
			event_name = "WALCopySync";
			break;
		case WAIT_EVENT_WAL_COPY_WRITE:
			event_name = "WALCopyWrite";
			break;
		case WAIT_EVENT_WAL_INIT_SYNC:
			event_name = "WALInitSync";
			break;
		case WAIT_EVENT_WAL_INIT_WRITE:
			event_name = "WALInitWrite";
			break;
		case WAIT_EVENT_WAL_READ:
			event_name = "WALRead";
			break;
		case WAIT_EVENT_WAL_SYNC:
			event_name = "WALSync";
			break;
		case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
			event_name = "WALSyncMethodAssign";
			break;
		case WAIT_EVENT_WAL_WRITE:
			event_name = "WALWrite";
			break;
		case WAIT_EVENT_LOGICAL_CHANGES_READ:
			event_name = "LogicalChangesRead";
			break;
		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
			event_name = "LogicalChangesWrite";
			break;
		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
			event_name = "LogicalSubxactRead";
			break;
		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
			event_name = "LogicalSubxactWrite";
			break;

			/* no default case, so that compiler will warn */
	}

	return event_name;
}

/* ----------
 * pgstat_get_backend_current_activity() -
 *
 * Return a string representing the current activity of the backend with
 * the specified PID.  This looks directly at the BackendStatusArray,
 * and so will provide current information regardless of the age of our
 * transaction's snapshot of the status array.
 *
 * It is the caller's responsibility to invoke this only for backends whose
 * state is expected to remain stable while the result is in use.  The
 * only current use is in deadlock reporting, where we can expect that
 * the target backend is blocked on a lock.  (There are corner cases
 * where the target's wait could get aborted while we are looking at it,
 * but the very worst consequence is to return a pointer to a string
 * that's been changed, so we won't worry too much.)
 *
 * Note: return strings for special cases match pg_stat_get_backend_activity.
 * ----------
 */
const char *
pgstat_get_backend_current_activity(int pid, bool checkUser)
{
	PgBackendStatus *beentry;
	int			i;

	beentry = BackendStatusArray;
	for (i = 1; i <= MaxBackends; i++)
	{
		/*
		 * Although we expect the target backend's entry to be stable, that
		 * doesn't imply that anyone else's is.  To avoid identifying the
		 * wrong backend, while we check for a match to the desired PID we
		 * must follow the protocol of retrying if st_changecount changes
		 * while we examine the entry, or if it's odd.  (This might be
		 * unnecessary, since fetching or storing an int is almost certainly
		 * atomic, but let's play it safe.)  We use a volatile pointer here
		 * to ensure the compiler doesn't try to get cute.
		 */
		volatile PgBackendStatus *vbeentry = beentry;
		bool		found;

		for (;;)
		{
			int			before_changecount;
			int			after_changecount;

			pgstat_begin_read_activity(vbeentry, before_changecount);

			found = (vbeentry->st_procpid == pid);

			pgstat_end_read_activity(vbeentry, after_changecount);

			if (pgstat_read_activity_complete(before_changecount,
											  after_changecount))
				break;

			/* Make sure we can break out of loop if stuck... */
			CHECK_FOR_INTERRUPTS();
		}

		if (found)
		{
			/* Now it is safe to use the non-volatile pointer */
			if (checkUser && !superuser() && beentry->st_userid != GetUserId())
				return "<insufficient privilege>";
			else if (*(beentry->st_activity_raw) == '\0')
				return "<command string not enabled>";
			else
			{
				/* this'll leak a bit of memory, but that seems acceptable */
				return pgstat_clip_activity(beentry->st_activity_raw);
			}
		}

		beentry++;
	}

	/* If we get here, caller is in error ... */
	return "<backend information not available>";
}

/* ----------
 * pgstat_get_crashed_backend_activity() -
 *
 *	Return a string representing the current activity of the backend with
 *	the specified PID.  Like the function above, but reads shared memory with
 *	the expectation that it may be corrupt.  On success, copy the string
 *	into the "buffer" argument and return that pointer.  On failure,
 *	return NULL.
 *
 *	This function is only intended to be used by the postmaster to report the
 *	query that crashed a backend.  In particular, no attempt is made to
 *	follow the correct concurrency protocol when accessing the
 *	BackendStatusArray.  But that's OK, in the worst case we'll return a
 *	corrupted message.  We also must take care not to trip on ereport(ERROR).
 * ----------
 */
const char *
pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
{
	volatile PgBackendStatus *beentry;
	int			i;

	beentry = BackendStatusArray;

	/*
	 * We probably shouldn't get here before shared memory has been set up,
	 * but be safe.
	 */
	if (beentry == NULL || BackendActivityBuffer == NULL)
		return NULL;

	for (i = 1; i <= MaxBackends; i++)
	{
		if (beentry->st_procpid == pid)
		{
			/* Read pointer just once, so it can't change after validation */
			const char *activity = beentry->st_activity_raw;
			const char *activity_last;

			/*
			 * We mustn't access activity string before we verify that it
			 * falls within the BackendActivityBuffer. To make sure that the
			 * entire string including its ending is contained within the
			 * buffer, subtract one activity length from the buffer size.
			 */
			activity_last = BackendActivityBuffer + BackendActivityBufferSize
				- pgstat_track_activity_query_size;

			if (activity < BackendActivityBuffer ||
				activity > activity_last)
				return NULL;

			/* If no string available, no point in a report */
			if (activity[0] == '\0')
				return NULL;

			/*
			 * Copy only ASCII-safe characters so we don't run into encoding
			 * problems when reporting the message; and be sure not to run off
			 * the end of memory.  As only ASCII characters are reported, it
			 * doesn't seem necessary to perform multibyte aware clipping.
			 */
			ascii_safe_strlcpy(buffer, activity,
							   Min(buflen, pgstat_track_activity_query_size));

			return buffer;
		}

		beentry++;
	}

	/* PID not found */
	return NULL;
}

/* ------------------------------------------------------------
 * Local support functions follow
 * ------------------------------------------------------------
 */


/* ----------
 * pgstat_setheader() -
 *
 *		Set common header fields in a statistics message
 * ----------
 */
static void
pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
{
	hdr->m_type = mtype;
}

/* ----------
 * pgstat_send() -
 *
 *		Send out one statistics message to the collector
 * ----------
 */
static void
pgstat_send(void *msg, int len)
{
	int			rc;

	if (pgStatSock == PGINVALID_SOCKET)
		return;

	((PgStat_MsgHdr *) msg)->m_size = len;

	/* We'll retry after EINTR, but ignore all other failures */
	do
	{
		rc = send(pgStatSock, msg, len, 0);
	} while (rc < 0 && errno == EINTR);

#ifdef USE_ASSERT_CHECKING
	/* In debug builds, log send failures ... */
	if (rc < 0)
		elog(LOG, "could not send to statistics collector: %m");
#endif
}

/* ----------
 * pgstat_send_archiver() -
 *
 *	Tell the collector about the WAL file that we successfully
 *	archived or failed to archive.
 * ----------
 */
void
pgstat_send_archiver(const char *xlog, bool failed)
{
	PgStat_MsgArchiver msg;

	/*
	 * Prepare and send the message
	 */
	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
	msg.m_failed = failed;
	strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
	msg.m_timestamp = GetCurrentTimestamp();
	pgstat_send(&msg, sizeof(msg));
}

/* ----------
 * pgstat_send_bgwriter() -
 *
 *		Send bgwriter statistics to the collector
 * ----------
 */
void
pgstat_send_bgwriter(void)
{
	/* We assume this initializes to zeroes */
	static const PgStat_MsgBgWriter all_zeroes;

	/*
	 * This function can be called even if nothing at all has happened. In
	 * this case, avoid sending a completely empty message to the stats
	 * collector.
	 */
	if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
		return;

	/*
	 * Prepare and send the message
	 */
	pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
	pgstat_send(&BgWriterStats, sizeof(BgWriterStats));

	/*
	 * Clear out the statistics buffer, so it can be re-used.
	 */
	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
}

/* ----------
 * pgstat_send_wal() -
 *
 *		Send WAL statistics to the collector
 * ----------
 */
void
pgstat_send_wal(void)
{
	/* We assume this initializes to zeroes */
	static const PgStat_MsgWal all_zeroes;

	WalUsage	walusage;

	/*
	 * Calculate how much the WAL usage counters have increased by
	 * subtracting the previous counters from the current ones. Fill the
	 * results in the WAL stats message.
	 */
	MemSet(&walusage, 0, sizeof(WalUsage));
	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

	WalStats.m_wal_records = walusage.wal_records;
	WalStats.m_wal_fpi = walusage.wal_fpi;
	WalStats.m_wal_bytes = walusage.wal_bytes;

	/*
	 * This function can be called even if nothing at all has happened. In
	 * this case, avoid sending a completely empty message to the stats
	 * collector.
	 */
	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
		return;

	/*
	 * Prepare and send the message
	 */
	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
	pgstat_send(&WalStats, sizeof(WalStats));

	/*
	 * Save the current counters for the subsequent calculation of WAL usage.
	 */
	prevWalUsage = pgWalUsage;

	/*
	 * Clear out the statistics buffer, so it can be re-used.
	 */
	MemSet(&WalStats, 0, sizeof(WalStats));
}

Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00

Improve management of SLRU statistics collection.
Instead of re-identifying which statistics bucket to use for a given
SLRU on every counter increment, do it once during shmem initialization.
This saves a fair number of cycles, and there's no real cost because
we could not have a bucket assignment that varies over time or across
backends anyway.
Also, get rid of the ill-considered decision to let pgstat.c pry
directly into SLRU's shared state; it's cleaner just to have slru.c
pass the stats bucket number.
In consequence of these changes, there's no longer any need to store
an SLRU's LWLock tranche info in shared memory, so get rid of that,
making this a net reduction in shmem consumption. (That partly
reverts fe702a7b3.)
This is basically code review for 28cac71bd, so I also cleaned up
some comments, removed a dangling extern declaration, fixed some
things that should be static and/or const, etc.
Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
2020-05-13 19:08:12 +02:00

/* ----------
 * pgstat_send_slru() -
 *
 *		Send SLRU statistics to the collector
 * ----------
 */
static void
pgstat_send_slru(void)
{
	/* We assume this initializes to zeroes */
	static const PgStat_MsgSLRU all_zeroes;

	for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
	{
		/*
		 * This function can be called even if nothing at all has happened. In
		 * this case, avoid sending a completely empty message to the stats
		 * collector.
		 */
		if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
			continue;

		/* set the SLRU type before each send */
		SLRUStats[i].m_index = i;

		/*
		 * Prepare and send the message
		 */
		pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
		pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));

		/*
		 * Clear out the statistics buffer, so it can be re-used.
		 */
		MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
	}
}

Leave SIGTTIN/SIGTTOU signal handling alone in postmaster child processes.
For reasons lost in the mists of time, most postmaster child processes
reset SIGTTIN/SIGTTOU signal handling to SIG_DFL, with the major exception
that backend sessions do not. It seems like a pretty bad idea for any
postmaster children to do that: if stderr is connected to the terminal,
and the user has put the postmaster in background, any log output would
result in the child process freezing up. Hence, switch them all to
doing what backends do, ie, nothing. This allows them to inherit the
postmaster's SIG_IGN setting. On the other hand, manually-launched
processes such as standalone backends will have default processing,
which seems fine.
In passing, also remove useless resets of SIGCONT and SIGWINCH signal
processing. Perhaps the postmaster once changed those to something
besides SIG_DFL, but it doesn't now, so these are just wasted (and
confusing) syscalls.
Basically, this propagates the changes made in commit 8e2998d8a from
backends to other postmaster children. Probably the only reason these
calls now exist elsewhere is that I missed changing pgstat.c along with
postgres.c at the time.
Given the lack of field complaints that can be traced to this, I don't
presently feel a need to back-patch.
Discussion: https://postgr.es/m/5627.1542477392@sss.pgh.pa.us
2018-11-17 22:23:55 +01:00

Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00

/* ----------
 * PgstatCollectorMain() -
 *
 *	Start up the statistics collector process.  This is the body of the
 *	postmaster child process.
 *
 *	The argc/argv parameters are valid only in EXEC_BACKEND case.
 * ----------
 */
NON_EXEC_STATIC void
PgstatCollectorMain(int argc, char *argv[])
{
	int			len;
	PgStat_Msg	msg;
	int			wr;
	WaitEvent	event;
	WaitEventSet *wes;

	/*
	 * Ignore all signals usually bound to some action in the postmaster,
	 * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
	 * support latch operations, because we only use a local latch.
	 */
	pqsignal(SIGHUP, SignalHandlerForConfigReload);
	pqsignal(SIGINT, SIG_IGN);
	pqsignal(SIGTERM, SIG_IGN);
	pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
	pqsignal(SIGALRM, SIG_IGN);
	pqsignal(SIGPIPE, SIG_IGN);
	pqsignal(SIGUSR1, SIG_IGN);
	pqsignal(SIGUSR2, SIG_IGN);
	/* Reset some signals that are accepted by postmaster but not here */
	pqsignal(SIGCHLD, SIG_DFL);
	PG_SETMASK(&UnBlockSig);

	MyBackendType = B_STATS_COLLECTOR;
	init_ps_display(NULL);

	/*
	 * Read in existing stats files or initialize the stats to zero.
	 */
	pgStatRunningInCollector = true;
	pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);

	/* Prepare to wait for our latch or data in our socket. */
	wes = CreateWaitEventSet(CurrentMemoryContext, 3);
	AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
	AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
	AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);

	/*
	 * Loop to process messages until we get SIGQUIT or detect ungraceful
	 * death of our parent postmaster.
	 *
	 * For performance reasons, we don't want to do ResetLatch/WaitLatch after
	 * every message; instead, do that only after a recv() fails to obtain a
	 * message.  (This effectively means that if backends are sending us stuff
	 * like mad, we won't notice postmaster death until things slack off a
	 * bit; which seems fine.)  To do that, we have an inner loop that
	 * iterates as long as recv() succeeds.  We do check ConfigReloadPending
	 * inside the inner loop, which means that such interrupts will get
	 * serviced but the latch won't get cleared until next time there is a
	 * break in the action.
	 */
	for (;;)
	{
		/* Clear any already-pending wakeups */
		ResetLatch(MyLatch);

		/*
		 * Quit if we get SIGQUIT from the postmaster.
		 */
		if (ShutdownRequestPending)
			break;

		/*
		 * Inner loop iterates as long as we keep getting messages, or until
		 * ShutdownRequestPending becomes set.
		 */
		while (!ShutdownRequestPending)
		{
			/*
			 * Reload configuration if we got SIGHUP from the postmaster.
			 */
			if (ConfigReloadPending)
			{
				ConfigReloadPending = false;
				ProcessConfigFile(PGC_SIGHUP);
			}

			/*
			 * Write the stats file(s) if a new request has arrived that is
			 * not satisfied by existing file(s).
			 */
			if (pgstat_write_statsfile_needed())
				pgstat_write_statsfiles(false, false);

			/*
			 * Try to receive and process a message.  This will not block,
			 * since the socket is set to non-blocking mode.
			 *
			 * XXX On Windows, we have to force pgwin32_recv to cooperate,
			 * despite the previous use of pg_set_noblock() on the socket.
			 * This is extremely broken and should be fixed someday.
			 */
|
2012-05-14 16:57:07 +02:00
|
|
|
#ifdef WIN32
|
|
|
|
pgwin32_noblock = 1;
|
|
|
|
#endif
|
|
|
|
|
2006-06-29 22:00:08 +02:00
|
|
|
len = recv(pgStatSock, (char *) &msg,
|
|
|
|
sizeof(PgStat_Msg), 0);
|
2012-05-14 16:57:07 +02:00
|
|
|
|
|
|
|
#ifdef WIN32
|
|
|
|
pgwin32_noblock = 0;
|
|
|
|
#endif
|
|
|
|
|
2006-06-29 22:00:08 +02:00
|
|
|
if (len < 0)
|
2006-07-16 20:17:14 +02:00
|
|
|
{
|
2012-05-13 20:44:39 +02:00
|
|
|
if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
|
|
|
|
break; /* out of inner loop */
|
2006-06-29 22:00:08 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_socket_access(),
|
|
|
|
errmsg("could not read statistics message: %m")));
|
2006-07-16 20:17:14 +02:00
|
|
|
}
|
2006-06-29 22:00:08 +02:00
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2006-06-29 22:00:08 +02:00
|
|
|
* We ignore messages that are smaller than our common header
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2006-06-29 22:00:08 +02:00
|
|
|
if (len < sizeof(PgStat_MsgHdr))
|
|
|
|
continue;
|
2001-10-25 07:50:21 +02:00
|
|
|
|
2001-08-05 04:06:50 +02:00
|
|
|
/*
|
2006-06-29 22:00:08 +02:00
|
|
|
* The received length must match the length in the header
|
2001-08-05 04:06:50 +02:00
|
|
|
*/
|
2006-06-29 22:00:08 +02:00
|
|
|
if (msg.msg_hdr.m_size != len)
|
|
|
|
continue;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/*
|
2006-06-29 22:00:08 +02:00
|
|
|
* O.K. - we accept this message. Process it.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
|
|
|
switch (msg.msg_hdr.m_type)
|
|
|
|
{
|
|
|
|
case PGSTAT_MTYPE_DUMMY:
|
|
|
|
break;
|
|
|
|
|
2008-11-03 02:17:08 +01:00
|
|
|
case PGSTAT_MTYPE_INQUIRY:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_inquiry(&msg.msg_inquiry, len);
|
2008-11-03 02:17:08 +01:00
|
|
|
break;
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
case PGSTAT_MTYPE_TABSTAT:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_tabstat(&msg.msg_tabstat, len);
|
2001-06-22 21:18:36 +02:00
|
|
|
break;
|
|
|
|
|
|
|
|
case PGSTAT_MTYPE_TABPURGE:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
|
2001-06-22 21:18:36 +02:00
|
|
|
break;
|
|
|
|
|
|
|
|
case PGSTAT_MTYPE_DROPDB:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_dropdb(&msg.msg_dropdb, len);
|
2001-06-22 21:18:36 +02:00
|
|
|
break;
|
|
|
|
|
|
|
|
case PGSTAT_MTYPE_RESETCOUNTER:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
|
2001-06-22 21:18:36 +02:00
|
|
|
break;
|
|
|
|
|
2010-01-19 15:11:32 +01:00
|
|
|
case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
|
2020-01-30 17:42:14 +01:00
|
|
|
pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
|
2010-02-26 03:01:40 +01:00
|
|
|
len);
|
2010-01-19 15:11:32 +01:00
|
|
|
break;
|
|
|
|
|
2010-01-28 15:25:41 +01:00
|
|
|
case PGSTAT_MTYPE_RESETSINGLECOUNTER:
|
2020-01-30 17:42:14 +01:00
|
|
|
pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
|
2010-02-26 03:01:40 +01:00
|
|
|
len);
|
2010-01-28 15:25:41 +01:00
|
|
|
break;
|
|
|
|
|
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
case PGSTAT_MTYPE_RESETSLRUCOUNTER:
|
|
|
|
pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
|
|
|
|
len);
|
|
|
|
break;
|
|
|
|
|
2020-10-08 05:39:08 +02:00
|
|
|
case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
|
|
|
|
pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
|
|
|
|
len);
|
|
|
|
break;
|
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
case PGSTAT_MTYPE_AUTOVAC_START:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
|
2005-07-14 07:13:45 +02:00
|
|
|
break;
|
|
|
|
|
|
|
|
case PGSTAT_MTYPE_VACUUM:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_vacuum(&msg.msg_vacuum, len);
|
2005-07-14 07:13:45 +02:00
|
|
|
break;
|
|
|
|
|
|
|
|
case PGSTAT_MTYPE_ANALYZE:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_analyze(&msg.msg_analyze, len);
|
2005-07-14 07:13:45 +02:00
|
|
|
break;
|
|
|
|
|
2014-01-28 18:58:22 +01:00
|
|
|
case PGSTAT_MTYPE_ARCHIVER:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_archiver(&msg.msg_archiver, len);
|
2014-01-28 18:58:22 +01:00
|
|
|
break;
|
|
|
|
|
2007-03-30 20:34:56 +02:00
|
|
|
case PGSTAT_MTYPE_BGWRITER:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
|
2007-03-30 20:34:56 +02:00
|
|
|
break;
|
|
|
|
|
2020-10-02 03:17:11 +02:00
|
|
|
case PGSTAT_MTYPE_WAL:
|
|
|
|
pgstat_recv_wal(&msg.msg_wal, len);
|
|
|
|
break;
|
|
|
|
|
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
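The kind of derived metric the commit message aims for (hit ratios per SLRU) can be sketched with plain counters. A minimal illustration; the `blocks_hit`/`blocks_read` names mirror the counters the collector tracks per SLRU, but this standalone struct and function are assumptions for illustration, not the real `PgStat_SLRUStats` machinery:

```c
#include <stdint.h>

/*
 * Hypothetical per-SLRU counters; the real ones carry more fields
 * (blocks_zeroed, blocks_written, flushes, truncates, ...).
 */
typedef struct SlruCounters
{
	int64_t		blocks_hit;		/* page found in the SLRU buffers */
	int64_t		blocks_read;	/* page had to be read from disk */
} SlruCounters;

/* Cache hit ratio in [0, 1]; defined as 0 when there were no accesses. */
double
slru_hit_ratio(const SlruCounters *c)
{
	int64_t		total = c->blocks_hit + c->blocks_read;

	return (total == 0) ? 0.0 : (double) c->blocks_hit / (double) total;
}
```

In practice the same ratio would be computed in SQL over the `pg_stat_slru` view rather than in C.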
|
|
|
case PGSTAT_MTYPE_SLRU:
|
|
|
|
pgstat_recv_slru(&msg.msg_slru, len);
|
|
|
|
break;
|
|
|
|
|
2009-06-11 16:49:15 +02:00
|
|
|
case PGSTAT_MTYPE_FUNCSTAT:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_funcstat(&msg.msg_funcstat, len);
|
2009-06-11 16:49:15 +02:00
|
|
|
break;
|
2008-05-15 02:17:41 +02:00
|
|
|
|
|
|
|
case PGSTAT_MTYPE_FUNCPURGE:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
|
2008-05-15 02:17:41 +02:00
|
|
|
break;
|
|
|
|
|
2011-01-03 12:46:03 +01:00
|
|
|
case PGSTAT_MTYPE_RECOVERYCONFLICT:
|
2020-01-30 17:42:14 +01:00
|
|
|
pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
|
2019-05-01 12:30:44 +02:00
|
|
|
len);
|
2011-01-03 12:46:03 +01:00
|
|
|
break;
|
|
|
|
|
2012-01-26 15:58:19 +01:00
|
|
|
case PGSTAT_MTYPE_DEADLOCK:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_deadlock(&msg.msg_deadlock, len);
|
2012-01-26 15:58:19 +01:00
|
|
|
break;
|
|
|
|
|
2012-01-26 14:41:19 +01:00
|
|
|
case PGSTAT_MTYPE_TEMPFILE:
|
2019-05-01 12:30:44 +02:00
|
|
|
pgstat_recv_tempfile(&msg.msg_tempfile, len);
|
2012-01-26 14:41:19 +01:00
|
|
|
break;
|
|
|
|
|
2019-03-09 19:45:17 +01:00
|
|
|
case PGSTAT_MTYPE_CHECKSUMFAILURE:
|
2020-01-30 17:42:14 +01:00
|
|
|
pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
|
2019-05-01 12:30:44 +02:00
|
|
|
len);
|
2019-03-09 19:45:17 +01:00
|
|
|
break;
|
|
|
|
|
2020-10-08 05:39:08 +02:00
|
|
|
case PGSTAT_MTYPE_REPLSLOT:
|
|
|
|
pgstat_recv_replslot(&msg.msg_replslot, len);
|
|
|
|
break;
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
2012-05-13 20:44:39 +02:00
|
|
|
} /* end of inner message-processing loop */
|
|
|
|
|
|
|
|
/* Sleep until there's something to do */
|
2012-05-15 05:51:34 +02:00
|
|
|
#ifndef WIN32
|
2020-07-30 07:25:48 +02:00
|
|
|
wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
|
2012-05-15 05:51:34 +02:00
|
|
|
#else
|
2012-06-10 21:20:04 +02:00
|
|
|
|
2012-05-15 05:51:34 +02:00
|
|
|
/*
|
|
|
|
* Windows, at least in its Windows Server 2003 R2 incarnation,
|
2014-05-06 18:12:18 +02:00
|
|
|
* sometimes loses FD_READ events. Waking up and retrying the recv()
|
2012-05-15 05:51:34 +02:00
|
|
|
* fixes that, so don't sleep indefinitely. This is a crock of the
|
|
|
|
* first water, but until somebody wants to debug exactly what's
|
|
|
|
* happening there, this is the best we can do. The two-second
|
|
|
|
* timeout matches our pre-9.2 behavior, and needs to be short enough
|
2015-01-20 05:01:33 +01:00
|
|
|
* to not provoke "using stale statistics" complaints from
|
2012-05-15 05:51:34 +02:00
|
|
|
* backend_read_statsfile.
|
|
|
|
*/
|
2020-07-30 07:25:48 +02:00
|
|
|
wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
|
|
|
|
WAIT_EVENT_PGSTAT_MAIN);
|
2012-05-15 05:51:34 +02:00
|
|
|
#endif
|
2012-05-13 20:44:39 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Emergency bailout if postmaster has died. This is to avoid the
|
|
|
|
* necessity for manual cleanup of all postmaster children.
|
|
|
|
*/
|
2020-07-30 07:25:48 +02:00
|
|
|
if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
|
2012-05-13 20:44:39 +02:00
|
|
|
break;
|
|
|
|
} /* end of outer loop */
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2006-06-29 22:00:08 +02:00
|
|
|
/*
|
|
|
|
* Save the final stats to reuse at next startup.
|
|
|
|
*/
|
2013-02-18 21:56:08 +01:00
|
|
|
pgstat_write_statsfiles(true, true);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2020-07-30 07:25:48 +02:00
|
|
|
FreeWaitEventSet(wes);
|
|
|
|
|
2006-06-29 22:00:08 +02:00
|
|
|
exit(0);
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
2013-02-18 21:56:08 +01:00
|
|
|
/*
|
|
|
|
* Subroutine to clear stats in a database entry
|
|
|
|
*
|
|
|
|
* The table and function hash tables are initialized to empty.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
|
|
|
|
{
|
|
|
|
HASHCTL hash_ctl;
|
|
|
|
|
|
|
|
dbentry->n_xact_commit = 0;
|
|
|
|
dbentry->n_xact_rollback = 0;
|
|
|
|
dbentry->n_blocks_fetched = 0;
|
|
|
|
dbentry->n_blocks_hit = 0;
|
|
|
|
dbentry->n_tuples_returned = 0;
|
|
|
|
dbentry->n_tuples_fetched = 0;
|
|
|
|
dbentry->n_tuples_inserted = 0;
|
|
|
|
dbentry->n_tuples_updated = 0;
|
|
|
|
dbentry->n_tuples_deleted = 0;
|
|
|
|
dbentry->last_autovac_time = 0;
|
|
|
|
dbentry->n_conflict_tablespace = 0;
|
|
|
|
dbentry->n_conflict_lock = 0;
|
|
|
|
dbentry->n_conflict_snapshot = 0;
|
|
|
|
dbentry->n_conflict_bufferpin = 0;
|
|
|
|
dbentry->n_conflict_startup_deadlock = 0;
|
|
|
|
dbentry->n_temp_files = 0;
|
|
|
|
dbentry->n_temp_bytes = 0;
|
|
|
|
dbentry->n_deadlocks = 0;
|
2019-03-09 19:45:17 +01:00
|
|
|
dbentry->n_checksum_failures = 0;
|
2019-04-12 14:04:50 +02:00
|
|
|
dbentry->last_checksum_failure = 0;
|
2013-02-18 21:56:08 +01:00
|
|
|
dbentry->n_block_read_time = 0;
|
|
|
|
dbentry->n_block_write_time = 0;
|
|
|
|
|
|
|
|
dbentry->stat_reset_timestamp = GetCurrentTimestamp();
|
|
|
|
dbentry->stats_timestamp = 0;
|
|
|
|
|
|
|
|
hash_ctl.keysize = sizeof(Oid);
|
|
|
|
hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
|
|
|
|
dbentry->tables = hash_create("Per-database table",
|
|
|
|
PGSTAT_TAB_HASH_SIZE,
|
|
|
|
&hash_ctl,
|
Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
|
|
|
HASH_ELEM | HASH_BLOBS);
|
2013-02-18 21:56:08 +01:00
|
|
|
|
|
|
|
hash_ctl.keysize = sizeof(Oid);
|
|
|
|
hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
|
|
|
|
dbentry->functions = hash_create("Per-database function",
|
|
|
|
PGSTAT_FUNCTION_HASH_SIZE,
|
|
|
|
&hash_ctl,
|
2014-12-18 19:36:29 +01:00
|
|
|
HASH_ELEM | HASH_BLOBS);
|
2013-02-18 21:56:08 +01:00
|
|
|
}
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2005-05-11 03:41:41 +02:00
|
|
|
/*
|
|
|
|
* Look up the hash table entry for the specified database.  If no such
|
2005-07-29 21:30:09 +02:00
|
|
|
* entry exists, initialize it if the 'create' parameter is true;
|
|
|
|
* otherwise return NULL.
|
2005-05-11 03:41:41 +02:00
|
|
|
*/
|
|
|
|
static PgStat_StatDBEntry *
|
2005-07-29 21:30:09 +02:00
|
|
|
pgstat_get_db_entry(Oid databaseid, bool create)
|
2005-05-11 03:41:41 +02:00
|
|
|
{
|
|
|
|
PgStat_StatDBEntry *result;
|
2005-10-15 04:49:52 +02:00
|
|
|
bool found;
|
|
|
|
HASHACTION action = (create ? HASH_ENTER : HASH_FIND);
|
2005-05-11 03:41:41 +02:00
|
|
|
|
|
|
|
/* Lookup or create the hash table entry for this database */
|
|
|
|
result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
|
|
|
|
&databaseid,
|
2005-07-29 21:30:09 +02:00
|
|
|
action, &found);
|
|
|
|
|
|
|
|
if (!create && !found)
|
|
|
|
return NULL;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2013-02-18 21:56:08 +01:00
|
|
|
/*
|
|
|
|
* If not found, initialize the new one. This creates empty hash tables
|
|
|
|
* for tables and functions, too.
|
|
|
|
*/
|
2001-06-22 21:18:36 +02:00
|
|
|
if (!found)
|
2013-02-18 21:56:08 +01:00
|
|
|
reset_dbentry_counters(result);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2005-05-11 03:41:41 +02:00
|
|
|
return result;
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2009-09-05 00:32:33 +02:00
|
|
|
/*
|
|
|
|
* Look up the hash table entry for the specified table.  If no such
|
|
|
|
* entry exists, initialize it if the 'create' parameter is true;
|
|
|
|
* otherwise return NULL.
|
|
|
|
*/
|
|
|
|
static PgStat_StatTabEntry *
|
|
|
|
pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
|
|
|
|
{
|
|
|
|
PgStat_StatTabEntry *result;
|
|
|
|
bool found;
|
|
|
|
HASHACTION action = (create ? HASH_ENTER : HASH_FIND);
|
|
|
|
|
|
|
|
/* Lookup or create the hash table entry for this table */
|
|
|
|
result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
|
|
|
|
&tableoid,
|
|
|
|
action, &found);
|
|
|
|
|
|
|
|
if (!create && !found)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/* If not found, initialize the new one. */
|
|
|
|
if (!found)
|
|
|
|
{
|
|
|
|
result->numscans = 0;
|
|
|
|
result->tuples_returned = 0;
|
|
|
|
result->tuples_fetched = 0;
|
|
|
|
result->tuples_inserted = 0;
|
|
|
|
result->tuples_updated = 0;
|
|
|
|
result->tuples_deleted = 0;
|
|
|
|
result->tuples_hot_updated = 0;
|
|
|
|
result->n_live_tuples = 0;
|
|
|
|
result->n_dead_tuples = 0;
|
Revise pgstat's tracking of tuple changes to improve the reliability of
decisions about when to auto-analyze.
The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
where all three of these numbers could be bad estimates from ANALYZE itself.
Even worse, in the presence of a steady flow of HOT updates and matching
HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
three numbers are exactly right, because n_dead_tuples could hold steady.
To fix, replace last_anl_tuples with an accurately tracked count of the total
number of committed tuple inserts + updates + deletes since the last ANALYZE
on the table. This can still be compared to the same threshold as before, but
it's much more trustworthy than the old computation. Tracking this requires
one more intra-transaction counter per modified table within backends, but no
additional memory space in the stats collector. There probably isn't any
measurable speed difference; if anything it might be a bit faster than before,
since I was able to eliminate some per-tuple arithmetic operations in favor of
adding sums once per (sub)transaction.
Also, simplify the logic around pgstat vacuum and analyze reporting messages
by not trying to fold VACUUM ANALYZE into a single pgstat message.
The original thought behind this patch was to allow scheduling of analyzes
on parent tables by artificially inflating their changes_since_analyze count.
I've left that for a separate patch since this change seems to stand on its
own merit.
2009-12-30 21:32:14 +01:00
|
|
|
result->changes_since_analyze = 0;
|
Trigger autovacuum based on number of INSERTs
Traditionally autovacuum has only ever invoked a worker based on the
estimated number of dead tuples in a table and for anti-wraparound
purposes. For the latter, with certain classes of tables such as
insert-only tables, anti-wraparound vacuums could be the first vacuum that
the table ever receives. This could often lead to autovacuum workers being
busy for extended periods of time due to having to potentially freeze
every page in the table. This could be particularly bad for very large
tables. New clusters, or recently pg_restored clusters could suffer even
more as many large tables may have the same relfrozenxid, which could
result in large numbers of tables requiring an anti-wraparound vacuum all
at once.
Here we aim to reduce the work required by anti-wraparound and aggressive
vacuums in general, by triggering autovacuum when the table has received
enough INSERTs. This is controlled by adding two new GUCs and reloptions;
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor. These work exactly the same as the
existing scale factor and threshold controls, only base themselves off the
number of inserts since the last vacuum, rather than the number of dead
tuples. New controls were added rather than reusing the existing
controls, to allow these new vacuums to be tuned independently and perhaps
even completely disabled altogether, which can be done by setting
autovacuum_vacuum_insert_threshold to -1.
We make no attempt to skip index cleanup operations on these vacuums as
they may trigger for an insert-mostly table which continually doesn't have
enough dead tuples to trigger an autovacuum for the purpose of removing
those dead tuples. If we were to skip cleaning the indexes in this case,
then it is possible for the index(es) to become bloated over time.
There are additional benefits to triggering autovacuums based on inserts,
as tables which never contain enough dead tuples to trigger an autovacuum
are now more likely to receive a vacuum, which can mark more of the table
as "allvisible" and encourage the query planner to make use of Index Only
Scans.
Currently, we still obey vacuum_freeze_min_age when triggering these new
autovacuums based on INSERTs. For large insert-only tables, it may be
beneficial to lower the table's autovacuum_freeze_min_age so that tuples
are eligible to be frozen sooner. Here we've opted not to zero that for
these types of vacuums, since the table may just be insert-mostly and we
may otherwise freeze tuples that are still destined to be updated or
removed in the near future.
There was some debate to what exactly the new scale factor and threshold
should default to. For now, these are set to 0.2 and 1000, respectively.
There may be some motivation to adjust these before the release.
Author: Laurenz Albe, Darafei Praliaskouski
Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
2020-03-28 07:20:12 +01:00
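The insert-driven trigger described above follows the same shape as the existing dead-tuple rule: vacuum once the inserts since the last vacuum exceed `threshold + scale_factor * reltuples`. A sketch under assumed parameter names; the real decision is made in autovacuum.c (not in pgstat.c), which merely consumes the `inserts_since_vacuum` counter tracked here:

```c
#include <stdbool.h>

/*
 * Sketch of the insert-based autovacuum trigger.  Parameter names are
 * assumptions for illustration; the actual computation lives in
 * autovacuum.c using autovacuum_vacuum_insert_threshold and
 * autovacuum_vacuum_insert_scale_factor.
 */
bool
needs_insert_vacuum(double inserts_since_vacuum, double reltuples,
					int vac_ins_base_thresh, double vac_ins_scale_factor)
{
	double		vacinsthresh = (double) vac_ins_base_thresh +
		vac_ins_scale_factor * reltuples;

	return inserts_since_vacuum > vacinsthresh;
}
```

With the defaults mentioned in the message (threshold 1000, scale factor 0.2), a 5000-tuple table would be vacuumed after roughly 2000 inserts.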
|
|
|
result->inserts_since_vacuum = 0;
|
2009-09-05 00:32:33 +02:00
|
|
|
result->blocks_fetched = 0;
|
|
|
|
result->blocks_hit = 0;
|
|
|
|
result->vacuum_timestamp = 0;
|
2010-08-21 12:59:17 +02:00
|
|
|
result->vacuum_count = 0;
|
2011-03-07 17:17:06 +01:00
|
|
|
result->autovac_vacuum_timestamp = 0;
|
2010-08-21 12:59:17 +02:00
|
|
|
result->autovac_vacuum_count = 0;
|
2011-03-07 17:17:06 +01:00
|
|
|
result->analyze_timestamp = 0;
|
2010-08-21 12:59:17 +02:00
|
|
|
result->analyze_count = 0;
|
2011-03-07 17:17:06 +01:00
|
|
|
result->autovac_analyze_timestamp = 0;
|
2010-08-21 12:59:17 +02:00
|
|
|
result->autovac_analyze_count = 0;
|
2009-09-05 00:32:33 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/* ----------
|
2013-02-18 21:56:08 +01:00
|
|
|
* pgstat_write_statsfiles() -
|
|
|
|
* Write the global statistics file, as well as requested DB files.
|
2001-06-22 21:18:36 +02:00
|
|
|
*
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
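The write-merging rule this fix restores reduces to a simple predicate: an inquiry asking for stats no newer than the last write is dropped rather than queued, so closely spaced requests collapse into a single write. A minimal sketch with an assumed helper name and a stand-in timestamp type; the real check sits inside pgstat_recv_inquiry():

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;	/* stand-in for the real TimestampTz type */

/*
 * Return true only if the requested cutoff is strictly newer than the
 * last stats write; otherwise the existing file already satisfies the
 * inquiry and no new write request should be queued.  Hypothetical
 * helper for illustration only.
 */
bool
inquiry_needs_write(TimestampTz last_statwrite, TimestampTz cutoff_time)
{
	return cutoff_time > last_statwrite;
}
```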
|
|
|
* 'permanent' specifies writing to the permanent files, not the temporary ones.
|
|
|
|
* When true (happens only when the collector is shutting down), also remove
|
|
|
|
* the temporary files so that backends starting up under a new postmaster
|
|
|
|
* can't read old data before the new collector is ready.
|
2013-02-18 21:56:08 +01:00
|
|
|
*
|
|
|
|
* When 'allDbs' is false, only the requested databases (listed in
|
2016-05-31 21:54:46 +02:00
|
|
|
* pending_write_requests) will be written; otherwise, all databases
|
|
|
|
* will be written.
|
2001-06-22 21:18:36 +02:00
|
|
|
* ----------
|
|
|
|
*/
|
|
|
|
static void
|
2013-02-18 21:56:08 +01:00
|
|
|
pgstat_write_statsfiles(bool permanent, bool allDbs)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2001-10-25 07:50:21 +02:00
|
|
|
HASH_SEQ_STATUS hstat;
|
|
|
|
PgStat_StatDBEntry *dbentry;
|
|
|
|
FILE *fpout;
|
2005-07-14 07:13:45 +02:00
|
|
|
int32 format_id;
|
2009-06-11 16:49:15 +02:00
|
|
|
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
|
|
|
|
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
|
2011-10-19 03:37:51 +02:00
|
|
|
int rc;
|
2020-10-08 05:39:08 +02:00
|
|
|
int i;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2014-12-11 21:41:15 +01:00
|
|
|
elog(DEBUG2, "writing stats file \"%s\"", statfile);
|
2013-02-18 21:56:08 +01:00
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2001-10-25 07:50:21 +02:00
|
|
|
* Open the statistics temp file to write out the current values.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2008-11-03 02:17:08 +01:00
|
|
|
fpout = AllocateFile(tmpfile, PG_BINARY_W);
|
2001-06-22 21:18:36 +02:00
|
|
|
if (fpout == NULL)
|
|
|
|
{
|
2003-07-22 21:00:12 +02:00
|
|
|
ereport(LOG,
|
|
|
|
(errcode_for_file_access(),
|
2005-10-15 04:49:52 +02:00
|
|
|
errmsg("could not open temporary statistics file \"%s\": %m",
|
2008-08-05 14:09:30 +02:00
|
|
|
tmpfile)));
|
2001-06-22 21:18:36 +02:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2008-11-03 02:17:08 +01:00
|
|
|
/*
|
|
|
|
* Set the timestamp of the stats file.
|
|
|
|
*/
|
|
|
|
globalStats.stats_timestamp = GetCurrentTimestamp();
|
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
/*
|
|
|
|
* Write the file header --- currently just a format ID.
|
|
|
|
*/
|
|
|
|
format_id = PGSTAT_FILE_FORMAT_ID;
|
2011-10-19 03:37:51 +02:00
|
|
|
rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
|
|
|
|
(void) rc; /* we'll check for error with ferror */
|
2005-07-14 07:13:45 +02:00
|
|
|
|
2007-03-30 20:34:56 +02:00
|
|
|
/*
|
|
|
|
* Write global stats struct
|
|
|
|
*/
|
2011-10-19 03:37:51 +02:00
|
|
|
rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
|
|
|
|
(void) rc; /* we'll check for error with ferror */
|
2007-03-30 20:34:56 +02:00
|
|
|
|
2014-01-28 18:58:22 +01:00
|
|
|
/*
|
|
|
|
* Write archiver stats struct
|
|
|
|
*/
|
|
|
|
rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
|
|
|
|
(void) rc; /* we'll check for error with ferror */
|
|
|
|
|
2020-10-02 03:17:11 +02:00
|
|
|
/*
|
|
|
|
* Write WAL stats struct
|
|
|
|
*/
|
|
|
|
rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
|
|
|
|
(void) rc; /* we'll check for error with ferror */
|
|
|
|
|
2020-04-02 02:11:38 +02:00
|
|
|
/*
|
|
|
|
* Write SLRU stats struct
|
|
|
|
*/
|
|
|
|
rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
|
|
|
|
(void) rc; /* we'll check for error with ferror */
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
|
|
|
* Walk through the database table.
|
|
|
|
*/
|
|
|
|
hash_seq_init(&hstat, pgStatDBHash);
|
2001-10-05 19:28:13 +02:00
|
|
|
while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
|
|
|
/*
|
2016-05-31 21:54:46 +02:00
|
|
|
* Write out the table and function stats for this DB into the
|
|
|
|
* appropriate per-DB stat file, if required.
|
2008-05-15 02:17:41 +02:00
|
|
|
*/
|
2013-02-18 21:56:08 +01:00
|
|
|
if (allDbs || pgstat_db_requested(dbentry->databaseid))
|
2008-05-15 02:17:41 +02:00
|
|
|
{
|
2016-05-31 21:54:46 +02:00
|
|
|
/* Make DB's timestamp consistent with the global stats */
|
2013-02-18 21:56:08 +01:00
|
|
|
dbentry->stats_timestamp = globalStats.stats_timestamp;
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
|
2013-02-18 21:56:08 +01:00
|
|
|
pgstat_write_db_statsfile(dbentry, permanent);
|
2008-05-15 02:17:41 +02:00
|
|
|
}
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2013-02-18 21:56:08 +01:00
|
|
|
* Write out the DB entry. We don't write the tables or functions
|
|
|
|
* pointers, since they're of no use to any other process.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2013-02-18 21:56:08 +01:00
|
|
|
fputc('D', fpout);
|
|
|
|
rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
|
|
|
|
(void) rc; /* we'll check for error with ferror */
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
2020-10-08 05:39:08 +02:00
|
|
|
/*
|
|
|
|
* Write replication slot stats struct
|
|
|
|
*/
|
|
|
|
for (i = 0; i < nReplSlotStats; i++)
|
|
|
|
{
|
|
|
|
fputc('R', fpout);
|
|
|
|
rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
|
|
|
|
(void) rc; /* we'll check for error with ferror */
|
|
|
|
}
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2001-10-25 07:50:21 +02:00
|
|
|
* No more output to be done. Close the temp file and replace the old
|
2006-01-18 21:35:06 +01:00
|
|
|
* pgstat.stat with it. The ferror() check replaces testing for error
|
|
|
|
* after each individual fputc or fwrite above.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
|
|
|
fputc('E', fpout);
|
2006-01-18 21:35:06 +01:00
|
|
|
|
|
|
|
if (ferror(fpout))
|
|
|
|
{
|
|
|
|
ereport(LOG,
|
|
|
|
(errcode_for_file_access(),
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
errmsg("could not write temporary statistics file \"%s\": %m",
|
|
|
|
tmpfile)));
|
2008-11-03 02:17:08 +01:00
|
|
|
FreeFile(fpout);
|
2008-08-05 14:09:30 +02:00
|
|
|
unlink(tmpfile);
|
2006-01-18 21:35:06 +01:00
|
|
|
}
|
2008-11-03 02:17:08 +01:00
|
|
|
else if (FreeFile(fpout) < 0)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2003-07-22 21:00:12 +02:00
|
|
|
ereport(LOG,
|
|
|
|
(errcode_for_file_access(),
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
errmsg("could not close temporary statistics file \"%s\": %m",
|
|
|
|
tmpfile)));
|
2008-08-05 14:09:30 +02:00
|
|
|
unlink(tmpfile);
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
2008-08-05 14:09:30 +02:00
|
|
|
else if (rename(tmpfile, statfile) < 0)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2006-01-18 21:35:06 +01:00
|
|
|
ereport(LOG,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
|
2008-08-05 14:09:30 +02:00
|
|
|
tmpfile, statfile)));
|
|
|
|
unlink(tmpfile);
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
2013-02-18 21:56:08 +01:00
|
|
|
|
|
|
|
if (permanent)
|
|
|
|
unlink(pgstat_stat_filename);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now throw away the list of requests. Note that requests sent after we
|
|
|
|
* started the write are still waiting on the network socket.
|
|
|
|
*/
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
list_free(pending_write_requests);
|
|
|
|
pending_write_requests = NIL;
|
2013-02-18 21:56:08 +01:00
|
|
|
}

/*
 * return the filename for a DB stat file; filename is the output buffer,
 * of length len.
 */
static void
get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
					char *filename, int len)
{
	int			printed;

	/* NB -- pgstat_reset_remove_files knows about the pattern this uses */
	printed = snprintf(filename, len, "%s/db_%u.%s",
					   permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
					   pgstat_stat_directory,
					   databaseid,
					   tempname ? "tmp" : "stat");
	if (printed >= len)
		elog(ERROR, "overlength pgstat path");
}

/* ----------
 * pgstat_write_db_statsfile() -
 *		Write the stat file for a single database.
 *
 *		If writing to the permanent file (happens when the collector is
 *		shutting down only), remove the temporary file so that backends
 *		starting up under a new postmaster can't read the old data before
 *		the new collector is ready.
 * ----------
 */
static void
pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
{
	HASH_SEQ_STATUS tstat;
	HASH_SEQ_STATUS fstat;
	PgStat_StatTabEntry *tabentry;
	PgStat_StatFuncEntry *funcentry;
	FILE	   *fpout;
	int32		format_id;
	Oid			dbid = dbentry->databaseid;
	int			rc;
	char		tmpfile[MAXPGPATH];
	char		statfile[MAXPGPATH];

	get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
	get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);

	elog(DEBUG2, "writing stats file \"%s\"", statfile);

	/*
	 * Open the statistics temp file to write out the current values.
	 */
	fpout = AllocateFile(tmpfile, PG_BINARY_W);
	if (fpout == NULL)
	{
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not open temporary statistics file \"%s\": %m",
						tmpfile)));
		return;
	}

	/*
	 * Write the file header --- currently just a format ID.
	 */
	format_id = PGSTAT_FILE_FORMAT_ID;
	rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
	(void) rc;					/* we'll check for error with ferror */

	/*
	 * Walk through the database's access stats per table.
	 */
	hash_seq_init(&tstat, dbentry->tables);
	while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
	{
		fputc('T', fpout);
		rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
		(void) rc;				/* we'll check for error with ferror */
	}

	/*
	 * Walk through the database's function stats table.
	 */
	hash_seq_init(&fstat, dbentry->functions);
	while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
	{
		fputc('F', fpout);
		rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
		(void) rc;				/* we'll check for error with ferror */
	}

	/*
	 * No more output to be done.  Close the temp file and replace the old
	 * pgstat.stat with it.  The ferror() check replaces testing for error
	 * after each individual fputc or fwrite above.
	 */
	fputc('E', fpout);

	if (ferror(fpout))
	{
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not write temporary statistics file \"%s\": %m",
						tmpfile)));
		FreeFile(fpout);
		unlink(tmpfile);
	}
	else if (FreeFile(fpout) < 0)
	{
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not close temporary statistics file \"%s\": %m",
						tmpfile)));
		unlink(tmpfile);
	}
	else if (rename(tmpfile, statfile) < 0)
	{
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
						tmpfile, statfile)));
		unlink(tmpfile);
	}

	if (permanent)
	{
		get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);

		elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
		unlink(statfile);
	}
}
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/* ----------
|
2013-02-18 21:56:08 +01:00
|
|
|
* pgstat_read_statsfiles() -
|
2001-06-22 21:18:36 +02:00
|
|
|
*
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
* Reads in some existing statistics collector files and returns the
|
|
|
|
* databases hash table that is the top level of the data.
|
|
|
|
*
|
|
|
|
* If 'onlydb' is not InvalidOid, it means we only want data for that DB
|
|
|
|
* plus the shared catalogs ("DB 0"). We'll still populate the DB hash
|
|
|
|
* table for all databases, but we don't bother even creating table/function
|
|
|
|
* hash tables for other databases.
|
2013-02-18 21:56:08 +01:00
|
|
|
*
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
* 'permanent' specifies reading from the permanent files not temporary ones.
|
|
|
|
* When true (happens only when the collector is starting up), remove the
|
|
|
|
* files after reading; the in-memory status is now authoritative, and the
|
|
|
|
* files would be out of date in case somebody else reads them.
|
|
|
|
*
|
|
|
|
* If a 'deep' read is requested, table/function stats are read, otherwise
|
2013-02-18 21:56:08 +01:00
|
|
|
* the table/function hash tables remain empty.
|
2001-06-22 21:18:36 +02:00
|
|
|
* ----------
|
|
|
|
*/
|
2007-02-08 00:11:30 +01:00
|
|
|
static HTAB *
|
2013-02-18 21:56:08 +01:00
|
|
|
pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2001-10-25 07:50:21 +02:00
|
|
|
PgStat_StatDBEntry *dbentry;
|
|
|
|
PgStat_StatDBEntry dbbuf;
|
|
|
|
HASHCTL hash_ctl;
|
2007-02-08 00:11:30 +01:00
|
|
|
HTAB *dbhash;
|
2001-10-25 07:50:21 +02:00
|
|
|
FILE *fpin;
|
2005-07-14 07:13:45 +02:00
|
|
|
int32 format_id;
|
2001-10-25 07:50:21 +02:00
|
|
|
bool found;
|
2009-06-11 16:49:15 +02:00
|
|
|
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
|
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
int i;
|
2001-10-25 07:50:21 +02:00
|
|
|
|
|
|
|
/*
|
2007-02-08 00:11:30 +01:00
|
|
|
* The tables will live in pgStatLocalContext.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2007-02-08 00:11:30 +01:00
|
|
|
pgstat_setup_memcxt();
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Create the DB hashtable
|
|
|
|
*/
|
2001-10-25 07:50:21 +02:00
|
|
|
hash_ctl.keysize = sizeof(Oid);
|
2001-10-01 07:36:17 +02:00
|
|
|
hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
|
2007-02-08 00:11:30 +01:00
|
|
|
hash_ctl.hcxt = pgStatLocalContext;
|
|
|
|
dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
|
Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
|
|
|
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2020-10-08 05:39:08 +02:00
|
|
|
/* Allocate the space for replication slot statistics */
|
|
|
|
replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
|
|
|
|
nReplSlotStats = 0;
|
|
|
|
|
2007-03-30 20:34:56 +02:00
|
|
|
/*
|
2020-10-02 03:17:11 +02:00
|
|
|
* Clear out global, archiver, WAL and SLRU statistics so they start from
|
|
|
|
* zero in case we can't load an existing statsfile.
|
2007-03-30 20:34:56 +02:00
|
|
|
*/
|
|
|
|
memset(&globalStats, 0, sizeof(globalStats));
|
2014-01-28 18:58:22 +01:00
|
|
|
memset(&archiverStats, 0, sizeof(archiverStats));
|
2020-10-02 03:17:11 +02:00
|
|
|
memset(&walStats, 0, sizeof(walStats));
|
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
memset(&slruStats, 0, sizeof(slruStats));
|
2007-03-30 20:34:56 +02:00
|
|
|
|
2011-02-10 15:09:35 +01:00
|
|
|
/*
|
|
|
|
* Set the current timestamp (will be kept only in case we can't load an
|
Fix stats collector to recover nicely when system clock goes backwards.
Formerly, if the system clock went backwards, the stats collector would
fail to update the stats file any more until the clock reading again
exceeds whatever timestamp was last written into the stats file. Such
glitches in the clock's behavior are not terribly unlikely on machines
not using NTP. Such a scenario has been observed to cause regression test
failures in the buildfarm, and it could have bad effects on the behavior
of autovacuum, so it seems prudent to install some defenses.
We could directly detect the clock going backwards by adding
GetCurrentTimestamp calls in the stats collector's main loop, but that
would hurt performance on platforms where GetCurrentTimestamp is expensive.
To minimize the performance hit in normal cases, adopt a more complicated
scheme wherein backends check for clock skew when reading the stats file,
and if they see it, signal the stats collector by sending an extra stats
inquiry message. The stats collector does an extra GetCurrentTimestamp
only when it receives an inquiry with an apparently out-of-order
timestamp.
To avoid unnecessary GetCurrentTimestamp calls, expand the inquiry messages
to carry the backend's current clock reading as well as its stats cutoff
time. The latter, being intentionally slightly in-the-past, would trigger
more clock rechecks than we need if it were used for this purpose.
We might want to backpatch this change at some point, but let's let it
shake out in the buildfarm for a while first.
2012-06-17 23:11:07 +02:00
|
|
|
* existing statsfile).
|
2011-02-10 15:09:35 +01:00
|
|
|
*/
|
|
|
|
globalStats.stat_reset_timestamp = GetCurrentTimestamp();
|
2014-01-28 18:58:22 +01:00
|
|
|
archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
|
2020-10-02 03:17:11 +02:00
|
|
|
walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
|
2011-02-10 15:09:35 +01:00
|
|
|
|
Collect statistics about SLRU caches
There are a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
/*
|
|
|
|
* Set the same reset timestamp for all SLRU items too.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
|
|
|
|
slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
|
|
|
|
|
2020-10-08 05:39:08 +02:00
|
|
|
/*
|
|
|
|
* Set the same reset timestamp for all replication slots too.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < max_replication_slots; i++)
|
|
|
|
replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2013-02-18 21:56:08 +01:00
|
|
|
* Try to open the stats file. If it doesn't exist, the backends simply
|
2005-10-15 04:49:52 +02:00
|
|
|
* return zero for anything and the collector simply starts from scratch
|
|
|
|
* with empty counters.
|
2010-03-12 23:19:19 +01:00
|
|
|
*
|
|
|
|
* ENOENT is a possibility if the stats collector is not running or has
|
|
|
|
* not yet written the stats file the first time. Any other failure
|
|
|
|
* condition is suspicious.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2008-08-05 14:09:30 +02:00
|
|
|
if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
|
2010-03-12 23:19:19 +01:00
|
|
|
{
|
|
|
|
if (errno != ENOENT)
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not open statistics file \"%s\": %m",
|
|
|
|
statfile)));
|
2007-02-08 00:11:30 +01:00
|
|
|
return dbhash;
|
2010-03-12 23:19:19 +01:00
|
|
|
}
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
/*
|
|
|
|
* Verify it's of the expected format.
|
|
|
|
*/
|
2013-02-18 21:56:08 +01:00
|
|
|
if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
|
|
|
|
format_id != PGSTAT_FILE_FORMAT_ID)
|
2005-07-14 07:13:45 +02:00
|
|
|
{
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
2010-03-12 23:19:19 +01:00
|
|
|
(errmsg("corrupted statistics file \"%s\"", statfile)));
|
2005-07-14 07:13:45 +02:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2007-03-30 20:34:56 +02:00
|
|
|
/*
|
|
|
|
* Read global stats struct
|
|
|
|
*/
|
|
|
|
if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
|
|
|
|
{
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
2010-03-12 23:19:19 +01:00
|
|
|
(errmsg("corrupted statistics file \"%s\"", statfile)));
|
Ignore old stats file timestamps when starting the stats collector.
The stats collector disregards inquiry messages that bear a cutoff_time
earlier than the time it last wrote the relevant stats file. That's fine, but at
startup when it reads the "permanent" stats files, it absorbed their
timestamps as if they were the times at which the corresponding temporary
stats files had been written. In reality, of course, there's no data
out there at all. This led to disregarding inquiry messages soon after
startup if the postmaster had been shut down and restarted within less
than PGSTAT_STAT_INTERVAL; which is a pretty common scenario, both for
testing and in the field. Requesting backends would hang for 10 seconds
and then report failure to read statistics, unless they got bailed out
by some other backend coming along and making a newer request within
that interval.
I came across this through investigating unexpected delays in the
src/test/recovery TAP tests: it manifests there because the autovacuum
launcher hangs for 10 seconds when it can't get statistics at startup,
thus preventing a second shutdown from occurring promptly. We might
want to do some things in the autovac code to make it less prone to
getting stuck that way, but this change is a good bug fix regardless.
In passing, also fix pgstat_read_statsfiles() to ensure that it
re-zeroes its global stats variables if they are corrupted by a
short read from the stats file. (Other reads in that function
go into temp variables, so that the issue doesn't arise.)
This has been broken since we created the separation between permanent
and temporary stats files in 8.4, so back-patch to all supported branches.
Discussion: https://postgr.es/m/16860.1498442626@sss.pgh.pa.us
2017-06-26 22:17:05 +02:00
|
|
|
memset(&globalStats, 0, sizeof(globalStats));
|
2007-03-30 20:34:56 +02:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2017-06-26 22:17:05 +02:00
|
|
|
/*
|
|
|
|
* In the collector, disregard the timestamp we read from the permanent
|
|
|
|
* stats file; we should be willing to write a temp stats file immediately
|
|
|
|
* upon the first request from any backend. This only matters if the old
|
|
|
|
* file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
|
|
|
|
* an unusual scenario.
|
|
|
|
*/
|
|
|
|
if (pgStatRunningInCollector)
|
|
|
|
globalStats.stats_timestamp = 0;
|
|
|
|
|
2014-01-28 18:58:22 +01:00
|
|
|
/*
|
|
|
|
* Read archiver stats struct
|
|
|
|
*/
|
|
|
|
if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
|
|
|
|
{
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
|
|
|
(errmsg("corrupted statistics file \"%s\"", statfile)));
|
2017-06-26 22:17:05 +02:00
|
|
|
memset(&archiverStats, 0, sizeof(archiverStats));
|
2014-01-28 18:58:22 +01:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2020-10-02 03:17:11 +02:00
|
|
|
/*
|
|
|
|
* Read WAL stats struct
|
|
|
|
*/
|
|
|
|
if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
|
|
|
|
{
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
|
|
|
(errmsg("corrupted statistics file \"%s\"", statfile)));
|
|
|
|
memset(&walStats, 0, sizeof(walStats));
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2020-04-02 02:11:38 +02:00
|
|
|
/*
|
|
|
|
* Read SLRU stats struct
|
|
|
|
*/
|
|
|
|
if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
|
|
|
|
{
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
|
|
|
(errmsg("corrupted statistics file \"%s\"", statfile)));
|
|
|
|
memset(&slruStats, 0, sizeof(slruStats));
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2001-10-25 07:50:21 +02:00
|
|
|
* We found an existing collector stats file. Read it and put all the
|
|
|
|
* hashtable entries into place.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
|
|
|
for (;;)
|
|
|
|
{
|
|
|
|
switch (fgetc(fpin))
|
|
|
|
{
|
2001-10-25 07:50:21 +02:00
|
|
|
/*
|
|
|
|
* 'D' A PgStat_StatDBEntry struct describing a database
|
2013-02-18 21:56:08 +01:00
|
|
|
* follows.
|
2001-10-25 07:50:21 +02:00
|
|
|
*/
|
2001-06-22 21:18:36 +02:00
|
|
|
case 'D':
|
2006-01-18 21:35:06 +01:00
|
|
|
if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
|
|
|
|
fpin) != offsetof(PgStat_StatDBEntry, tables))
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2003-07-22 21:00:12 +02:00
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
2010-03-12 23:19:19 +01:00
|
|
|
(errmsg("corrupted statistics file \"%s\"",
|
|
|
|
statfile)));
|
2004-10-28 03:38:41 +02:00
|
|
|
goto done;
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add to the DB hash
|
|
|
|
*/
|
2007-02-08 00:11:30 +01:00
|
|
|
dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
(void *) &dbbuf.databaseid,
|
2003-07-22 21:00:12 +02:00
|
|
|
HASH_ENTER,
|
|
|
|
&found);
|
2001-06-22 21:18:36 +02:00
|
|
|
if (found)
|
|
|
|
{
|
2003-07-22 21:00:12 +02:00
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
2010-03-12 23:19:19 +01:00
|
|
|
(errmsg("corrupted statistics file \"%s\"",
|
|
|
|
statfile)));
|
2004-10-28 03:38:41 +02:00
|
|
|
goto done;
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
|
2001-10-25 07:50:21 +02:00
|
|
|
dbentry->tables = NULL;
|
2008-05-15 02:17:41 +02:00
|
|
|
dbentry->functions = NULL;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2017-06-26 22:17:05 +02:00
|
|
|
/*
|
|
|
|
* In the collector, disregard the timestamp we read from the
|
|
|
|
* permanent stats file; we should be willing to write a temp
|
|
|
|
* stats file immediately upon the first request from any
|
|
|
|
* backend.
|
|
|
|
*/
|
|
|
|
if (pgStatRunningInCollector)
|
|
|
|
dbentry->stats_timestamp = 0;
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
* Don't create tables/functions hashtables for uninteresting
|
|
|
|
* databases.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2005-07-29 21:30:09 +02:00
|
|
|
if (onlydb != InvalidOid)
|
|
|
|
{
|
|
|
|
if (dbbuf.databaseid != onlydb &&
|
|
|
|
dbbuf.databaseid != InvalidOid)
|
2005-10-15 04:49:52 +02:00
|
|
|
break;
|
2005-07-29 21:30:09 +02:00
|
|
|
}
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2001-10-25 07:50:21 +02:00
|
|
|
hash_ctl.keysize = sizeof(Oid);
|
2001-10-01 07:36:17 +02:00
|
|
|
hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
|
2007-02-08 00:11:30 +01:00
|
|
|
hash_ctl.hcxt = pgStatLocalContext;
|
2004-10-28 03:38:41 +02:00
|
|
|
dbentry->tables = hash_create("Per-database table",
|
|
|
|
PGSTAT_TAB_HASH_SIZE,
|
|
|
|
&hash_ctl,
|
2017-06-21 21:35:54 +02:00
|
|
|
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2008-05-15 02:17:41 +02:00
|
|
|
hash_ctl.keysize = sizeof(Oid);
|
|
|
|
hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
|
|
|
|
hash_ctl.hcxt = pgStatLocalContext;
|
|
|
|
dbentry->functions = hash_create("Per-database function",
|
|
|
|
PGSTAT_FUNCTION_HASH_SIZE,
|
|
|
|
&hash_ctl,
|
2017-06-21 21:35:54 +02:00
|
|
|
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
|
2009-06-11 16:49:15 +02:00
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/*
|
2013-02-18 21:56:08 +01:00
|
|
|
* If requested, read the data from the database-specific
|
2016-05-31 21:54:46 +02:00
|
|
|
* file. Otherwise we just leave the hashtables empty.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2013-02-18 21:56:08 +01:00
|
|
|
if (deep)
|
|
|
|
pgstat_read_db_statsfile(dbentry->databaseid,
|
|
|
|
dbentry->tables,
|
|
|
|
dbentry->functions,
|
|
|
|
permanent);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
break;
|
|
|
|
|
2020-10-08 05:39:08 +02:00
|
|
|
/*
|
|
|
|
* 'R' A PgStat_ReplSlotStats struct describing a replication
|
|
|
|
* slot follows.
|
|
|
|
*/
|
|
|
|
case 'R':
|
|
|
|
if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
|
|
|
|
!= sizeof(PgStat_ReplSlotStats))
|
|
|
|
{
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
|
|
|
(errmsg("corrupted statistics file \"%s\"",
|
|
|
|
statfile)));
|
|
|
|
memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
nReplSlotStats++;
|
|
|
|
break;
|
|
|
|
|
2013-02-18 21:56:08 +01:00
|
|
|
case 'E':
|
|
|
|
goto done;
|
|
|
|
|
|
|
|
default:
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
|
|
|
(errmsg("corrupted statistics file \"%s\"",
|
|
|
|
statfile)));
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
done:
|
|
|
|
FreeFile(fpin);
|
|
|
|
|
|
|
|
/* If requested to read the permanent file, also get rid of it. */
|
|
|
|
if (permanent)
|
|
|
|
{
|
2014-12-11 21:41:15 +01:00
|
|
|
elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
|
2013-02-18 21:56:08 +01:00
|
|
|
unlink(statfile);
|
|
|
|
}
|
|
|
|
|
|
|
|
return dbhash;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/* ----------
|
|
|
|
* pgstat_read_db_statsfile() -
|
|
|
|
*
|
|
|
|
* Reads in the existing statistics collector file for the given database,
|
2016-05-31 21:54:46 +02:00
|
|
|
* filling the passed-in tables and functions hash tables.
|
2013-02-18 21:56:08 +01:00
|
|
|
*
|
2016-05-31 21:54:46 +02:00
|
|
|
* As in pgstat_read_statsfiles, if the permanent file is requested, it is
|
2013-02-18 21:56:08 +01:00
|
|
|
* removed after reading.
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
*
|
|
|
|
* Note: this code has the ability to skip storing per-table or per-function
|
|
|
|
* data, if NULL is passed for the corresponding hashtable. That's not used
|
|
|
|
* at the moment though.
|
2013-02-18 21:56:08 +01:00
|
|
|
* ----------
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
|
|
|
|
bool permanent)
|
|
|
|
{
|
|
|
|
PgStat_StatTabEntry *tabentry;
|
|
|
|
PgStat_StatTabEntry tabbuf;
|
|
|
|
PgStat_StatFuncEntry funcbuf;
|
|
|
|
PgStat_StatFuncEntry *funcentry;
|
|
|
|
FILE *fpin;
|
|
|
|
int32 format_id;
|
|
|
|
bool found;
|
|
|
|
char statfile[MAXPGPATH];
|
|
|
|
|
|
|
|
get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to open the stats file. If it doesn't exist, the backends simply
|
|
|
|
* return zero for anything and the collector simply starts from scratch
|
|
|
|
* with empty counters.
|
|
|
|
*
|
|
|
|
* ENOENT is a possibility if the stats collector is not running or has
|
|
|
|
* not yet written the stats file the first time. Any other failure
|
|
|
|
* condition is suspicious.
|
|
|
|
*/
|
|
|
|
if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
|
|
|
|
{
|
|
|
|
if (errno != ENOENT)
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not open statistics file \"%s\": %m",
|
|
|
|
statfile)));
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify it's of the expected format.
|
|
|
|
*/
|
|
|
|
if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
|
|
|
|
format_id != PGSTAT_FILE_FORMAT_ID)
|
|
|
|
{
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
|
|
|
(errmsg("corrupted statistics file \"%s\"", statfile)));
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We found an existing collector stats file. Read it and put all the
|
|
|
|
* hashtable entries into place.
|
|
|
|
*/
|
|
|
|
for (;;)
|
|
|
|
{
|
|
|
|
switch (fgetc(fpin))
|
|
|
|
{
|
2001-10-25 07:50:21 +02:00
|
|
|
/*
|
|
|
|
* 'T' A PgStat_StatTabEntry follows.
|
|
|
|
*/
|
2001-06-22 21:18:36 +02:00
|
|
|
case 'T':
|
2006-01-18 21:35:06 +01:00
|
|
|
if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
|
|
|
|
fpin) != sizeof(PgStat_StatTabEntry))
|
2001-06-22 21:18:36 +02:00
|
|
|
{
|
2003-07-22 21:00:12 +02:00
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
2010-03-12 23:19:19 +01:00
|
|
|
(errmsg("corrupted statistics file \"%s\"",
|
|
|
|
statfile)));
|
2004-10-28 03:38:41 +02:00
|
|
|
goto done;
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
* Skip if table data not wanted.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
|
|
|
if (tabhash == NULL)
|
|
|
|
break;
|
|
|
|
|
2001-10-01 07:36:17 +02:00
|
|
|
tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
(void *) &tabbuf.tableid,
|
|
|
|
HASH_ENTER, &found);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
if (found)
|
|
|
|
{
|
2003-07-22 21:00:12 +02:00
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
2010-03-12 23:19:19 +01:00
|
|
|
(errmsg("corrupted statistics file \"%s\"",
|
|
|
|
statfile)));
|
2004-10-28 03:38:41 +02:00
|
|
|
goto done;
|
2001-06-22 21:18:36 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
memcpy(tabentry, &tabbuf, sizeof(tabbuf));
|
|
|
|
break;
|
|
|
|
|
2008-05-15 02:17:41 +02:00
|
|
|
/*
|
|
|
|
* 'F' A PgStat_StatFuncEntry follows.
|
|
|
|
*/
|
|
|
|
case 'F':
|
|
|
|
if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
|
|
|
|
fpin) != sizeof(PgStat_StatFuncEntry))
|
|
|
|
{
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
2010-03-12 23:19:19 +01:00
|
|
|
(errmsg("corrupted statistics file \"%s\"",
|
|
|
|
statfile)));
|
2008-05-15 02:17:41 +02:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
* Skip if function data not wanted.
|
2008-05-15 02:17:41 +02:00
|
|
|
*/
|
|
|
|
if (funchash == NULL)
|
|
|
|
break;
|
|
|
|
|
|
|
|
funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
(void *) &funcbuf.functionid,
|
|
|
|
HASH_ENTER, &found);
|
2008-05-15 02:17:41 +02:00
|
|
|
|
|
|
|
if (found)
|
|
|
|
{
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
2010-03-12 23:19:19 +01:00
|
|
|
(errmsg("corrupted statistics file \"%s\"",
|
|
|
|
statfile)));
|
2008-05-15 02:17:41 +02:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
memcpy(funcentry, &funcbuf, sizeof(funcbuf));
|
|
|
|
break;
|
|
|
|
|
2001-10-25 07:50:21 +02:00
|
|
|
/*
|
2006-06-19 03:51:22 +02:00
|
|
|
* 'E' The EOF marker of a complete stats file.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2006-06-19 03:51:22 +02:00
|
|
|
case 'E':
|
|
|
|
goto done;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2006-06-19 03:51:22 +02:00
|
|
|
default:
|
|
|
|
ereport(pgStatRunningInCollector ? LOG : WARNING,
|
2010-03-12 23:19:19 +01:00
|
|
|
(errmsg("corrupted statistics file \"%s\"",
|
|
|
|
statfile)));
|
2006-06-19 03:51:22 +02:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
2001-10-25 07:50:21 +02:00
|
|
|
|
2006-06-19 03:51:22 +02:00
|
|
|
done:
|
|
|
|
FreeFile(fpin);
|
2007-02-08 00:11:30 +01:00
|
|
|
|
2008-08-05 14:09:30 +02:00
|
|
|
if (permanent)
|
2013-02-18 21:56:08 +01:00
|
|
|
{
|
2014-12-11 21:41:15 +01:00
|
|
|
elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
|
2013-02-18 21:56:08 +01:00
|
|
|
unlink(statfile);
|
|
|
|
}
|
2006-06-19 03:51:22 +02:00
|
|
|
}
|
2001-06-22 21:18:36 +02:00
|
|
|
|
/* ----------
 * pgstat_read_db_statsfile_timestamp() -
 *
 *	Attempt to determine the timestamp of the last db statfile write.
 *	Returns true if successful; the timestamp is stored in *ts.  The caller
 *	may rely on the timestamp stored in *ts only if the function returns true.
 *
 *	This needs to be careful about handling databases for which no stats file
 *	exists, such as databases without a stat entry or those not yet written:
 *
 *	- if there's a database entry in the global file, return the corresponding
 *	stats_timestamp value.
 *
 *	- if there's no db stat entry (e.g. for a new or inactive database),
 *	there's no stats_timestamp value, but also nothing to write so we return
 *	the timestamp of the global statfile.
 * ----------
 */
static bool
pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
								   TimestampTz *ts)
{
	PgStat_StatDBEntry dbentry;
	PgStat_GlobalStats myGlobalStats;
	PgStat_ArchiverStats myArchiverStats;
	PgStat_WalStats myWalStats;
	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
	PgStat_ReplSlotStats myReplSlotStats;
	FILE	   *fpin;
	int32		format_id;
	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;

	/*
	 * Try to open the stats file.  As above, anything but ENOENT is worthy of
	 * complaining about.
	 */
	if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
	{
		if (errno != ENOENT)
			ereport(pgStatRunningInCollector ? LOG : WARNING,
					(errcode_for_file_access(),
					 errmsg("could not open statistics file \"%s\": %m",
							statfile)));
		return false;
	}

	/*
	 * Verify it's of the expected format.
	 */
	if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
		format_id != PGSTAT_FILE_FORMAT_ID)
	{
		ereport(pgStatRunningInCollector ? LOG : WARNING,
				(errmsg("corrupted statistics file \"%s\"", statfile)));
		FreeFile(fpin);
		return false;
	}

	/*
	 * Read global stats struct
	 */
	if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
			  fpin) != sizeof(myGlobalStats))
	{
		ereport(pgStatRunningInCollector ? LOG : WARNING,
				(errmsg("corrupted statistics file \"%s\"", statfile)));
		FreeFile(fpin);
		return false;
	}

	/*
	 * Read archiver stats struct
	 */
	if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
			  fpin) != sizeof(myArchiverStats))
	{
		ereport(pgStatRunningInCollector ? LOG : WARNING,
				(errmsg("corrupted statistics file \"%s\"", statfile)));
		FreeFile(fpin);
		return false;
	}

	/*
	 * Read WAL stats struct
	 */
	if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
	{
		ereport(pgStatRunningInCollector ? LOG : WARNING,
				(errmsg("corrupted statistics file \"%s\"", statfile)));
		FreeFile(fpin);
		return false;
	}

	/*
	 * Read SLRU stats struct
	 */
	if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
	{
		ereport(pgStatRunningInCollector ? LOG : WARNING,
				(errmsg("corrupted statistics file \"%s\"", statfile)));
		FreeFile(fpin);
		return false;
	}

	/* By default, we're going to return the timestamp of the global file. */
	*ts = myGlobalStats.stats_timestamp;

	/*
	 * We found an existing collector stats file. Read it and look for a
	 * record for the requested database. If found, use its timestamp.
	 */
	for (;;)
	{
		switch (fgetc(fpin))
		{
				/*
				 * 'D'	A PgStat_StatDBEntry struct describing a database
				 * follows.
				 */
			case 'D':
				if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
						  fpin) != offsetof(PgStat_StatDBEntry, tables))
				{
					ereport(pgStatRunningInCollector ? LOG : WARNING,
							(errmsg("corrupted statistics file \"%s\"",
									statfile)));
					FreeFile(fpin);
					return false;
				}

				/*
				 * If this is the DB we're looking for, save its timestamp and
				 * we're done.
				 */
				if (dbentry.databaseid == databaseid)
				{
					*ts = dbentry.stats_timestamp;
					goto done;
				}

				break;

				/*
				 * 'R'	A PgStat_ReplSlotStats struct describing a replication
				 * slot follows.
				 */
			case 'R':
				if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
					!= sizeof(PgStat_ReplSlotStats))
				{
					ereport(pgStatRunningInCollector ? LOG : WARNING,
							(errmsg("corrupted statistics file \"%s\"",
									statfile)));
					FreeFile(fpin);
					return false;
				}
				break;

			case 'E':
				goto done;

			default:
				{
					ereport(pgStatRunningInCollector ? LOG : WARNING,
							(errmsg("corrupted statistics file \"%s\"",
									statfile)));
					FreeFile(fpin);
					return false;
				}
		}
	}

done:
	FreeFile(fpin);
	return true;
}

2004-07-01 02:52:04 +02:00
|
|
|
/*
|
2007-02-08 00:11:30 +01:00
|
|
|
* If not already done, read the statistics collector stats file into
|
|
|
|
* some hash tables. The results will be kept until pgstat_clear_snapshot()
|
|
|
|
* is called (typically, at end of transaction).
|
2004-07-01 02:52:04 +02:00
|
|
|
*/
|
|
|
|
static void
|
|
|
|
backend_read_statsfile(void)
|
|
|
|
{
|
Fix stats collector to recover nicely when system clock goes backwards.
Formerly, if the system clock went backwards, the stats collector would
fail to update the stats file any more until the clock reading again
exceeds whatever timestamp was last written into the stats file. Such
glitches in the clock's behavior are not terribly unlikely on machines
not using NTP. Such a scenario has been observed to cause regression test
failures in the buildfarm, and it could have bad effects on the behavior
of autovacuum, so it seems prudent to install some defenses.
We could directly detect the clock going backwards by adding
GetCurrentTimestamp calls in the stats collector's main loop, but that
would hurt performance on platforms where GetCurrentTimestamp is expensive.
To minimize the performance hit in normal cases, adopt a more complicated
scheme wherein backends check for clock skew when reading the stats file,
and if they see it, signal the stats collector by sending an extra stats
inquiry message. The stats collector does an extra GetCurrentTimestamp
only when it receives an inquiry with an apparently out-of-order
timestamp.
To avoid unnecessary GetCurrentTimestamp calls, expand the inquiry messages
to carry the backend's current clock reading as well as its stats cutoff
time. The latter, being intentionally slightly in-the-past, would trigger
more clock rechecks than we need if it were used for this purpose.
We might want to backpatch this change at some point, but let's let it
shake out in the buildfarm for awhile first.
2012-06-17 23:11:07 +02:00
|
|
|
TimestampTz min_ts = 0;
|
|
|
|
TimestampTz ref_ts = 0;
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
Oid inquiry_db;
|
2008-11-03 02:17:08 +01:00
|
|
|
int count;
|
|
|
|
|
2007-02-08 00:11:30 +01:00
|
|
|
/* already read it? */
|
|
|
|
if (pgStatDBHash)
|
|
|
|
return;
|
|
|
|
Assert(!pgStatRunningInCollector);
|
|
|
|
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
	/*
	 * In a normal backend, we check staleness of the data for our own DB, and
	 * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
	 * we check staleness of the shared-catalog data, and send InvalidOid in
	 * inquiry messages so as not to force writing unnecessary data.
	 */
	if (IsAutoVacuumLauncherProcess())
		inquiry_db = InvalidOid;
	else
		inquiry_db = MyDatabaseId;

	/*
	 * Loop until fresh enough stats file is available or we ran out of time.
	 * The stats inquiry message is sent repeatedly in case collector drops
	 * it; but not every single time, as that just swamps the collector.
	 */
	for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
	{
Fix stats collector to recover nicely when system clock goes backwards.
Formerly, if the system clock went backwards, the stats collector would
fail to update the stats file any more until the clock reading again
exceeds whatever timestamp was last written into the stats file. Such
glitches in the clock's behavior are not terribly unlikely on machines
not using NTP. Such a scenario has been observed to cause regression test
failures in the buildfarm, and it could have bad effects on the behavior
of autovacuum, so it seems prudent to install some defenses.
We could directly detect the clock going backwards by adding
GetCurrentTimestamp calls in the stats collector's main loop, but that
would hurt performance on platforms where GetCurrentTimestamp is expensive.
To minimize the performance hit in normal cases, adopt a more complicated
scheme wherein backends check for clock skew when reading the stats file,
and if they see it, signal the stats collector by sending an extra stats
inquiry message. The stats collector does an extra GetCurrentTimestamp
only when it receives an inquiry with an apparently out-of-order
timestamp.
To avoid unnecessary GetCurrentTimestamp calls, expand the inquiry messages
to carry the backend's current clock reading as well as its stats cutoff
time. The latter, being intentionally slightly in-the-past, would trigger
more clock rechecks than we need if it were used for this purpose.
We might want to backpatch this change at some point, but let's let it
shake out in the buildfarm for awhile first.
2012-06-17 23:11:07 +02:00
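The backend-side skew check this commit message describes can be sketched as a small predicate. This is an illustrative helper, not code from pgstat.c: `clock_skew_detected` is a hypothetical name, and timestamps are `int64_t` microseconds standing in for `TimestampTz`. A stats file timestamp newer than our own clock reading means the system clock went backwards or the collector's clock is ahead; we still accept the file, but the caller should send an extra inquiry so the collector can sanity-check its time, and a difference of 1000 msec or more is worth logging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 1000 msec expressed in microseconds, matching TimestampTz resolution */
#define LARGE_SKEW_USECS (1000 * 1000L)

/*
 * Returns true when file_ts is ahead of our clock (send an extra inquiry);
 * *worth_logging is set when the difference is large enough to report.
 */
static bool
clock_skew_detected(int64_t file_ts, int64_t cur_ts, bool *worth_logging)
{
	*worth_logging = (file_ts >= cur_ts + LARGE_SKEW_USECS);
	return file_ts > cur_ts;
}
```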
		bool		ok;
		TimestampTz file_ts = 0;
		TimestampTz cur_ts;

		CHECK_FOR_INTERRUPTS();

		ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);

		cur_ts = GetCurrentTimestamp();
		/* Calculate min acceptable timestamp, if we didn't already */
		if (count == 0 || cur_ts < ref_ts)
		{
			/*
			 * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
			 * msec before now.  This indirectly ensures that the collector
			 * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
			 * an autovacuum worker, however, we want a lower delay to avoid
			 * using stale data, so we use PGSTAT_RETRY_DELAY (since the
			 * number of workers is low, this shouldn't be a problem).
			 *
			 * We don't recompute min_ts after sleeping, except in the
			 * unlikely case that cur_ts went backwards.  So we might end up
			 * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
			 * practice that shouldn't happen, though, as long as the sleep
			 * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
			 * tell the collector that our cutoff time is less than what we'd
			 * actually accept.
			 */
			ref_ts = cur_ts;
			if (IsAutoVacuumWorkerProcess())
				min_ts = TimestampTzPlusMilliseconds(ref_ts,
													 -PGSTAT_RETRY_DELAY);
			else
				min_ts = TimestampTzPlusMilliseconds(ref_ts,
													 -PGSTAT_STAT_INTERVAL);
		}

		/*
		 * If the file timestamp is actually newer than cur_ts, we must have
		 * had a clock glitch (system time went backwards) or there is clock
		 * skew between our processor and the stats collector's processor.
		 * Accept the file, but send an inquiry message anyway to make
		 * pgstat_recv_inquiry do a sanity check on the collector's time.
		 */
		if (ok && file_ts > cur_ts)
		{
			/*
			 * A small amount of clock skew between processors isn't terribly
			 * surprising, but a large difference is worth logging.  We
			 * arbitrarily define "large" as 1000 msec.
			 */
			if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
			{
				char	   *filetime;
				char	   *mytime;

				/* Copy because timestamptz_to_str returns a static buffer */
				filetime = pstrdup(timestamptz_to_str(file_ts));
				mytime = pstrdup(timestamptz_to_str(cur_ts));
				ereport(LOG,
						(errmsg("statistics collector's time %s is later than backend local time %s",
								filetime, mytime)));
				pfree(filetime);
				pfree(mytime);
			}

			pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
			break;
		}

		/* Normal acceptance case: file is not older than cutoff time */
		if (ok && file_ts >= min_ts)
			break;

		/* Not there or too old, so kick the collector and wait a bit */
		if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
			pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);

		pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
	}

	if (count >= PGSTAT_POLL_LOOP_COUNT)
		ereport(LOG,
				(errmsg("using stale statistics instead of current ones "
						"because stats collector is not responding")));

	/*
	 * Autovacuum launcher wants stats about all databases, but a shallow read
	 * is sufficient.  Regular backends want a deep read for just the tables
	 * they can see (MyDatabaseId + shared catalogs).
	 */
	if (IsAutoVacuumLauncherProcess())
		pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
	else
		pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
}


/* ----------
 * pgstat_setup_memcxt() -
 *
 *	Create pgStatLocalContext, if not already done.
 * ----------
 */
static void
pgstat_setup_memcxt(void)
{
	if (!pgStatLocalContext)
		pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
												   "Statistics snapshot",
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
												   ALLOCSET_SMALL_SIZES);
}


/* ----------
 * pgstat_clear_snapshot() -
 *
 *	Discard any data collected in the current transaction.  Any subsequent
 *	request will cause new snapshots to be read.
 *
 *	This is also invoked during transaction commit or abort to discard
 *	the no-longer-wanted snapshot.
 * ----------
 */
void
pgstat_clear_snapshot(void)
{
	/* Release memory, if any was allocated */
	if (pgStatLocalContext)
		MemoryContextDelete(pgStatLocalContext);

	/* Reset variables */
	pgStatLocalContext = NULL;
	pgStatDBHash = NULL;
	localBackendStatusTable = NULL;
	localNumBackends = 0;
}

|
|
|
/* ----------
|
|
|
|
* pgstat_recv_inquiry() -
|
|
|
|
*
|
|
|
|
* Process stat inquiry requests.
|
|
|
|
* ----------
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
|
|
|
|
{
|
2013-02-18 21:56:08 +01:00
|
|
|
PgStat_StatDBEntry *dbentry;
|
|
|
|
|
2014-12-11 21:41:15 +01:00
|
|
|
elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
|
2013-02-18 21:56:08 +01:00
|
|
|
|
|
|
|
/*
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
* If there's already a write request for this DB, there's nothing to do.
|
2013-02-21 15:46:46 +01:00
|
|
|
*
|
|
|
|
* Note that if a request is found, we return early and skip the below
|
2013-05-29 22:58:43 +02:00
|
|
|
* check for clock skew. This is okay, since the only way for a DB
|
|
|
|
* request to be present in the list is that we have been here since the
|
Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2e8cafc, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 21:54:46 +02:00
|
|
|
* last write round. It seems sufficient to check for clock skew once per
|
|
|
|
* write round.
|
2013-02-18 21:56:08 +01:00
|
|
|
*/
|
2016-05-31 21:54:46 +02:00
|
|
|
if (list_member_oid(pending_write_requests, msg->databaseid))
|
2013-02-18 21:56:08 +01:00
|
|
|
return;
|
Fix stats collector to recover nicely when system clock goes backwards.
Formerly, if the system clock went backwards, the stats collector would
fail to update the stats file any more until the clock reading again
exceeds whatever timestamp was last written into the stats file. Such
glitches in the clock's behavior are not terribly unlikely on machines
not using NTP. Such a scenario has been observed to cause regression test
failures in the buildfarm, and it could have bad effects on the behavior
of autovacuum, so it seems prudent to install some defenses.
We could directly detect the clock going backwards by adding
GetCurrentTimestamp calls in the stats collector's main loop, but that
would hurt performance on platforms where GetCurrentTimestamp is expensive.
To minimize the performance hit in normal cases, adopt a more complicated
scheme wherein backends check for clock skew when reading the stats file,
and if they see it, signal the stats collector by sending an extra stats
inquiry message. The stats collector does an extra GetCurrentTimestamp
only when it receives an inquiry with an apparently out-of-order
timestamp.
To avoid unnecessary GetCurrentTimestamp calls, expand the inquiry messages
to carry the backend's current clock reading as well as its stats cutoff
time. The latter, being intentionally slightly in-the-past, would trigger
more clock rechecks than we need if it were used for this purpose.
We might want to backpatch this change at some point, but let's let it
shake out in the buildfarm for awhile first.
2012-06-17 23:11:07 +02:00
|
|
|
|
|
|
|
/*
|
2016-05-31 21:54:46 +02:00
|
|
|
* Check to see if we last wrote this database at a time >= the requested
|
|
|
|
* cutoff time. If so, this is a stale request that was generated before
|
|
|
|
* we updated the DB file, and we don't need to do so again.
|
|
|
|
*
|
2013-02-18 21:56:08 +01:00
|
|
|
* If the requestor's local clock time is older than stats_timestamp, we
|
2012-06-17 23:11:07 +02:00
|
|
|
* should suspect a clock glitch, ie system time going backwards; though
|
|
|
|
* the more likely explanation is just delayed message receipt. It is
|
|
|
|
* worth expending a GetCurrentTimestamp call to be sure, since a large
|
|
|
|
* retreat in the system clock reading could otherwise cause us to neglect
|
|
|
|
* to update the stats file for a long time.
|
|
|
|
*/
|
2013-02-18 21:56:08 +01:00
|
|
|
dbentry = pgstat_get_db_entry(msg->databaseid, false);
|
2016-05-31 21:54:46 +02:00
|
|
|
if (dbentry == NULL)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We have no data for this DB. Enter a write request anyway so that
|
|
|
|
* the global stats will get updated. This is needed to prevent
|
|
|
|
* backend_read_statsfile from waiting for data that we cannot supply,
|
|
|
|
* in the case of a new DB that nobody has yet reported any stats for.
|
|
|
|
* See the behavior of pgstat_read_db_statsfile_timestamp.
|
|
|
|
*/
|
|
|
|
}
|
|
|
|
else if (msg->clock_time < dbentry->stats_timestamp)
|
2012-06-17 23:11:07 +02:00
|
|
|
{
|
|
|
|
TimestampTz cur_ts = GetCurrentTimestamp();
|
|
|
|
|
2013-02-18 21:56:08 +01:00
|
|
|
if (cur_ts < dbentry->stats_timestamp)
|
2012-06-17 23:11:07 +02:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Sure enough, time went backwards. Force a new stats file write
|
|
|
|
* to get back in sync; but first, log a complaint.
|
|
|
|
*/
|
|
|
|
char *writetime;
|
|
|
|
char *mytime;
|
|
|
|
|
|
|
|
/* Copy because timestamptz_to_str returns a static buffer */
|
2013-02-18 21:56:08 +01:00
|
|
|
writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
|
2012-06-17 23:11:07 +02:00
|
|
|
mytime = pstrdup(timestamptz_to_str(cur_ts));
|
2020-12-04 14:25:23 +01:00
|
|
|
ereport(LOG,
|
|
|
|
(errmsg("stats_timestamp %s is later than collector's time %s for database %u",
|
|
|
|
writetime, mytime, dbentry->databaseid)));
|
2012-06-17 23:11:07 +02:00
|
|
|
pfree(writetime);
|
|
|
|
pfree(mytime);
|
2016-05-31 21:54:46 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Nope, it's just an old request. Assuming msg's clock_time is
|
|
|
|
* >= its cutoff_time, it must be stale, so we can ignore it.
|
|
|
|
*/
|
|
|
|
return;
|
2012-06-17 23:11:07 +02:00
|
|
|
}
|
|
|
|
}
|
2016-05-31 21:54:46 +02:00
|
|
|
else if (msg->cutoff_time <= dbentry->stats_timestamp)
|
|
|
|
{
|
|
|
|
/* Stale request, ignore it */
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to write this DB, so create a request.
|
|
|
|
*/
|
|
|
|
pending_write_requests = lappend_oid(pending_write_requests,
|
|
|
|
msg->databaseid);
|
2008-11-03 02:17:08 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2001-06-22 21:18:36 +02:00
|
|
|
/* ----------
|
|
|
|
* pgstat_recv_tabstat() -
|
|
|
|
*
|
|
|
|
* Count what the backend has done.
|
|
|
|
* ----------
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
|
|
|
|
{
|
2001-10-25 07:50:21 +02:00
|
|
|
PgStat_StatDBEntry *dbentry;
|
|
|
|
PgStat_StatTabEntry *tabentry;
|
|
|
|
int i;
|
|
|
|
bool found;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
2005-07-29 21:30:09 +02:00
|
|
|
dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/*
|
2006-04-06 22:38:00 +02:00
|
|
|
* Update database-wide stats.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2001-10-25 07:50:21 +02:00
|
|
|
dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
|
|
|
|
dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
|
2012-04-30 00:13:33 +02:00
|
|
|
dbentry->n_block_read_time += msg->m_block_read_time;
|
|
|
|
dbentry->n_block_write_time += msg->m_block_write_time;
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Process all table entries in the message.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < msg->m_nentries; i++)
|
|
|
|
{
|
Revise pgstat's tracking of tuple changes to improve the reliability of
decisions about when to auto-analyze.
The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
where all three of these numbers could be bad estimates from ANALYZE itself.
Even worse, in the presence of a steady flow of HOT updates and matching
HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
three numbers are exactly right, because n_dead_tuples could hold steady.
To fix, replace last_anl_tuples with an accurately tracked count of the total
number of committed tuple inserts + updates + deletes since the last ANALYZE
on the table. This can still be compared to the same threshold as before, but
it's much more trustworthy than the old computation. Tracking this requires
one more intra-transaction counter per modified table within backends, but no
additional memory space in the stats collector. There probably isn't any
measurable speed difference; if anything it might be a bit faster than before,
since I was able to eliminate some per-tuple arithmetic operations in favor of
adding sums once per (sub)transaction.
Also, simplify the logic around pgstat vacuum and analyze reporting messages
by not trying to fold VACUUM ANALYZE into a single pgstat message.
The original thought behind this patch was to allow scheduling of analyzes
on parent tables by artificially inflating their changes_since_analyze count.
I've left that for a separate patch since this change seems to stand on its
own merit.
2009-12-30 21:32:14 +01:00
|
|
|
PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
|
|
|
|
|
2001-10-01 07:36:17 +02:00
|
|
|
tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
(void *) &(tabmsg->t_id),
|
2005-10-15 04:49:52 +02:00
|
|
|
HASH_ENTER, &found);
|
2001-06-22 21:18:36 +02:00
|
|
|
|
|
|
|
if (!found)
|
|
|
|
{
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* If it's a new table entry, initialize counters to the values we
|
|
|
|
* just got.
|
2001-06-22 21:18:36 +02:00
|
|
|
*/
|
2009-12-30 21:32:14 +01:00
			tabentry->numscans = tabmsg->t_counts.t_numscans;
			tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
			tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
			tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
			tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
			tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
			tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
			tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
			tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
			tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
Trigger autovacuum based on number of INSERTs
Traditionally autovacuum has only ever invoked a worker based on the
estimated number of dead tuples in a table and for anti-wraparound
purposes. For the latter, with certain classes of tables such as
insert-only tables, anti-wraparound vacuums could be the first vacuum that
the table ever receives. This could often lead to autovacuum workers being
busy for extended periods of time due to having to potentially freeze
every page in the table. This could be particularly bad for very large
tables. New clusters, or recently pg_restored clusters could suffer even
more as many large tables may have the same relfrozenxid, which could
result in large numbers of tables requiring an anti-wraparound vacuum all
at once.
Here we aim to reduce the work required by anti-wraparound and aggressive
vacuums in general, by triggering autovacuum when the table has received
enough INSERTs. This is controlled by adding two new GUCs and reloptions;
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor. These work exactly the same as the
existing scale factor and threshold controls, only base themselves off the
number of inserts since the last vacuum, rather than the number of dead
tuples. New controls were added rather than reusing the existing
controls, to allow these new vacuums to be tuned independently and perhaps
even completely disabled altogether, which can be done by setting
autovacuum_vacuum_insert_threshold to -1.
We make no attempt to skip index cleanup operations on these vacuums as
they may trigger for an insert-mostly table which continually doesn't have
enough dead tuples to trigger an autovacuum for the purpose of removing
those dead tuples. If we were to skip cleaning the indexes in this case,
then it is possible for the index(es) to become bloated over time.
There are additional benefits to triggering autovacuums based on inserts,
as tables which never contain enough dead tuples to trigger an autovacuum
are now more likely to receive a vacuum, which can mark more of the table
as "allvisible" and encourage the query planner to make use of Index Only
Scans.
Currently, we still obey vacuum_freeze_min_age when triggering these new
autovacuums based on INSERTs. For large insert-only tables, it may be
beneficial to lower the table's autovacuum_freeze_min_age so that tuples
are eligible to be frozen sooner. Here we've opted not to zero that for
these types of vacuums, since the table may just be insert-mostly and we
may otherwise freeze tuples that are still destined to be updated or
removed in the near future.
There was some debate to what exactly the new scale factor and threshold
should default to. For now, these are set to 0.2 and 1000, respectively.
There may be some motivation to adjust these before the release.
Author: Laurenz Albe, Darafei Praliaskouski
Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
2020-03-28 07:20:12 +01:00
			tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
			tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
			tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;

			tabentry->vacuum_timestamp = 0;
			tabentry->vacuum_count = 0;
			tabentry->autovac_vacuum_timestamp = 0;
			tabentry->autovac_vacuum_count = 0;
			tabentry->analyze_timestamp = 0;
			tabentry->analyze_count = 0;
			tabentry->autovac_analyze_timestamp = 0;
			tabentry->autovac_analyze_count = 0;
		}
		else
		{
			/*
			 * Otherwise add the values to the existing entry.
			 */
			tabentry->numscans += tabmsg->t_counts.t_numscans;
			tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
			tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
			tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
			tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
			tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
			tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;

			/* If table was truncated, first reset the live/dead counters */
			if (tabmsg->t_counts.t_truncated)
			{
				tabentry->n_live_tuples = 0;
				tabentry->n_dead_tuples = 0;
				tabentry->inserts_since_vacuum = 0;
			}
			tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
			tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
			tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
			tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
			tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
			tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
		}

		/* Clamp n_live_tuples in case of negative delta_live_tuples */
		tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
		/* Likewise for n_dead_tuples */
		tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);

		/*
		 * Add per-table stats to the per-database entry, too.
		 */
		dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
		dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
		dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
		dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
		dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
		dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
		dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
	}
}

/* ----------
 * pgstat_recv_tabpurge() -
 *
 *	Arrange for dead table removal.
 * ----------
 */
static void
pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
{
	PgStat_StatDBEntry *dbentry;
	int			i;

	dbentry = pgstat_get_db_entry(msg->m_databaseid, false);

	/*
	 * No need to purge if we don't even know the database.
	 */
	if (!dbentry || !dbentry->tables)
		return;

	/*
	 * Process all table entries in the message.
	 */
	for (i = 0; i < msg->m_nentries; i++)
	{
		/* Remove from hashtable if present; we don't care if it's not. */
		(void) hash_search(dbentry->tables,
						   (void *) &(msg->m_tableid[i]),
						   HASH_REMOVE, NULL);
	}
}

/* ----------
 * pgstat_recv_dropdb() -
 *
 *	Arrange for dead database removal
 * ----------
 */
static void
pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
{
	Oid			dbid = msg->m_databaseid;
	PgStat_StatDBEntry *dbentry;

	/*
	 * Lookup the database in the hashtable.
	 */
	dbentry = pgstat_get_db_entry(dbid, false);

	/*
	 * If found, remove it (along with the db statfile).
	 */
	if (dbentry)
	{
		char		statfile[MAXPGPATH];

		get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);

		elog(DEBUG2, "removing stats file \"%s\"", statfile);
		unlink(statfile);

		if (dbentry->tables != NULL)
			hash_destroy(dbentry->tables);
		if (dbentry->functions != NULL)
			hash_destroy(dbentry->functions);

		if (hash_search(pgStatDBHash,
						(void *) &dbid,
						HASH_REMOVE, NULL) == NULL)
			ereport(ERROR,
					(errmsg("database hash table corrupted during cleanup --- abort")));
	}
}

/* ----------
 * pgstat_recv_resetcounter() -
 *
 *	Reset the statistics for the specified database.
 * ----------
 */
static void
pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
{
	PgStat_StatDBEntry *dbentry;

	/*
	 * Lookup the database in the hashtable.  Nothing to do if not there.
	 */
	dbentry = pgstat_get_db_entry(msg->m_databaseid, false);

	if (!dbentry)
		return;

	/*
	 * We simply throw away all the database's table entries by recreating a
	 * new hash table for them.
	 */
	if (dbentry->tables != NULL)
		hash_destroy(dbentry->tables);
	if (dbentry->functions != NULL)
		hash_destroy(dbentry->functions);

	dbentry->tables = NULL;
	dbentry->functions = NULL;

	/*
	 * Reset database-level stats, too.  This creates empty hash tables for
	 * tables and functions.
	 */
	reset_dbentry_counters(dbentry);
}

/* ----------
 * pgstat_recv_resetsharedcounter() -
 *
 *	Reset some shared statistics of the cluster.
 * ----------
 */
static void
pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
	if (msg->m_resettarget == RESET_BGWRITER)
	{
		/* Reset the global background writer statistics for the cluster. */
		memset(&globalStats, 0, sizeof(globalStats));
		globalStats.stat_reset_timestamp = GetCurrentTimestamp();
	}
	else if (msg->m_resettarget == RESET_ARCHIVER)
	{
		/* Reset the archiver statistics for the cluster. */
		memset(&archiverStats, 0, sizeof(archiverStats));
		archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
	}
	else if (msg->m_resettarget == RESET_WAL)
	{
		/* Reset the WAL statistics for the cluster. */
		memset(&walStats, 0, sizeof(walStats));
		walStats.stat_reset_timestamp = GetCurrentTimestamp();
	}

	/*
	 * Presumably the sender of this message validated the target, so don't
	 * complain here if it's not valid.
	 */
}

/* ----------
 * pgstat_recv_resetsinglecounter() -
 *
 *	Reset statistics for a single object
 * ----------
 */
static void
pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
{
	PgStat_StatDBEntry *dbentry;

	dbentry = pgstat_get_db_entry(msg->m_databaseid, false);

	if (!dbentry)
		return;

	/* Set the reset timestamp for the whole database */
	dbentry->stat_reset_timestamp = GetCurrentTimestamp();

	/* Remove object if it exists, ignore it if not */
	if (msg->m_resettype == RESET_TABLE)
		(void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
						   HASH_REMOVE, NULL);
	else if (msg->m_resettype == RESET_FUNCTION)
		(void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
						   HASH_REMOVE, NULL);
}

Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
/* ----------
|
|
|
|
* pgstat_recv_resetslrucounter() -
|
|
|
|
*
|
|
|
|
* Reset some SLRU statistics of the cluster.
|
|
|
|
* ----------
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
|
|
|
|
{
|
|
|
|
int i;
|
2020-05-14 19:06:38 +02:00
|
|
|
TimestampTz ts = GetCurrentTimestamp();
|
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
|
|
|
|
for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
|
|
|
|
{
|
|
|
|
/* reset entry with the given index, or all entries (index is -1) */
|
|
|
|
if ((msg->m_index == -1) || (msg->m_index == i))
|
|
|
|
{
|
|
|
|
memset(&slruStats[i], 0, sizeof(slruStats[i]));
|
|
|
|
slruStats[i].stat_reset_timestamp = ts;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}

/* ----------
 * pgstat_recv_resetreplslotcounter() -
 *
 *	Reset some replication slot statistics of the cluster.
 * ----------
 */
static void
pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
								 int len)
{
	int			i;
	int			idx = -1;
	TimestampTz ts;

	ts = GetCurrentTimestamp();
	if (msg->clearall)
	{
		for (i = 0; i < nReplSlotStats; i++)
			pgstat_reset_replslot(i, ts);
	}
	else
	{
		/* Get the index of replication slot statistics to reset */
		idx = pgstat_replslot_index(msg->m_slotname, false);

		/*
		 * Nothing to do if the given slot entry is not found.  This could
		 * happen when the slot with the given name is removed and the
		 * corresponding statistics entry is also removed before receiving
		 * the reset message.
		 */
		if (idx < 0)
			return;

		/* Reset the stats for the requested replication slot */
		pgstat_reset_replslot(idx, ts);
	}
}

/* ----------
 * pgstat_recv_autovac() -
 *
 *	Process an autovacuum signaling message.
 * ----------
 */
static void
pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
{
	PgStat_StatDBEntry *dbentry;

	/*
	 * Store the last autovacuum time in the database's hashtable entry.
	 */
	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);

	dbentry->last_autovac_time = msg->m_start_time;
}

/* ----------
 * pgstat_recv_vacuum() -
 *
 *	Process a VACUUM message.
 * ----------
 */
static void
pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
{
	PgStat_StatDBEntry *dbentry;
	PgStat_StatTabEntry *tabentry;

	/*
	 * Store the data in the table's hashtable entry.
	 */
	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);

	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);

	tabentry->n_live_tuples = msg->m_live_tuples;
	tabentry->n_dead_tuples = msg->m_dead_tuples;

	/*
	 * It is quite possible that a non-aggressive VACUUM ended up skipping
	 * various pages, however, we'll zero the insert counter here regardless.
	 * It's currently used only to track when we need to perform an "insert"
	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
	 * Zeroing this may just mean we'll not try to vacuum the table again
	 * until enough tuples have been inserted to trigger another insert
	 * autovacuum.  An anti-wraparound autovacuum will catch any persistent
	 * stragglers.
	 */
	tabentry->inserts_since_vacuum = 0;

	if (msg->m_autovacuum)
	{
		tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
		tabentry->autovac_vacuum_count++;
	}
	else
	{
		tabentry->vacuum_timestamp = msg->m_vacuumtime;
		tabentry->vacuum_count++;
	}
}

/* ----------
 * pgstat_recv_analyze() -
 *
 *	Process an ANALYZE message.
 * ----------
 */
static void
pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
{
	PgStat_StatDBEntry *dbentry;
	PgStat_StatTabEntry *tabentry;

	/*
	 * Store the data in the table's hashtable entry.
	 */
	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);

	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);

	tabentry->n_live_tuples = msg->m_live_tuples;
	tabentry->n_dead_tuples = msg->m_dead_tuples;

	/*
	 * If commanded, reset changes_since_analyze to zero.  This forgets any
	 * changes that were committed while the ANALYZE was in progress, but we
	 * have no good way to estimate how many of those there were.
	 */
	if (msg->m_resetcounter)
		tabentry->changes_since_analyze = 0;

	if (msg->m_autovacuum)
	{
		tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
		tabentry->autovac_analyze_count++;
	}
	else
	{
		tabentry->analyze_timestamp = msg->m_analyzetime;
		tabentry->analyze_count++;
	}
}

/* ----------
 * pgstat_recv_archiver() -
 *
 *	Process an ARCHIVER message.
 * ----------
 */
static void
pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
{
	if (msg->m_failed)
	{
		/* Failed archival attempt */
		++archiverStats.failed_count;
		memcpy(archiverStats.last_failed_wal, msg->m_xlog,
			   sizeof(archiverStats.last_failed_wal));
		archiverStats.last_failed_timestamp = msg->m_timestamp;
	}
	else
	{
		/* Successful archival operation */
		++archiverStats.archived_count;
		memcpy(archiverStats.last_archived_wal, msg->m_xlog,
			   sizeof(archiverStats.last_archived_wal));
		archiverStats.last_archived_timestamp = msg->m_timestamp;
	}
}

/* ----------
 * pgstat_recv_bgwriter() -
 *
 *	Process a BGWRITER message.
 * ----------
 */
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
	globalStats.timed_checkpoints += msg->m_timed_checkpoints;
	globalStats.requested_checkpoints += msg->m_requested_checkpoints;
	globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
	globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
	globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
	globalStats.buf_written_clean += msg->m_buf_written_clean;
	globalStats.maxwritten_clean += msg->m_maxwritten_clean;
	globalStats.buf_written_backend += msg->m_buf_written_backend;
	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
	globalStats.buf_alloc += msg->m_buf_alloc;
}

/* ----------
 * pgstat_recv_wal() -
 *
 *	Process a WAL message.
 * ----------
 */
static void
pgstat_recv_wal(PgStat_MsgWal *msg, int len)
{
	walStats.wal_records += msg->m_wal_records;
	walStats.wal_fpi += msg->m_wal_fpi;
	walStats.wal_bytes += msg->m_wal_bytes;
	walStats.wal_buffers_full += msg->m_wal_buffers_full;
}

/* ----------
 * pgstat_recv_slru() -
 *
 *	Process an SLRU message.
 * ----------
 */
static void
pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
{
	slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
	slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
	slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
	slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
	slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
	slruStats[msg->m_index].flush += msg->m_flush;
	slruStats[msg->m_index].truncate += msg->m_truncate;
}

/* ----------
 * pgstat_recv_recoveryconflict() -
 *
 *	Process a RECOVERYCONFLICT message.
 * ----------
 */
static void
pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
{
	PgStat_StatDBEntry *dbentry;

	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);

	switch (msg->m_reason)
	{
		case PROCSIG_RECOVERY_CONFLICT_DATABASE:

			/*
			 * Since we drop the information about the database as soon as it
			 * replicates, there is no point in counting these conflicts.
			 */
			break;
		case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
			dbentry->n_conflict_tablespace++;
			break;
		case PROCSIG_RECOVERY_CONFLICT_LOCK:
			dbentry->n_conflict_lock++;
			break;
		case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
			dbentry->n_conflict_snapshot++;
			break;
		case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
			dbentry->n_conflict_bufferpin++;
			break;
		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
			dbentry->n_conflict_startup_deadlock++;
			break;
	}
}

/* ----------
 * pgstat_recv_deadlock() -
 *
 *	Process a DEADLOCK message.
 * ----------
 */
static void
pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
{
	PgStat_StatDBEntry *dbentry;

	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);

	dbentry->n_deadlocks++;
}

/* ----------
 * pgstat_recv_checksum_failure() -
 *
 *	Process a CHECKSUMFAILURE message.
 * ----------
 */
static void
pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
{
	PgStat_StatDBEntry *dbentry;

	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);

	dbentry->n_checksum_failures += msg->m_failurecount;
	dbentry->last_checksum_failure = msg->m_failure_time;
}


/* ----------
 * pgstat_recv_replslot() -
 *
 *	Process a REPLSLOT message.
 * ----------
 */
static void
pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
{
	int			idx;

	/*
	 * Get the index of replication slot statistics.  On dropping, we don't
	 * create the new statistics.
	 */
	idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);

	/*
	 * The slot entry is not found or there is no space to accommodate the
	 * new entry.  This could happen when the message for the creation of a
	 * slot reached before the drop message even though the actual operations
	 * happen in reverse order.  In such a case, the next update of the
	 * statistics for the same slot will create the required entry.
	 */
	if (idx < 0)
		return;

	/* it must be a valid replication slot index */
	Assert(idx < nReplSlotStats);

	if (msg->m_drop)
	{
		/* Remove the replication slot statistics with the given name */
		if (idx < nReplSlotStats - 1)
			memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
				   sizeof(PgStat_ReplSlotStats));
		nReplSlotStats--;
	}
	else
	{
		/* Update the replication slot statistics */
		replSlotStats[idx].spill_txns += msg->m_spill_txns;
		replSlotStats[idx].spill_count += msg->m_spill_count;
		replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
		replSlotStats[idx].stream_txns += msg->m_stream_txns;
		replSlotStats[idx].stream_count += msg->m_stream_count;
		replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
	}
}
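/*
 * The drop path above deletes an array entry by copying the last element
 * into the vacated slot and shrinking the count, trading element order for
 * O(1) removal.  A minimal standalone sketch of that pattern follows; the
 * "Slot" type, "slots" array, and counters are hypothetical stand-ins, not
 * the real PgStat_ReplSlotStats machinery.
 */

```c
#include <assert.h>
#include <string.h>

#define MAX_SLOTS 8

typedef struct Slot
{
	char		name[64];
	long		spill_bytes;
} Slot;

static Slot slots[MAX_SLOTS];
static int	nslots = 0;

/*
 * Remove slots[idx] by moving the last live entry into its place
 * (no move needed when idx already is the last entry).  Order is
 * not preserved, which is fine for an unordered stats array.
 */
static void
slot_remove(int idx)
{
	if (idx < nslots - 1)
		memcpy(&slots[idx], &slots[nslots - 1], sizeof(Slot));
	nslots--;
}
```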


/* ----------
 * pgstat_recv_tempfile() -
 *
 *	Process a TEMPFILE message.
 * ----------
 */
static void
pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
{
	PgStat_StatDBEntry *dbentry;

	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);

	dbentry->n_temp_bytes += msg->m_filesize;
	dbentry->n_temp_files += 1;
}


/* ----------
 * pgstat_recv_funcstat() -
 *
 *	Count what the backend has done.
 * ----------
 */
static void
pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
{
	PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
	PgStat_StatDBEntry *dbentry;
	PgStat_StatFuncEntry *funcentry;
	int			i;
	bool		found;

	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);

	/*
	 * Process all function entries in the message.
	 */
	for (i = 0; i < msg->m_nentries; i++, funcmsg++)
	{
		funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
														 (void *) &(funcmsg->f_id),
														 HASH_ENTER, &found);

		if (!found)
		{
			/*
			 * If it's a new function entry, initialize counters to the
			 * values we just got.
			 */
			funcentry->f_numcalls = funcmsg->f_numcalls;
			funcentry->f_total_time = funcmsg->f_total_time;
			funcentry->f_self_time = funcmsg->f_self_time;
		}
		else
		{
			/*
			 * Otherwise add the values to the existing entry.
			 */
			funcentry->f_numcalls += funcmsg->f_numcalls;
			funcentry->f_total_time += funcmsg->f_total_time;
			funcentry->f_self_time += funcmsg->f_self_time;
		}
	}
}


/* ----------
 * pgstat_recv_funcpurge() -
 *
 *	Arrange for dead function removal.
 * ----------
 */
static void
pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
{
	PgStat_StatDBEntry *dbentry;
	int			i;

	dbentry = pgstat_get_db_entry(msg->m_databaseid, false);

	/*
	 * No need to purge if we don't even know the database.
	 */
	if (!dbentry || !dbentry->functions)
		return;

	/*
	 * Process all function entries in the message.
	 */
	for (i = 0; i < msg->m_nentries; i++)
	{
		/* Remove from hashtable if present; we don't care if it's not. */
		(void) hash_search(dbentry->functions,
						   (void *) &(msg->m_functionid[i]),
						   HASH_REMOVE, NULL);
	}
}


/* ----------
 * pgstat_write_statsfile_needed() -
 *
 *	Do we need to write out any stats files?
 * ----------
 */
static bool
pgstat_write_statsfile_needed(void)
{
	if (pending_write_requests != NIL)
		return true;

	/* Everything was written recently */
	return false;
}


/* ----------
 * pgstat_db_requested() -
 *
 *	Checks whether stats for a particular DB need to be written to a file.
 * ----------
 */
static bool
pgstat_db_requested(Oid databaseid)
{
	/*
	 * If any requests are outstanding at all, we should write the stats for
	 * shared catalogs (the "database" with OID 0).  This ensures that
	 * backends will see up-to-date stats for shared catalogs, even though
	 * they send inquiry messages mentioning only their own DB.
	 */
	if (databaseid == InvalidOid && pending_write_requests != NIL)
		return true;

	/* Search to see if there's an open request to write this database. */
	if (list_member_oid(pending_write_requests, databaseid))
		return true;

	return false;
}


/*
 * Convert a potentially unsafely truncated activity string (see
 * PgBackendStatus.st_activity_raw's documentation) into a correctly
 * truncated one.
 *
 * The returned string is allocated in the caller's memory context and may be
 * freed.
 */
char *
pgstat_clip_activity(const char *raw_activity)
{
	char	   *activity;
	int			rawlen;
	int			cliplen;

	/*
	 * Some callers, like pgstat_get_backend_current_activity(), do not
	 * guarantee that the buffer isn't concurrently modified.  We try to take
	 * care that the buffer is always terminated by a NUL byte regardless,
	 * but let's still be paranoid about the string's length.  In those cases
	 * the underlying buffer is guaranteed to be
	 * pgstat_track_activity_query_size large.
	 */
	activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);

	/* now double-guaranteed to be NUL terminated */
	rawlen = strlen(activity);

	/*
	 * All supported server-encodings make it possible to determine the
	 * length of a multi-byte character from its first byte (this is not the
	 * case for client encodings, see GB18030).  As st_activity is always
	 * stored using server encoding, this allows us to perform multi-byte
	 * aware truncation, even if the string earlier was truncated in the
	 * middle of a multi-byte character.
	 */
	cliplen = pg_mbcliplen(activity, rawlen,
						   pgstat_track_activity_query_size - 1);

	activity[cliplen] = '\0';

	return activity;
}
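/*
 * The comment above leans on the property that, in every supported server
 * encoding, a character's byte length is determined by its first byte.  A
 * minimal standalone sketch of that idea for UTF-8 only follows; the helper
 * names are hypothetical and this is far simpler than pg_mbcliplen, which
 * handles arbitrary server encodings.
 */

```c
#include <assert.h>
#include <string.h>

/* Byte length of a UTF-8 character, inferred from its first byte alone */
static int
utf8_char_len(unsigned char first)
{
	if (first < 0x80)
		return 1;				/* ASCII */
	if ((first & 0xE0) == 0xC0)
		return 2;
	if ((first & 0xF0) == 0xE0)
		return 3;
	return 4;
}

/*
 * Length of the largest prefix of s (len bytes long) that fits within
 * limit bytes without splitting a multi-byte character.
 */
static int
utf8_cliplen(const char *s, int len, int limit)
{
	int			clip = 0;

	while (clip < len)
	{
		int			l = utf8_char_len((unsigned char) s[clip]);

		/* stop before a character that would overrun the limit or input */
		if (clip + l > limit || clip + l > len)
			break;
		clip += l;
	}
	return clip;
}
```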


/* ----------
 * pgstat_replslot_index
 *
 *	Return the index of entry of a replication slot with the given name, or
 *	-1 if the slot is not found.
 *
 *	create_it tells whether to create the new slot entry if it is not found.
 * ----------
 */
static int
pgstat_replslot_index(const char *name, bool create_it)
{
	int			i;

	Assert(nReplSlotStats <= max_replication_slots);
	for (i = 0; i < nReplSlotStats; i++)
	{
		if (strcmp(replSlotStats[i].slotname, name) == 0)
			return i;			/* found */
	}

	/*
	 * The slot is not found.  We don't want to register the new statistics
	 * if the list is already full or the caller didn't request.
	 */
	if (i == max_replication_slots || !create_it)
		return -1;

	/* Register new slot */
	memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
	strlcpy(replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);

	return nReplSlotStats++;
}


/* ----------
 * pgstat_reset_replslot
 *
 *	Reset the replication slot stats at index 'i'.
 * ----------
 */
static void
pgstat_reset_replslot(int i, TimestampTz ts)
{
	/* reset only counters.  Don't clear slot name */
	replSlotStats[i].spill_txns = 0;
	replSlotStats[i].spill_count = 0;
	replSlotStats[i].spill_bytes = 0;
	replSlotStats[i].stream_txns = 0;
	replSlotStats[i].stream_count = 0;
	replSlotStats[i].stream_bytes = 0;
	replSlotStats[i].stat_reset_timestamp = ts;
}


/*
 * pgstat_slru_index
 *
 * Determine index of entry for a SLRU with a given name.  If there's no
 * exact match, returns index of the last "other" entry used for SLRUs
 * defined in external projects.
 */
int
pgstat_slru_index(const char *name)
{
	int			i;

	for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
	{
		if (strcmp(slru_names[i], name) == 0)
			return i;
	}

	/* return index of the last entry (which is the "other" one) */
	return (SLRU_NUM_ELEMENTS - 1);
}
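/*
 * The lookup above maps a name to a fixed table index, with unknown names
 * collapsing into a trailing catch-all bucket.  A standalone sketch of that
 * pattern follows; the table contents and function names here are made-up
 * placeholders, not the real slru_names list.
 */

```c
#include <assert.h>
#include <string.h>

#define NUM_NAMES 4

/* the last entry serves as the catch-all bucket */
static const char *const names[NUM_NAMES] = {
	"alpha", "beta", "gamma", "other"
};

/*
 * Index of name in the fixed table; names not in the table map to the
 * trailing "other" bucket, as pgstat_slru_index does for SLRUs defined
 * by extensions.
 */
static int
name_index(const char *name)
{
	int			i;

	for (i = 0; i < NUM_NAMES; i++)
	{
		if (strcmp(names[i], name) == 0)
			return i;
	}

	/* unknown: return index of the last ("other") entry */
	return NUM_NAMES - 1;
}
```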
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pgstat_slru_name
|
|
|
|
*
|
|
|
|
* Returns SLRU name for an index. The index may be above SLRU_NUM_ELEMENTS,
|
|
|
|
* in which case this returns NULL. This allows writing code that does not
|
|
|
|
* know the number of entries in advance.
|
|
|
|
*/
|
Improve management of SLRU statistics collection.
Instead of re-identifying which statistics bucket to use for a given
SLRU on every counter increment, do it once during shmem initialization.
This saves a fair number of cycles, and there's no real cost because
we could not have a bucket assignment that varies over time or across
backends anyway.
Also, get rid of the ill-considered decision to let pgstat.c pry
directly into SLRU's shared state; it's cleaner just to have slru.c
pass the stats bucket number.
In consequence of these changes, there's no longer any need to store
an SLRU's LWLock tranche info in shared memory, so get rid of that,
making this a net reduction in shmem consumption. (That partly
reverts fe702a7b3.)
This is basically code review for 28cac71bd, so I also cleaned up
some comments, removed a dangling extern declaration, fixed some
things that should be static and/or const, etc.
Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
2020-05-13 19:08:12 +02:00
|
|
|
const char *
|
|
|
|
pgstat_slru_name(int slru_idx)
|
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
{
|
Improve management of SLRU statistics collection.
Instead of re-identifying which statistics bucket to use for a given
SLRU on every counter increment, do it once during shmem initialization.
This saves a fair number of cycles, and there's no real cost because
we could not have a bucket assignment that varies over time or across
backends anyway.
Also, get rid of the ill-considered decision to let pgstat.c pry
directly into SLRU's shared state; it's cleaner just to have slru.c
pass the stats bucket number.
In consequence of these changes, there's no longer any need to store
an SLRU's LWLock tranche info in shared memory, so get rid of that,
making this a net reduction in shmem consumption. (That partly
reverts fe702a7b3.)
This is basically code review for 28cac71bd, so I also cleaned up
some comments, removed a dangling extern declaration, fixed some
things that should be static and/or const, etc.
Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
2020-05-13 19:08:12 +02:00
|
|
|
if (slru_idx < 0 || slru_idx >= SLRU_NUM_ELEMENTS)
|
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
|
|
|
return NULL;
|
|
|
|
|
Improve management of SLRU statistics collection.
Instead of re-identifying which statistics bucket to use for a given
SLRU on every counter increment, do it once during shmem initialization.
This saves a fair number of cycles, and there's no real cost because
we could not have a bucket assignment that varies over time or across
backends anyway.
Also, get rid of the ill-considered decision to let pgstat.c pry
directly into SLRU's shared state; it's cleaner just to have slru.c
pass the stats bucket number.
In consequence of these changes, there's no longer any need to store
an SLRU's LWLock tranche info in shared memory, so get rid of that,
making this a net reduction in shmem consumption. (That partly
reverts fe702a7b3.)
This is basically code review for 28cac71bd, so I also cleaned up
some comments, removed a dangling extern declaration, fixed some
things that should be static and/or const, etc.
Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
2020-05-13 19:08:12 +02:00
	return slru_names[slru_idx];
}

/*
 * slru_entry
 *
 * Returns pointer to entry with counters for given SLRU (based on the name
 * stored in SlruCtl as lwlock tranche name).
 */
static inline PgStat_MsgSLRU *
slru_entry(int slru_idx)
{
Fix async.c to not register any SLRU stats counts in the postmaster.
Previously, AsyncShmemInit forcibly initialized the first page of the
async SLRU, to save dealing with that case in asyncQueueAddEntries.
But this is a poor tradeoff, since many installations do not ever use
NOTIFY; for them, expending those cycles in AsyncShmemInit is a
complete waste. Besides, this only saves a couple of instructions
in asyncQueueAddEntries, which hardly seems likely to be measurable.
The real reason to change this now, though, is that now that we track
SLRU access stats, the existing code is causing the postmaster to
accumulate some access counts, which then get inherited into child
processes by fork(), messing up the statistics. Delaying the
initialization into the first child that does a NOTIFY fixes that.
Hence, we can revert f3d23d83e, which was an incorrect attempt at
fixing that issue. Also, add an Assert to pgstat.c that should
catch any future errors of the same sort.
Discussion: https://postgr.es/m/8367.1589391884@sss.pgh.pa.us
2020-05-14 04:48:09 +02:00
	/*
	 * The postmaster should never register any SLRU statistics counts; if it
	 * did, the counts would be duplicated into child processes via fork().
	 */
	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);

	Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));

	return &SLRUStats[slru_idx];
}
/*
 * SLRU statistics count accumulation functions --- called from slru.c
 */
void
pgstat_count_slru_page_zeroed(int slru_idx)
{
	slru_entry(slru_idx)->m_blocks_zeroed += 1;
}
void
pgstat_count_slru_page_hit(int slru_idx)
{
	slru_entry(slru_idx)->m_blocks_hit += 1;
}
void
pgstat_count_slru_page_exists(int slru_idx)
{
	slru_entry(slru_idx)->m_blocks_exists += 1;
}
void
pgstat_count_slru_page_read(int slru_idx)
{
	slru_entry(slru_idx)->m_blocks_read += 1;
Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and a system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:11:38 +02:00
}
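The fixed list of built-in SLRUs with a single catch-all entry for extension-defined ones, as described in the "Collect statistics about SLRU caches" message above, might resolve a name to a bucket index like this; the list contents and the helper name are illustrative assumptions, not the exact pgstat.c definitions:

```c
#include <assert.h>
#include <string.h>

/*
 * Hedged sketch: map an SLRU name to a fixed stats bucket, with every
 * unrecognized (extension-defined) SLRU folded into the final "other"
 * entry.  The name list here is illustrative.
 */
static const char *const slru_names[] = {
	"CommitTs", "MultiXactOffset", "MultiXactMember",
	"Notify", "Serial", "Subtrans", "Xact",
	"other"						/* catch-all for additional SLRUs */
};

#define SLRU_NUM_ELEMENTS ((int) (sizeof(slru_names) / sizeof(slru_names[0])))

static int
slru_index(const char *name)
{
	int			i;

	for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
	{
		if (strcmp(slru_names[i], name) == 0)
			return i;
	}
	/* not a built-in SLRU: share the last ("other") bucket */
	return SLRU_NUM_ELEMENTS - 1;
}
```

Because extensions adding their own SLRUs are rare, collapsing all of them into one shared bucket keeps the registry fixed-size without losing much monitoring value.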
void
pgstat_count_slru_page_written(int slru_idx)
{
slru_entry(slru_idx)->m_blocks_written += 1;
}
void
pgstat_count_slru_flush(int slru_idx)
{
slru_entry(slru_idx)->m_flush += 1;
}
void
pgstat_count_slru_truncate(int slru_idx)
{
slru_entry(slru_idx)->m_truncate += 1;
}