Tracks one counter for each database, which is reset whenever
the statistics for any individual object inside the database is
reset, and one counter for the background writer.
Tomas Vondra, reviewed by Greg Smith
We were failing to zero out some pg_stat_database counters that have
been added since the initial pgstats coding. This is a bug, but not
back-patching the fix since changing this behavior in a minor release
seems a cure worse than the disease.
Report and patch by Tomas Vondra.
This new field counts the number of times that a backend which writes a
buffer out to the OS must also fsync() it. This happens when the
bgwriter fsync request queue is full, and is generally detrimental to
performance, so it's good to know when it's happening. Along the way,
log a new message at level DEBUG1 whenever we fail to hand off an fsync,
so that the problem can also be seen in examination of log files
(if the logging level is cranked up high enough).
Greg Smith, with minor tweaks by me.
We may as well make pgstat_count_heap_scan() and related macros just count
whenever rel->pgstat_info isn't null. Testing pgstat_track_counts buys
nothing at all in the normal case where that flag is ON; and when it's OFF,
the pgstat_info link will be null, so it's still a useless test.
This change is unlikely to buy any noticeable performance improvement,
but a cycle shaved is a cycle earned; and my investigations earlier today
convinced me that we're down to the point where individual instructions in
the inner execution loops are starting to matter.
statistics counts. These numbers are being accumulated but haven't yet been
transmitted to the collector (and won't be, until the transaction ends).
For some purposes, though, it's handy to be able to look at them.
Joel Jacobson, reviewed by Itagaki Takahiro
unable to read a stats file for reasons other than ENOENT, and having to reset
last_statrequest because it's later than current time in the collector.
Not clear if this will shed any light on the "pgstat wait timeout" business,
but it seems like a good idea in general.
In passing, do some message-style-police work on recently-added
pgstat_reset_shared_counters code.
This silences some warnings on Win64. Not using the proper SOCKET datatype
was actually wrong on Win32 as well, but didn't cause any warnings there.
Also create define PGINVALID_SOCKET to indicate an invalid/non-existing
socket, instead of using a hardcoded -1 value.
decisions about when to auto-analyze.
The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
where all three of these numbers could be bad estimates from ANALYZE itself.
Even worse, in the presence of a steady flow of HOT updates and matching
HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
three numbers are exactly right, because n_dead_tuples could hold steady.
To fix, replace last_anl_tuples with an accurately tracked count of the total
number of committed tuple inserts + updates + deletes since the last ANALYZE
on the table. This can still be compared to the same threshold as before, but
it's much more trustworthy than the old computation. Tracking this requires
one more intra-transaction counter per modified table within backends, but no
additional memory space in the stats collector. There probably isn't any
measurable speed difference; if anything it might be a bit faster than before,
since I was able to eliminate some per-tuple arithmetic operations in favor of
adding sums once per (sub)transaction.
Also, simplify the logic around pgstat vacuum and analyze reporting messages
by not trying to fold VACUUM ANALYZE into a single pgstat message.
The original thought behind this patch was to allow scheduling of analyzes
on parent tables by artificially inflating their changes_since_analyze count.
I've left that for a separate patch since this change seems to stand on its
own merit.
The temporary hash tables made by pgstat_collect_oids should be allocated
in a short-term memory context, which is not the default behavior of
hash_create. Noted while looking through hash_create calls in connection
with Robert Haas' recent complaint.
This is a pre-existing bug, but it doesn't seem important enough to
back-patch. The hash table is not so large that it would matter unless this
happened many times within a session, which seems quite unlikely.
Formerly, these message types would be discarded unless there was already
a stats hash table entry for the target table. However, the intent of
saving hash table space for unused tables was subverted by the fact that
the physical I/O done by the vacuum or analyze would result in an immediately
following tabstat message, which would create the hash table entry anyway.
All that we had left was surprising loss of statistical data, as in a recent
complaint from Jaime Casanova.
It seems unlikely that a real database would have many tables that go totally
untouched over the long haul, so the consensus is that this "optimization"
serves little purpose anyhow. Remove it, and just create the hash table
entry on demand in all cases.
To make this work in the base case, pg_database now has a nailed-in-cache
relation descriptor that is initialized using hardwired knowledge in
relcache.c. This means pg_database is added to the set of relations that
need to have a Schema_pg_xxx macro maintained in pg_attribute.h. When this
path is taken, we'll have to do a seqscan of pg_database to find the row
we need.
In the normal case, we are able to do an indexscan to find the database's row
by name. This is made possible by storing a global relcache init file that
describes only the shared catalogs and their indexes (and therefore is usable
by all backends in any database). A new backend loads this cache file,
finds its database OID after an indexscan on pg_database, and then loads
the local relcache init file for that database.
This change should effectively eliminate number of databases as a factor
in backend startup time, even with large numbers of databases. However,
the real reason for doing it is as a first step towards getting rid of
the flat files altogether. There are still several other sub-projects
to be tackled before that can happen.
behavior in cases where we don't know the heap tuple count accurately; in
particular partial vacuum, but this also makes the API a bit more useful
for ANALYZE. This patch adds "estimated_count" flags to both structs so
that an approximate count can be flagged as such, and adjusts the logic
so that approximate counts are not used for updating pg_class.reltuples.
This fixes my previous complaint that VACUUM was putting ridiculous values
into pg_class.reltuples for indexes. The actual impact of that bug is
limited, because the planner only pays attention to reltuples for an index
if the index is partial; which probably explains why beta testers hadn't
noticed a degradation in plan quality from it. But it needs to be fixed.
The whole thing is a bit messy and should be redesigned in future, because
reltuples now has the potential to drift quite far away from reality when
a long period elapses with no non-partial vacuums. But this is as good as
it's going to get for 8.4.
skipped. We could update relpages anyway, but it seems better to only
update it together with reltuples, because we use the reltuples/relpages
ratio in the planner. Also don't update n_live_tuples in pgstat.
ANALYZE in VACUUM ANALYZE now needs to update pg_class, if the
VACUUM-phase didn't do so. Added some boolean-passing to let analyze_rel
know if it should update pg_class or not.
I also moved the relcache invalidation (to update rd_targblock) from
vac_update_relstats to where RelationTruncate is called, because
vac_update_relstats is not called for partial vacuums anymore. It's more
obvious to send the invalidation close to the truncation that requires it.
Per report by Ned T. Crigler.
where no function stats entries exist. Partial response to Pavel's
observation that small VACUUM operations are noticeably slower in CVS HEAD
than 8.3.
upon requests from backends, rather than on a fixed 500msec cycle. (There's
still throttling logic to ensure it writes no more often than once per
500msec, though.) This should result in a significant reduction in stats file
write traffic in typical scenarios where the stats are demanded only
infrequently.
This approach also means that the former difficulty with changing
stats_temp_directory on-the-fly has gone away, so remove the caution about
that as well as the thrashing we did to minimize the trouble window.
In passing, also fix pgstat_report_stat() so that we will send a stats
message if we have function call stats but not table stats to report;
this fixes a bug in the recent patch to support function-call stats.
Martin Pihlak
variable stats_temp_directory, instead of requiring the admin to
mount/symlink the pg_stat_tmp directory manually.
For now the config variable is PGC_POSTMASTER. Room for further improvment
that would allow it to be changed on-the-fly.
This allows the use of a ramdrive (either through mount or symlink) for
the temporary file that's written every half second, which should
reduce I/O.
On server shutdown/startup, the file is written to the old location in
the global directory, to preserve data across restarts.
Bump catversion since the $PGDATA directory layout changed.
As the buffer could now be a lot larger than before, and copying it could
thus be a lot more expensive than before, use strcpy instead of memcpy to
copy the query string, as was already suggested in comments. Also, only copy
the PgBackendStatus struct and string if the slot is in use.
Patch by Thomas Lee, with some changes by me.
corresponding struct definitions. This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
functions.
Note that because this patch changes FmgrInfo, any external C functions
you might be testing with 8.4 will need to be recompiled.
Patch by Martin Pihlak, some editorialization by me (principally, removing
tracking of getrusage() numbers)
classed all as "dead"; also get it to count DEAD item pointers as dead rows,
instead of ignoring them as before. Also improve matters so that tuples
previously inserted or deleted by our own transaction are handled nicely:
the stats collector's live-tuple and dead-tuple counts will end up correct
after our transaction ends, regardless of whether we end in commit or abort.
While there's more work that could be done to improve the counting of in-doubt
tuples in both VACUUM and ANALYZE, this commit is enough to alleviate some
known bad behaviors in 8.3; and the other stuff that's been discussed seems
like research projects anyway.
Pavan Deolasee and Tom Lane
query texts only to the server log. This eliminates the issue of possible
leaking of security-sensitive data in other sessions' queries. Since the
log is presumed secure, we can now log the queries of all sessions involved
in the deadlock, whether or not they belong to the same user as the one
reporting the failure.
(if they'd be visible to the current user in pg_stat_activity).
This might look like it's subject to race conditions, but it's actually
pretty safe because at the time DeadLockReport() is constructing the
report, we haven't yet aborted our transaction and so we can expect that
everyone else involved in the deadlock is still blocked on some lock.
(There are corner cases where that might not be true, such as a statement
timeout triggering in another backend before we finish reporting; but at
worst we'd report a misleading activity string, so it seems acceptable
considering the usefulness of reporting the queries.)
Original patch by Itagaki Takahiro, heavily modified by me.
buffers that cannot possibly need to be cleaned, and estimates how many
buffers it should try to clean based on moving averages of recent allocation
requests and density of reusable buffers. The patch also adds a couple
more columns to pg_stat_bgwriter to help measure the effectiveness of the
bgwriter.
Greg Smith, building on his own work and ideas from several other people,
in particular a much older patch from Itagaki Takahiro.
* stats_start_collector goes away; we always start the collector process,
unless prevented by a problem with setting up the stats UDP socket.
* stats_reset_on_server_start goes away; it seems useless in view of the
availability of pg_stat_reset().
* stats_block_level and stats_row_level are merged into a single variable
"track_counts", which controls all reports sent to the collector process.
* stats_command_string is renamed to track_activities.
* log_autovacuum is renamed to log_autovacuum_min_duration to better reflect
its meaning.
The log_autovacuum change is not a compatibility issue since it didn't exist
before 8.3 anyway. The other changes need to be release-noted.