behavior in cases where we don't know the heap tuple count accurately; in
particular partial vacuum, but this also makes the API a bit more useful
for ANALYZE. This patch adds "estimated_count" flags to both structs so
that an approximate count can be flagged as such, and adjusts the logic
so that approximate counts are not used for updating pg_class.reltuples.
This fixes my previous complaint that VACUUM was putting ridiculous values
into pg_class.reltuples for indexes. The actual impact of that bug is
limited, because the planner only pays attention to reltuples for an index
if the index is partial; which probably explains why beta testers hadn't
noticed a degradation in plan quality from it. But it needs to be fixed.
The whole thing is a bit messy and should be redesigned in future, because
reltuples now has the potential to drift quite far away from reality when
a long period elapses with no non-partial vacuums. But this is as good as
it's going to get for 8.4.
skipped. We could update relpages anyway, but it seems better to only
update it together with reltuples, because we use the reltuples/relpages
ratio in the planner. Also don't update n_live_tuples in pgstat.
ANALYZE in VACUUM ANALYZE now needs to update pg_class, if the
VACUUM-phase didn't do so. Added some boolean-passing to let analyze_rel
know if it should update pg_class or not.
I also moved the relcache invalidation (to update rd_targblock) from
vac_update_relstats to where RelationTruncate is called, because
vac_update_relstats is not called for partial vacuums anymore. It's more
obvious to send the invalidation close to the truncation that requires it.
Per report by Ned T. Crigler.
where no function stats entries exist. Partial response to Pavel's
observation that small VACUUM operations are noticeably slower in CVS HEAD
than 8.3.
upon requests from backends, rather than on a fixed 500msec cycle. (There's
still throttling logic to ensure it writes no more often than once per
500msec, though.) This should result in a significant reduction in stats file
write traffic in typical scenarios where the stats are demanded only
infrequently.
This approach also means that the former difficulty with changing
stats_temp_directory on-the-fly has gone away, so remove the caution about
that as well as the thrashing we did to minimize the trouble window.
In passing, also fix pgstat_report_stat() so that we will send a stats
message if we have function call stats but not table stats to report;
this fixes a bug in the recent patch to support function-call stats.
Martin Pihlak
variable stats_temp_directory, instead of requiring the admin to
mount/symlink the pg_stat_tmp directory manually.
For now the config variable is PGC_POSTMASTER. Room for further improvment
that would allow it to be changed on-the-fly.
This allows the use of a ramdrive (either through mount or symlink) for
the temporary file that's written every half second, which should
reduce I/O.
On server shutdown/startup, the file is written to the old location in
the global directory, to preserve data across restarts.
Bump catversion since the $PGDATA directory layout changed.
As the buffer could now be a lot larger than before, and copying it could
thus be a lot more expensive than before, use strcpy instead of memcpy to
copy the query string, as was already suggested in comments. Also, only copy
the PgBackendStatus struct and string if the slot is in use.
Patch by Thomas Lee, with some changes by me.
corresponding struct definitions. This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
functions.
Note that because this patch changes FmgrInfo, any external C functions
you might be testing with 8.4 will need to be recompiled.
Patch by Martin Pihlak, some editorialization by me (principally, removing
tracking of getrusage() numbers)
classed all as "dead"; also get it to count DEAD item pointers as dead rows,
instead of ignoring them as before. Also improve matters so that tuples
previously inserted or deleted by our own transaction are handled nicely:
the stats collector's live-tuple and dead-tuple counts will end up correct
after our transaction ends, regardless of whether we end in commit or abort.
While there's more work that could be done to improve the counting of in-doubt
tuples in both VACUUM and ANALYZE, this commit is enough to alleviate some
known bad behaviors in 8.3; and the other stuff that's been discussed seems
like research projects anyway.
Pavan Deolasee and Tom Lane
query texts only to the server log. This eliminates the issue of possible
leaking of security-sensitive data in other sessions' queries. Since the
log is presumed secure, we can now log the queries of all sessions involved
in the deadlock, whether or not they belong to the same user as the one
reporting the failure.
(if they'd be visible to the current user in pg_stat_activity).
This might look like it's subject to race conditions, but it's actually
pretty safe because at the time DeadLockReport() is constructing the
report, we haven't yet aborted our transaction and so we can expect that
everyone else involved in the deadlock is still blocked on some lock.
(There are corner cases where that might not be true, such as a statement
timeout triggering in another backend before we finish reporting; but at
worst we'd report a misleading activity string, so it seems acceptable
considering the usefulness of reporting the queries.)
Original patch by Itagaki Takahiro, heavily modified by me.
buffers that cannot possibly need to be cleaned, and estimates how many
buffers it should try to clean based on moving averages of recent allocation
requests and density of reusable buffers. The patch also adds a couple
more columns to pg_stat_bgwriter to help measure the effectiveness of the
bgwriter.
Greg Smith, building on his own work and ideas from several other people,
in particular a much older patch from Itagaki Takahiro.
* stats_start_collector goes away; we always start the collector process,
unless prevented by a problem with setting up the stats UDP socket.
* stats_reset_on_server_start goes away; it seems useless in view of the
availability of pg_stat_reset().
* stats_block_level and stats_row_level are merged into a single variable
"track_counts", which controls all reports sent to the collector process.
* stats_command_string is renamed to track_activities.
* log_autovacuum is renamed to log_autovacuum_min_duration to better reflect
its meaning.
The log_autovacuum change is not a compatibility issue since it didn't exist
before 8.3 anyway. The other changes need to be release-noted.
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version. Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.
In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.
Pavan Deolasee, with help from a bunch of other people.
so that we will be able to create a cookie for all processes for CSVlogs.
It is set wherever MyProcPid is set. Take the opportunity to remove the now
unnecessary session-only restriction on the %s and %c escapes in log_line_prefix.
we don't know at that point which relation OID to tell pgstat to forget.
The code was passing the relfilenode, which is incorrect, and could possibly
cause some other relation's stats to be zeroed out. While we could try to
clean this up, it seems much simpler and more reliable to let the next
invocation of pgstat_vacuum_tabstat() fix things; which indeed is how it
worked before I introduced the buggy code into 8.1.3 and later :-(.
Problem noticed by Itagaki Takahiro, fix is per subsequent discussion.
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.
Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.
Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
or abort within a backend; rearrange InitPostgres processing to make it so.
Revealed by just-added Asserts along with ECPG regression tests (hm, I wonder
why the core regression tests didn't expose it?). This possibly is another
reason for missing stats updates ...
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.
Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures. And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
messages to the stats collector. This avoids the problem that enabling
stats_row_level for autovacuum has a significant overhead for short
read-only transactions, as noted by Arjen van der Meijden. We can avoid
an extra gettimeofday call by piggybacking on the one done for WAL-logging
xact commit or abort (although that doesn't help read-only transactions,
since they don't WAL-log anything).
In my proposal for this, I noted that we could change the WAL log entries
for commit/abort to record full TimestampTz precision, instead of only
time_t as at present. That's not done in this patch, but will be committed
separately.
when a relation is opened multiple times in the same transaction. This is
particularly useful for system catalogs, which we may heap_open or index_open
many times in a transaction, and it doesn't really cost anything extra even
if the rel is touched but once. Motivated by study of an example from Greg
Stark, in which pgstat_initstats() accounted for an unreasonably large
fraction of the runtime.
its table list and then rechecks pgstat before vacuuming each table to
verify that no one has vacuumed the table in the meantime.
In the current autovacuum world this only means that a worker will not
vacuum a table that a user has vacuumed manually after the worker started.
When support for multiple autovacuum workers is introduced, this will reduce
the probability of simultaneous workers on the same database doing redundant
work.
continuously, and requests vacuum runs of "autovacuum workers" to postmaster.
The workers do the actual vacuum work. This allows for future improvements,
like allowing multiple autovacuum jobs running in parallel.
For now, the code keeps the original behavior of having a single autovac
process at any time by sleeping until the previous worker has finished.
entries for the victim database go away sooner rather than later. We already
did the equivalent thing at the per-relation level, not sure why it's not
been done for whole databases. With this change, pgstat_vacuum_tabstat
should usually not find anything to do; though we still need it as a backstop
in case DROPDB or TABPURGE messages get lost under load.
already collected in the current transaction; this allows plpgsql functions to
watch for stats updates even though they are confined to a single transaction.
Use this instead of the previous kluge involving pg_stat_file() to wait for
the stats collector to update in the stats regression test. Internally,
decouple storage of stats snapshots from transaction boundaries; they'll
now stick around until someone calls pgstat_clear_snapshot --- which xact.c
still does at transaction end, to maintain the previous behavior. This makes
the logic a lot cleaner, at the price of a couple dozen cycles per transaction
exit.
input in the stats collector. Our select() emulation is apparently buggy
for UDP sockets :-(. This should resolve problems with stats collection
(and hence autovacuum) failing under more than minimal load. Diagnosis
and patch by Magnus Hagander.
Patch probably needs to be back-ported to 8.1 and 8.0, but first let's
see if it makes the buildfarm happy...
(or other types of pg_class entry): the function pgstat_vacuum_tabstat,
invoked during VACUUM startup, had runtime proportional to the number of
stats table entries times the number of pg_class rows; in other words
O(N^2) if the stats collector's information is reasonably complete.
Replace list searching with a hash table to bring it back to O(N)
behavior. Per report from kim at myemma.com.
Back-patch as far as 8.1; 8.0 and before use different coding here.
identify long-running transactions. Since we already need to record
the transaction-start time (e.g. for now()), we don't need any
additional system calls to report this information.
Catversion bumped, initdb required.
Windows), arrange for each postmaster child process to be its own process
group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
process group not only the direct child process. This provides saner behavior
for archive and recovery scripts; in particular, it's possible to shut down a
warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
of SIGQUIT to the startup subprocess will result in killing the waiting
recovery_command. Also, this makes Query Cancel and statement_timeout apply
to scripts being run from backends via system(). (There is no support in the
core backend for that, but it's widely done using untrusted PLs.) Per gripe
from Stephen Harris and subsequent discussion.