childprocess deaths instead of using one thread per child. This drastastically
reduces the address space usage and should allow for more backends running.
Also change the win32_waitpid functionality to use an IO Completion Port for
queueing child death notices instead of using a fixed-size array.
having several of them. Add two more flags: whether the process is
executing an ANALYZE, and whether a vacuum is for Xid wraparound (which
is obviously only set by autovacuum).
Sneakily move the worker's recently-acquired PostAuthDelay to a more useful
place.
with the next table on schedule instead of exiting, in all cases instead of
just on query cancel.
Add a errcontext() line indicating the activity of the worker to the error
message when it is cancelled.
Change the WorkerInfo struct to contain a pointer to the worker's PGPROC
instead of just the PID.
Add forgotten post-auth delays, per Simon Riggs. Also to autovac launcher.
- create a separate archive_mode GUC, on which archive_command is dependent
- %r option in recovery.conf sends last restartpoint to recovery command
- %r used in pg_standby, updated README
- minor other code cleanup in pg_standby
- doc on Warm Standby now mentions pg_standby and %r
- log_restartpoints recovery option emits LOG message at each restartpoint
- end of recovery now displays last transaction end time, as requested
by Warren Little; also shown at each restartpoint
- restart archiver if needed to carry away WAL files at shutdown
Simon Riggs
buffers that cannot possibly need to be cleaned, and estimates how many
buffers it should try to clean based on moving averages of recent allocation
requests and density of reusable buffers. The patch also adds a couple
more columns to pg_stat_bgwriter to help measure the effectiveness of the
bgwriter.
Greg Smith, building on his own work and ideas from several other people,
in particular a much older patch from Itagaki Takahiro.
* stats_start_collector goes away; we always start the collector process,
unless prevented by a problem with setting up the stats UDP socket.
* stats_reset_on_server_start goes away; it seems useless in view of the
availability of pg_stat_reset().
* stats_block_level and stats_row_level are merged into a single variable
"track_counts", which controls all reports sent to the collector process.
* stats_command_string is renamed to track_activities.
* log_autovacuum is renamed to log_autovacuum_min_duration to better reflect
its meaning.
The log_autovacuum change is not a compatibility issue since it didn't exist
before 8.3 anyway. The other changes need to be release-noted.
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version. Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.
In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.
Pavan Deolasee, with help from a bunch of other people.
we'd dump core anyway immediately afterward if it were null; and it
seems to confuse some versions of icc into generating bad code.
Per report from Sergey Koposov. Patched in HEAD only, for the moment,
since this is only likely to affect developers.
recover from elog(ERROR). Problem was created by introduction of hash seq
search tracking awhile back, and affects all branches that have bgwriter;
in HEAD the disease has snuck into autovacuum and walwriter too. (Not sure
that the latter two use hash_seq_search at the moment, but surely they might
someday.) Per report from Sergey Koposov.
constant flow of new connection requests could prevent the postmaster from
completing a shutdown or crash restart. This is done by labeling child
processes that are "dead ends", that is, we know that they were launched only
to tell a client that it can't connect. These processes are managed
separately so that they don't confuse us into thinking that we can't advance
to the next stage of a shutdown or restart sequence, until the very end
where we must wait for them to drain out so we can delete the shmem segment.
Per discussion of a misbehavior reported by Keaton Adams.
Since this code was baroque already, and my first attempt at fixing the
problem made it entirely impenetrable, I took the opportunity to rewrite it
in a state-machine style. That eliminates some duplicated code sections and
hopefully makes everything a bit clearer.
as well as regular backends: if no regular backend launches before the autovac
launcher tries to start an autovac worker, the postmaster would get an Assert
fault due to calling PostmasterRandom before random_seed was initialized.
Cleanest solution seems to be to take the initialization of random_seed out
of ServerLoop and let PostmasterRandom do it for itself.
displayed in the postmaster log. This avoids Windows-specific problems with
localized time zone names that are in the wrong encoding, and generally seems
like a good idea to forestall other potential platform-dependent issues.
To preserve the existing behavior that all backends will log in the same time
zone, create a new GUC variable log_timezone that can only be changed on a
system-wide basis, and reference log-related calculations to that zone instead
of the TimeZone variable.
This fixes the issue reported by Hiroshi Saito that timestamps printed by
xlog.c startup could be improperly localized on Windows. We still need a
simpler patch for that problem in the back branches, however.
not bothering to initialize is_autovacuum for regular backends, meaning there
was a significant chance of the postmaster prematurely sending them SIGTERM
during database shutdown. Also, leaving the cancel key unset for an autovac
worker meant that any client could send it SIGINT, which doesn't sound
especially good either.
so that we will be able to create a cookie for all processes for CSVlogs.
It is set wherever MyProcPid is set. Take the opportunity to remove the now
unnecessary session-only restriction on the %s and %c escapes in log_line_prefix.
and fsync WAL at convenient intervals. For the moment it just tries to
offload this work from backends, but soon it will be responsible for
guaranteeing a maximum delay before asynchronously-committed transactions
will be flushed to disk.
This is a portion of Simon Riggs' async-commit patch, committed to CVS
separately because a background WAL writer seems like it might be a good idea
independently of the async-commit feature. I rebased walwriter.c on
bgwriter.c because it seemed like a more appropriate way of handling signals;
while the startup/shutdown logic in postmaster.c is more like autovac because
we want walwriter to quit before we start the shutdown checkpoint.
against a Unix server, and Windows-specific server-side authentication
using SSPI "negotiate" method (Kerberos or NTLM).
Only builds properly with MSVC for now.
we don't know at that point which relation OID to tell pgstat to forget.
The code was passing the relfilenode, which is incorrect, and could possibly
cause some other relation's stats to be zeroed out. While we could try to
clean this up, it seems much simpler and more reliable to let the next
invocation of pgstat_vacuum_tabstat() fix things; which indeed is how it
worked before I introduced the buggy code into 8.1.3 and later :-(.
Problem noticed by Itagaki Takahiro, fix is per subsequent discussion.
checkpoint. The comment claimed that we could do this anytime after
setting the checkpoint REDO point, but actually BufferSync is relying
on the assumption that buffers dumped by other backends will be fsync'd
too. So we really could not do it any sooner than we are doing it.
so that it responds to SIGQUIT reasonably promptly even on machines where
SA_RESTART signals restart a sleep from scratch. (This whole area could
stand some rethinking, but for now make it work like the other processes
do.) Also some marginal stylistic cleanups.
for it to die before telling the bgwriter to initiate shutdown checkpoint.
Since it's connected to shared memory, this seems more prudent than the
alternative of letting it quit asynchronously. Resolves my complaint
of yesterday about repeated shutdown checkpoints in CVS HEAD.
memory context pointing at a context not long lived enough.
Also, create a fake PortalContext where to store the vac_context, if only
to avoid having it be a top-level memory context.
continue with the schedule. Change current uses of SIGINT to abort a worker
into SIGTERM, which keeps the old behaviour of terminating the process.
Patch from ITAGAKI Takahiro, with some editorializing of my own.
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.
Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.
Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
by having the postmaster signal it when certain failures occur. This requires
the postmaster setting a flag in shared memory, but should be as safe as the
pmsignal.c code is.
Also make sure the launcher honor's a postgresql.conf change turning it off
on SIGHUP.
reassembled in the syslogger before writing to the log file. This prevents
partial messages from being written, which mucks up log rotation, and
messages from different backends being interleaved, which causes garbled
logs. Backport as far as 8.0, where the syslogger was introduced.
Tom Lane and Andrew Dunstan
causes a division-by-zero error in the vacuum code. This can happen when there
are more workers than cost limit units.
Per report from Galy Lee in
<200705310914.l4V9E6JA094603@wwwmaster.postgresql.org>.
value for the vacuum code. Instead, make zero signify getting the value from a
higher level configuration facility, just like -1 in the original coding. We
still document that -1 is the value that disables the feature, to avoid
confusing the user unnecessarily.
Reported by Galy Lee in <200705310914.l4V9E6JA094603@wwwmaster.postgresql.org>;
per subsequent discussion.
buffers, rather than blowing out the whole shared-buffer arena. Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer. Those flushes will now occur only once per
ring-ful. The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done. The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.
This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it. To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.
Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.
Original patch by Simon, reworked by Heikki and again by Tom.
or abort within a backend; rearrange InitPostgres processing to make it so.
Revealed by just-added Asserts along with ECPG regression tests (hm, I wonder
why the core regression tests didn't expose it?). This possibly is another
reason for missing stats updates ...
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.
Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures. And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
and TimestampDifference, to make coding clearer. I think this should also fix
the failure to start workers in platforms with low resolution timers, as
reported by Itagaki Takahiro.
messages to the stats collector. This avoids the problem that enabling
stats_row_level for autovacuum has a significant overhead for short
read-only transactions, as noted by Arjen van der Meijden. We can avoid
an extra gettimeofday call by piggybacking on the one done for WAL-logging
xact commit or abort (although that doesn't help read-only transactions,
since they don't WAL-log anything).
In my proposal for this, I noted that we could change the WAL log entries
for commit/abort to record full TimestampTz precision, instead of only
time_t as at present. That's not done in this patch, but will be committed
separately.
when a relation is opened multiple times in the same transaction. This is
particularly useful for system catalogs, which we may heap_open or index_open
many times in a transaction, and it doesn't really cost anything extra even
if the rel is touched but once. Motivated by study of an example from Greg
Stark, in which pgstat_initstats() accounted for an unreasonably large
fraction of the runtime.
processes to be running simultaneously. Also, now autovacuum processes do not
count towards the max_connections limit; they are counted separately from
regular processes, and are limited by the new GUC variable
autovacuum_max_workers.
The launcher now has intelligence to launch workers on each database every
autovacuum_naptime seconds, limited only on the max amount of worker slots
available.
Also, the global worker I/O utilization is limited by the vacuum cost-based
delay feature. Workers are "balanced" so that the total I/O consumption does
not exceed the established limit. This part of the patch was contributed by
ITAGAKI Takahiro.
Per discussion.
its table list and then rechecks pgstat before vacuuming each table to
verify that no one has vacuumed the table in the meantime.
In the current autovacuum world this only means that a worker will not
vacuum a table that a user has vacuumed manually after the worker started.
When support for multiple autovacuum workers is introduced, this will reduce
the probability of simultaneous workers on the same database doing redundant
work.
module and teach PREPARE and protocol-level prepared statements to use it.
In service of this, rearrange utility-statement processing so that parse
analysis does not assume table schemas can't change before execution for
utility statements (necessary because we don't attempt to re-acquire locks
for utility statements when reusing a stored plan). This requires some
refactoring of the ProcessUtility API, but it ends up cleaner anyway,
for instance we can get rid of the QueryContext global.
Still to do: fix up SPI and related code to use the plan cache; I'm tempted to
try to make SQL functions use it too. Also, there are at least some aspects
of system state that we want to ensure remain the same during a replan as in
the original processing; search_path certainly ought to behave that way for
instance, and perhaps there are others.
o read global SSL configuration file
o add GUC "ssl_ciphers" to control allowed ciphers
o add libpq environment variable PGSSLKEY to control SSL hardware keys
Victor B. Wagner
continuously, and requests vacuum runs of "autovacuum workers" to postmaster.
The workers do the actual vacuum work. This allows for future improvements,
like allowing multiple autovacuum jobs running in parallel.
For now, the code keeps the original behavior of having a single autovac
process at any time by sleeping until the previous worker has finished.
socket is still read-ready, the code was a tight loop, wasting lots of CPU.
We can't do anything to clear the failure, other than wait, but we should give
other processes more chance to finish and release FDs; so insert a small sleep.
Also, avoid bogus "close(-1)" in this case. Per report from Jim Nasby.
entries for the victim database go away sooner rather than later. We already
did the equivalent thing at the per-relation level, not sure why it's not
been done for whole databases. With this change, pgstat_vacuum_tabstat
should usually not find anything to do; though we still need it as a backstop
in case DROPDB or TABPURGE messages get lost under load.
already collected in the current transaction; this allows plpgsql functions to
watch for stats updates even though they are confined to a single transaction.
Use this instead of the previous kluge involving pg_stat_file() to wait for
the stats collector to update in the stats regression test. Internally,
decouple storage of stats snapshots from transaction boundaries; they'll
now stick around until someone calls pgstat_clear_snapshot --- which xact.c
still does at transaction end, to maintain the previous behavior. This makes
the logic a lot cleaner, at the price of a couple dozen cycles per transaction
exit.
"database system is ready to accept connections", which is issued by the
postmaster when it really is ready to accept connections. Per proposal from
Markus Schiltknecht and subsequent discussion.
input in the stats collector. Our select() emulation is apparently buggy
for UDP sockets :-(. This should resolve problems with stats collection
(and hence autovacuum) failing under more than minimal load. Diagnosis
and patch by Magnus Hagander.
Patch probably needs to be back-ported to 8.1 and 8.0, but first let's
see if it makes the buildfarm happy...