Move common code, that was duplicated in every postmaster child/every
standalone process, into two functions in miscinit.c. Not only does
that already result in a fair amount of net code reduction but it also
makes it much easier to remove more duplication in the future. The
prime motivation wasn't code deduplication though, but easier addition
of new common code.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
st_changecount protocol needs the memory barriers to ensure that
the apparent order of execution is as it desires. Otherwise,
for example, the CPU might rearrange the code so that st_changecount
is incremented twice before the modification on a machine with
weak memory ordering. This surprising result can lead to bugs.
This commit introduces the macros to load and store st_changecount
with the memory barriers. These are called before and after
PgBackendStatus entries are modified or copied into private memory,
in order to prevent CPU from reordering PgBackendStatus access.
Per discussion on pgsql-hackers, we decided not to back-patch this
to 9.4 or before until we get an actual bug report about this.
Patch by me. Review by Robert Haas.
In passing, also make some debugging elog's in pgstat.c a bit more
consistently worded.
Back-patch as far as applicable (9.3 or 9.4; none of these mistakes are
really old).
Mark Dilger identified and patched the type violations; the message
rewordings are mine.
If a connection committed or rolled back any transactions within a
PGSTAT_STAT_INTERVAL pacing interval without accessing any tables,
the reporting of those statistics would be held up until the
connection closed or until it ended a PGSTAT_STAT_INTERVAL interval
in which it had accessed a table. This could result in under-
reporting of transactions for an extended period, followed by a
spike in reported transactions.
While this is arguably a bug, the impact is minimal, primarily
affecting, and being affected by, monitoring software. It might
cause more confusion than benefit to change the existing behavior
in released stable branches, so apply only to master and the 9.4
beta.
Gurjeet Singh, with review and editing by Kevin Grittner,
incorporating suggested changes from Abhijit Menon-Sen and Tom
Lane.
187492b6c2 changed pgstat.c so that
the stats files were saved into $PGDATA/pg_stat directory when the server
was shutdowned. But it accidentally forgot to change the location of
pg_stat_statements permanent stats file. This commit fixes pg_stat_statements
so that its stats file is also saved into $PGDATA/pg_stat at shutdown.
Since this fix changes the file layout, we don't back-patch it to 9.3
where this oversight was introduced.
According to the Single Unix Spec and assorted man pages, you're supposed
to use the constants named AF_xxx when setting ai_family for a getaddrinfo
call. In a few places we were using PF_xxx instead. Use of PF_xxx
appears to be an ancient BSD convention that was not adopted by later
standardization. On BSD and most later Unixen, it doesn't matter much
because those constants have equivalent values anyway; but nonetheless
this code is not per spec.
In the same vein, replace PF_INET by AF_INET in one socket() call, which
wasn't even consistent with the other socket() call in the same function
let alone the remainder of our code.
Per investigation of a Cygwin trouble report from Marco Atzeri. It's
probably a long shot that this will fix his issue, but it's wrong in
any case.
Initialization of this field was not being done according to the
st_changecount protocol (it has to be done within the changecount increment
range, not outside). And the test to see if the value should be reported
as null was wrong. Noted while perusing uses of Port.remote_hostname.
This was wrong from the introduction of this code (commit 4a25bc145),
so back-patch to 9.1.
The new function dsm_detach_all() can be used either by postmaster
children that don't wish to take any risk of accidentally corrupting
shared memory; or by forked children of regular backends with
the same need. This patch also updates the postmaster children that
already do PGSharedMemoryDetach() to do dsm_detach_all() as well.
Per discussion with Tom Lane.
We were unlinking the permanent file, not the non-permanent one. But
since the stat collector already unlinks all permanent files on startup,
there was nothing for it to unlink. The non-permanent file remained in
place, and was copied to the permanent directory on shutdown, so in
effect no file was ever dropped.
Backpatch to 9.3, where the issue was introduced by commit 187492b6c2.
Before that, there were no per-database files and thus no file to drop
on DROP DATABASE.
Per report from Thom Brown.
Author: Tomáš Vondra
Historically, VACUUM has just reported its new_rel_tuples estimate
(the same thing it puts into pg_class.reltuples) to the stats collector.
That number counts both live and dead-but-not-yet-reclaimable tuples.
This behavior may once have been right, but modern versions of the
pgstats code track live and dead tuple counts separately, so putting
the total into n_live_tuples and zero into n_dead_tuples is surely
pretty bogus. Fix it to report live and dead tuple counts separately.
This doesn't really do much for situations where updating transactions
commit concurrently with a VACUUM scan (possibly causing double-counting or
omission of the tuples they add or delete); but it's clearly an improvement
over what we were doing before.
Hari Babu, reviewed by Amit Kapila
The PGSTAT_NUM_TABENTRIES macro should have been updated when new fields
were added to struct PgStat_MsgTabstat in commit 644828908, but it wasn't.
Fix that.
Also, add a static assertion that we didn't overrun the intended size limit
on stats messages. This will not necessarily catch every mistake in
computing the maximum array size for stats messages, but it will catch ones
that have practical consequences. (The assertion in fact doesn't complain
about the aforementioned error in PGSTAT_NUM_TABENTRIES, because that was
not big enough to cause the array length to increase.)
No back-patch, as there's no actual bug in existing releases; this is just
in the nature of future-proofing.
Mark Dilger and Tom Lane
Instead of deleting all files in stats_temp_directory and the permanent
directory on a crash, only remove those files that match the pattern of
files we actually write in them, to avoid possibly clobbering existing
unrelated contents of the temporary directory. Per complaint from Jeff
Janes, and subsequent discussion, starting at message
CAMkU=1z9+7RsDODnT4=cDFBRBp8wYQbd_qsLcMtKEf-oFwuOdQ@mail.gmail.com
Also, fix a bug in the same routine to avoid removing files from the
permanent directory twice (instead of once from that directory and then
from the temporary directory), also per report from Jeff Janes, in
message
CAMkU=1wbk947=-pAosDMX5VC+sQw9W4ttq6RM9rXu=MjNeEQKA@mail.gmail.com
Previously one had to use slist_delete(), implying an additional scan of
the list, making this infrastructure considerably less efficient than
traditional Lists when deletion of element(s) in a long list is needed.
Modify the slist_foreach_modify() macro to support deleting the current
element in O(1) time, by keeping a "prev" pointer in addition to "cur"
and "next". Although this makes iteration with this macro a bit slower,
no real harm is done, since in any scenario where you're not going to
delete the current list element you might as well just use slist_foreach
instead. Improve the comments about when to use each macro.
Back-patch to 9.3 so that we'll have consistent semantics in all branches
that provide ilist.h. Note this is an ABI break for callers of
slist_foreach_modify().
Andres Freund and Tom Lane
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
The point of turning off track_activities is to avoid this reporting
overhead, but a thinko in commit 4f42b546fd
caused pgstat_report_activity() to perform half of its updates anyway.
Fix that, and also make sure that we clear all the now-disabled fields
when transitioning to the non-reporting state.
A materialized view has a rule just like a view and a heap and
other physical properties like a table. The rule is only used to
populate the table, references in queries refer to the
materialized data.
This is a minimal implementation, but should still be useful in
many cases. Currently data is only populated "on demand" by the
CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW statements.
It is expected that future releases will add incremental updates
with various timings, and that a more refined concept of defining
what is "fresh" data will be developed. At some point it may even
be possible to have queries use a materialized in place of
references to underlying tables, but that requires the other
above-mentioned features to be working first.
Much of the documentation work by Robert Haas.
Review by Noah Misch, Thom Brown, Robert Haas, Marko Tiikkaja
Security review by KaiGai Kohei, with a decision on how best to
implement sepgsql still pending.
We now write one file per database and one global file, instead of
having the whole thing in a single huge file. This reduces the I/O that
must be done when partial data is required -- which is all the time,
because each process only needs information on its own database anyway.
Also, the autovacuum launcher does not need data about tables and
functions in each database; having the global stats for all DBs is
enough.
Catalog version bumped because we have a new subdir under PGDATA.
Author: Tomas Vondra. Some rework by Álvaro
Testing by Jeff Janes
Other discussion by Heikki Linnakangas, Tom Lane.
Normally, we suppress sending a tabstats message to the collector unless
there were some actual table stats to send. However, during backend exit
we should force out the message if there are any transaction commit/abort
counts to send, else the session's last few commit/abort counts will never
get reported at all. We had logic for this, but the short-circuit test
at the top of pgstat_report_stat() ignored the "force" flag, with the
consequence that session-ending transactions that touched no database-local
tables would not get counted. Seems to be an oversight in my commit
641912b4d1, which added the "force" flag.
That was back in 8.3, so back-patch to all supported versions.
In the previous coding, new backend processes would attempt to create their
self-pipe during the OwnLatch call in InitProcess. However, pipe creation
could fail if the kernel is short of resources; and the system does not
recover gracefully from a FATAL error right there, since we have armed the
dead-man switch for this process and not yet set up the on_shmem_exit
callback that would disarm it. The postmaster then forces an unnecessary
database-wide crash and restart, as reported by Sean Chittenden.
There are various ways we could rearrange the code to fix this, but the
simplest and sanest seems to be to split out creation of the self-pipe into
a new function InitializeLatchSupport, which must be called from a place
where failure is allowed. For most processes that gets called in
InitProcess or InitAuxiliaryProcess, but processes that don't call either
but still use latches need their own calls.
Back-patch to 9.1, which has only a part of the latch logic that 9.2 and
HEAD have, but nonetheless includes this bug.
This reduces unnecessary exposure of other headers through htup.h, which
is very widely included by many files.
I have chosen to move the function prototypes to the new file as well,
because that means htup.h no longer needs to include tupdesc.h. In
itself this doesn't have much effect in indirect inclusion of tupdesc.h
throughout the tree, because it's also required by execnodes.h; but it's
something to explore in the future, and it seemed best to do the htup.h
change now while I'm busy with it.
There was a wild mix of calling conventions: Some were declared to
return void and didn't return, some returned an int exit code, some
claimed to return an exit code, which the callers checked, but
actually never returned, and so on.
Now all of these functions are declared to return void and decorated
with attribute noreturn and don't return. That's easiest, and most
code already worked that way.
Formerly, if the system clock went backwards, the stats collector would
fail to update the stats file any more until the clock reading again
exceeds whatever timestamp was last written into the stats file. Such
glitches in the clock's behavior are not terribly unlikely on machines
not using NTP. Such a scenario has been observed to cause regression test
failures in the buildfarm, and it could have bad effects on the behavior
of autovacuum, so it seems prudent to install some defenses.
We could directly detect the clock going backwards by adding
GetCurrentTimestamp calls in the stats collector's main loop, but that
would hurt performance on platforms where GetCurrentTimestamp is expensive.
To minimize the performance hit in normal cases, adopt a more complicated
scheme wherein backends check for clock skew when reading the stats file,
and if they see it, signal the stats collector by sending an extra stats
inquiry message. The stats collector does an extra GetCurrentTimestamp
only when it receives an inquiry with an apparently out-of-order
timestamp.
To avoid unnecessary GetCurrentTimestamp calls, expand the inquiry messages
to carry the backend's current clock reading as well as its stats cutoff
time. The latter, being intentionally slightly in-the-past, would trigger
more clock rechecks than we need if it were used for this purpose.
We might want to backpatch this change at some point, but let's let it
shake out in the buildfarm for awhile first.
We have no need for a timeout here really, but some broken products from
Redmond seem to lose FD_READ events occasionally, and waking up and
retrying the recv() is the only known way to work around that. Perhaps
somebody will be motivated to figure out a better answer here; but not I.
Test results from buildfarm members mastodon/narwhal (Windows Server 2003)
make it look like that platform just plain loses FD_READ events
occasionally, and the only reason our previous coding seemed to work was
that it timed out every couple of seconds and retried the whole operation.
Try to verify this by reinserting a finite timeout into the pgstat loop.
This isn't meant to be a permanent patch either, just to confirm or
disprove a theory.
This should get rid of the usage of pgwin32_waitforsinglesocket entirely,
and perhaps thereby remove the race condition that's evidently still
present on some versions of Windows. The previous arrangement was a bit
unsafe anyway, since waiting at the recv() would not allow pgstat to notice
postmaster death.
Log main-loop blocking events and the results of inquiry messages.
This is to get some clarity as to what's happening on those Windows
buildfarm members that still don't like the latch-ified stats collector.
This bulks up the postmaster log a tad, so I won't leave it in place for
long.
This reverts commit cb2f2873d6, restoring
the latch-ified stats collector logic. We'll soon see if this works any
better on the Windows buildfarm machines.
This patch reverts commit 49340037ee and some
follow-on tweaking in pgstat.c. While the basic scheme of latch-ifying the
stats collector seems sound enough, it's failing on most Windows buildfarm
members for unknown reasons, and there's no time left to debug that before
9.2beta1. Better to ship a beta version without this improvement. I hope
to re-revert this once beta1 is out, though.
Per a suggestion from Peter Geoghegan, make WaitLatch responsible for
verifying that the WL_POSTMASTER_DEATH bit it returns is truthful (by
testing PostmasterIsAlive). Then simplify its callers, who no longer
need to do that for themselves. Remove weasel wording about falsely-set
result bits from WaitLatch's API contract.
In checkpointer and walwriter, avoid calling PostmasterIsAlive unless
WaitLatch has reported WL_POSTMASTER_DEATH. This saves a kernel call per
iteration of the process's outer loop, which is not all that much, but a
cycle shaved is a cycle earned. I had already removed the unconditional
PostmasterIsAlive calls in bgwriter and pgstat in previous patches, but
forgot that WL_POSTMASTER_DEATH is supposed to be treated as untrustworthy
(per comment in unix_latch.c); so adjust those two cases to match.
There are a few other places where the same idea might be applied, but only
after substantial code rearrangement, so I didn't bother.
Latch-ify the stats collector, so that it does not need an arbitrary wakeup
cycle to check for postmaster death. The incremental savings in idle power
is pretty marginal, since we only had it waking every two seconds; but I
believe that this patch may also improve the collector's performance under
load, by reducing the number of kernel calls made per message when messages
are arriving constantly (we now avoid a select/poll call except when we
need to sleep). The change also reduces the time needed for a normal
database shutdown on platforms where signals don't interrupt select().
This patch adjusts the core statistics views to match the decision already
taken for pg_stat_statements, that values representing elapsed time should
be represented as float8 and measured in milliseconds. By using float8,
we are no longer tied to a specific maximum precision of timing data.
(Internally, it's still microseconds, but we could now change that without
needing changes at the SQL level.)
The columns affected are
pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time
pg_stat_xact_user_functions.total_time
pg_stat_xact_user_functions.self_time
The first four of these are new in 9.2, so there is no compatibility issue
from changing them. The others require a release note comment that they
are now double precision (and can show a fractional part) rather than
bigint as before; also their underlying statistics functions now match
the column definitions, instead of returning bigint microseconds.
Ants Aasma's original patch to add timing information for buffer I/O
requests exposed this data at the relation level, which was judged too
costly. I've here exposed it at the database level instead.