Various places were supposing that errno could be expected to hold still
within an ereport() nest or similar contexts. This isn't true necessarily,
though in some cases it accidentally failed to fail depending on how the
compiler chanced to order the subexpressions. This class of thinko
explains recent reports of odd failures on clang-built versions, typically
missing or inappropriate HINT fields in messages.
Problem identified by Christian Kruse, who also submitted the patch this
commit is based on. (I fixed a few issues in his patch and found a couple
of additional places with the same disease.)
Back-patch as appropriate to all supported branches.
We calculated the rounded-up size for the allocation, but then failed to
use the rounded-up value in the mmap() call. Oops.
Also, initialize allocsize, to silence warnings seen with some compilers,
as pointed out by Jeff Janes.
This patch adds an option, huge_tlb_pages, which allows requesting the
shared memory segment to be allocated using huge pages, by using the
MAP_HUGETLB flag in mmap(). This can improve performance.
The default is 'try', which means that we will attempt using huge pages,
and fall back to non-huge pages if it doesn't work. Currently, only Linux
has MAP_HUGETLB. On other platforms, the default 'try' behaves the same as
'off'.
In the passing, don't try to round the mmap() size to a multiple of
pagesize. mmap() doesn't require that, and there's no particular reason for
PostgreSQL to do that either. When using MAP_HUGETLB, however, round the
request size up to nearest 2MB boundary. This is to work around a bug in
some Linux kernel versions, but also to avoid wasting memory, because the
kernel will round the size up anyway.
Many people were involved in writing this patch, including Christian Kruse,
Richard Poole, Abhijit Menon-Sen, reviewed by Peter Geoghegan, Andres Freund
and me.
Since C99, it's been standard for printf and friends to accept a "z" size
modifier, meaning "whatever size size_t has". Up to now we've generally
dealt with printing size_t values by explicitly casting them to unsigned
long and using the "l" modifier; but this is really the wrong thing on
platforms where pointers are wider than longs (such as Win64). So let's
start using "z" instead. To ensure we can do that on all platforms, teach
src/port/snprintf.c to understand "z", and add a configure test to force
use of that implementation when the platform's version doesn't handle "z".
Having done that, modify a bunch of places that were using the
unsigned-long hack to use "z" instead. This patch doesn't pretend to have
gotten everyplace that could benefit, but it catches many of them. I made
an effort in particular to ensure that all uses of the same error message
text were updated together, so as not to increase the number of
translatable strings.
It's possible that this change will result in format-string warnings from
pre-C99 compilers. We might have to reconsider if there are any popular
compilers that will warn about this; but let's start by seeing what the
buildfarm thinks.
Andres Freund, with a little additional work by me
Development of IRIX has been discontinued, and support is scheduled
to end in December of 2013. Therefore, there will be no supported
versions of this operating system by the time PostgreSQL 9.4 is
released. Furthermore, we have no maintainer for this platform.
The exclusion of SIGALRM dates back to Berkeley days, when Postgres used
SIGALRM in only one very short stretch of code. Nowadays, allowing it to
interrupt kernel calls doesn't seem like a very good idea, since its use
for statement_timeout means SIGALRM could occur anyplace in the code, and
there are far too many call sites where we aren't prepared to deal with
EINTR failures. When third-party code is taken into consideration, it
seems impossible that we ever could be fully EINTR-proof, so better to
use SA_RESTART always and deal with the implications of that. One such
implication is that we should not assume pg_usleep() will be terminated
early by a signal. Therefore, long sleeps should probably be replaced
by WaitLatch operations where practical.
Back-patch to 9.3 so we can get some beta testing on this change.
We had two copies of this function in the backend and libpq, which was
already pretty bogus, but it turns out that we need it in some other
programs that don't use libpq (such as pg_test_fsync). So put it where
it probably should have been all along. The signal-mask-initialization
support in src/backend/libpq/pqsignal.c stays where it is, though, since
we only need that in the backend.
The behavior with larger values is unspecified by the Single Unix Spec.
It appears that BSD-derived kernels report EINVAL, although Linux does not.
If waiting for longer intervals is desired, the calling code has to do
something to limit the delay; we can't portably fix it here since "long"
may not be any wider than "int" in the first place.
Part of response to bug #7670, though this change doesn't fix that
(in fact, it converts the problem from an ERROR into an Assert failure).
No back-patch since it's just an assertion addition.
If the sleep is interrupted by a signal, we must recompute the remaining
time to wait; otherwise, a steady stream of non-wait-terminating interrupts
could delay return from WaitLatch indefinitely. This has been shown to be
a problem for the autovacuum launcher, and there may well be other places
now or in the future with similar issues. So we'd better make the function
robust, even though this'll add at least one gettimeofday call per wait.
Back-patch to 9.2. We might eventually need to fix 9.1 as well, but the
code is quite different there, and the usage of WaitLatch in 9.1 is so
limited that it's not clearly important to do so.
Reported and diagnosed by Jeff Janes, though I rewrote his patch rather
heavily.
In the previous coding, new backend processes would attempt to create their
self-pipe during the OwnLatch call in InitProcess. However, pipe creation
could fail if the kernel is short of resources; and the system does not
recover gracefully from a FATAL error right there, since we have armed the
dead-man switch for this process and not yet set up the on_shmem_exit
callback that would disarm it. The postmaster then forces an unnecessary
database-wide crash and restart, as reported by Sean Chittenden.
There are various ways we could rearrange the code to fix this, but the
simplest and sanest seems to be to split out creation of the self-pipe into
a new function InitializeLatchSupport, which must be called from a place
where failure is allowed. For most processes that gets called in
InitProcess or InitAuxiliaryProcess, but processes that don't call either
but still use latches need their own calls.
Back-patch to 9.1, which has only a part of the latch logic that 9.2 and
HEAD have, but nonetheless includes this bug.
Per testing by Andres Freund, this improves replication performance
and reduces replication latency and latency jitter. I was a bit
concerned about moving more work into XLogInsert, but testing seems
to show that it's not a problem in practice.
Along the way, improve comments for WaitLatchOrSocket.
Andres Freund. Review and stylistic cleanup by me.
The original coding had it as "PGShmemHeader *", but that doesn't offer any
notational benefit because we don't dereference it. And it was resulting
in compiler warnings on some platforms, notably buildfarm member
castoroides, where mmap() and munmap() are evidently declared to take and
return "char *".
On old HPUX this has to be #defined to -1. It might be that other values
are required on other dinosaur systems, but we'll worry about that when
and if we get reports.
Except when compiling with EXEC_BACKEND, we'll now allocate only a tiny
amount of System V shared memory (as an interlock to protect the data
directory) and allocate the rest as anonymous shared memory via mmap.
This will hopefully spare most users the hassle of adjusting operating
system parameters before being able to start PostgreSQL with a
reasonable value for shared_buffers.
There are a bunch of documentation updates needed here, and we might
need to adjust some of the HINT messages related to shared memory as
well. But it's not 100% clear how portable this is, so before we
write the documentation, let's give it a spin on the buildfarm and
see what turns red.
Since we have chosen to report socket EOF and error conditions via the
WL_SOCKET_READABLE flag bit, it's unsafe to wait only for
WL_SOCKET_WRITEABLE; the caller would never be notified of the socket
condition, and in some of these implementations WaitLatchOrSocket would
busy-wait until something else happens. Add this restriction to the API
specification, and add Asserts to check that callers don't try to do that.
At some point we might want to consider adjusting the API to relax this
restriction, but until we have an actual use case for waiting on a
write-only socket, it seems premature to design a solution.
Make sure WaitLatchOrSocket regards FD_CLOSE as a read-ready condition.
We might want to tweak this further, but it was surely wrong as-is.
Make pgwin32_waitforsinglesocket detach its private event object from the
passed socket before returning. I suspect that failure to do so leads
to race conditions when other code (such as WaitLatchOrSocket) attaches
a different event object to the same socket. Moreover, the existing
coding meant that repeated calls to pgwin32_waitforsinglesocket would
perform ResetEvent on an event actively connected to a socket, which
is rumored to be an unsafe practice; the WSAEventSelect documentation
appears to recommend against this, though it does not say not to do it
in so many words.
Also, uniformly use the coding pattern "WSAEventSelect(s, NULL, 0)" to
detach events from sockets, rather than passing the event in the second
parameter. The WSAEventSelect documentation says that the second parameter
is ignored if the third is 0, so theoretically this should make no
difference. However, elsewhere on the same reference page the use of NULL
in this context is recommended, and I have found suggestions on the net
that some versions of Windows have bugs with a non-NULL second parameter
in this usage.
Some other mostly-cosmetic cleanup, such as using the right one of
WSAGetLastError and GetLastError for reporting errors from these functions.
When using poll(), EOF on a socket is reported with the POLLHUP not
POLLIN flag (at least on Linux). WaitLatchOrSocket failed to check
this bit, causing it to go into a busy-wait loop if EOF occurs.
We earlier fixed the same mistake in the test for the state of the
postmaster_alive socket, but missed it for the caller-supplied socket.
Fortunately, this error is new in 9.2, since 9.1 only had a select()
based code path not a poll() based one.
Per a suggestion from Peter Geoghegan, make WaitLatch responsible for
verifying that the WL_POSTMASTER_DEATH bit it returns is truthful (by
testing PostmasterIsAlive). Then simplify its callers, who no longer
need to do that for themselves. Remove weasel wording about falsely-set
result bits from WaitLatch's API contract.
The original coding failed to reset ImmediateInterruptOK before returning,
which would potentially allow a subsequent query-cancel interrupt to be
accepted at an unsafe point. This is a really nasty bug since it's so hard
to predict the consequences, but they could be unpleasant.
Also, ensure that signal handlers are serviced before this function
returns, even if the semaphore is already set. This should make the
behavior more like Unix.
Back-patch to all supported versions.
Ensure that signal handlers are serviced before this function returns.
This should make the behavior more like Unix. Also, add some more
error checking, and make some other cosmetic improvements.
No back-patch since it's not clear whether this is fixing any live bug
that would affect 9.1. I'm more concerned about 9.2 anyway given our
considerable recent expansions in the usage of WaitLatch.
Remove the following ports:
- dgux
- nextstep
- sunos4
- svr4
- ultrix4
- univel
These are obsolete and not worth rescuing. In most cases, there is
circumstantial evidence that they wouldn't work anymore anyway.
When the remote end of the pipe is closed, select() reports the fd as
readable, but poll() has a separate POLLHUP return code for that.
Spotted by Peter Geoghegan.
poll() is preferred over select() on platforms where both are available,
because it tends to be a bit faster and it doesn't have an arbitrary limit
on the range of FD numbers that can be accessed. The FD range limit does
not appear to be a risk factor for any 9.1 usages, so this doesn't need to
be back-patched, but we need to have it in place if we keep on expanding
the uses of WaitLatch.
In pursuit of this (and with the expectation that WaitLatch will be needed
in more places), convert the latch field that was already added to PGPROC
for sync rep into a generic latch that is activated for all PGPROC-owning
processes, and change many of the standard backend signal handlers to set
that latch when a signal happens. This will allow WaitLatch callers to be
wakened properly by these signals.
In passing, fix a whole bunch of signal handlers that had been hacked to do
things that might change errno, without adding the necessary save/restore
logic for errno. Also make some minor fixes in unix_latch.c, and clean
up bizarre and unsafe scheme for disowning the process's latch. Much of
this has to be back-patched into 9.1.
Peter Geoghegan, with additional work by Tom
The original definition had the problem that timeouts exceeding about 2100
seconds couldn't be specified on 32-bit machines. Milliseconds seem like
sufficient resolution, and finer grain than that would be fantasy anyway
on many platforms.
Back-patch to 9.1 so that this aspect of the latch API won't change between
9.1 and later releases.
Peter Geoghegan
Improve the documentation around weak-memory-ordering risks, and do a pass
of general editorialization on the comments in the latch code. Make the
Windows latch code more like the Unix latch code where feasible; in
particular provide the same Assert checks in both implementations.
Fix poorly-placed WaitLatch call in syncrep.c.
This patch resolves, for the moment, concerns around weak-memory-ordering
bugs in latch-related code: we have documented the restrictions and checked
that existing calls meet them. In 9.2 I hope that we will install suitable
memory barrier instructions in SetLatch/ResetLatch, so that their callers
don't need to be quite so careful.
detect postmaster death. Postmaster keeps the write-end of the pipe open,
so when it dies, children get EOF in the read-end. That can conveniently
be waited for in select(), which allows eliminating some of the polling
loops that check for postmaster death. This patch doesn't yet change all
the loops to use the new mechanism, expect a follow-on patch to do that.
This changes the interface to WaitLatch, so that it takes as argument a
bitmask of events that it waits for. Possible events are latch set, timeout,
postmaster death, and socket becoming readable or writeable.
The pipe method behaves slightly differently from the kill() method
previously used in PostmasterIsAlive() in the case that postmaster has died,
but its parent has not yet read its exit code with waitpid(). The pipe
returns EOF as soon as the process dies, but kill() continues to return
true until waitpid() has been called (IOW while the process is a zombie).
Because of that, change PostmasterIsAlive() to use the pipe too, otherwise
WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while
PostmasterIsAlive() would claim it's still alive. That could easily lead to
busy-waiting while postmaster is in zombie state.
Peter Geoghegan with further changes by me, reviewed by Fujii Masao and
Florian Pflug.
For consistency, have all non-ASCII characters from contributors'
names in the source be in UTF-8. But remove some other more
gratuitous uses of non-ASCII characters.
Also, we removed the display of the current value of
max_connections/MaxBackends from some messages earlier, because it was
confusing, so do that in the remaining one as well.
Most of these cast DWORD to int or unsigned int for printf type handling.
This is safe even on 64 bit architectures because a DWORD is always 32 bits.
In one case a variable is initialised to keep the compiler happy.
Remove the hard-wired assumption that __mips__ (and only __mips__) lacks
dlopen in FreeBSD and OpenBSD. This assumption is outdated at least for
OpenBSD, as per report from an anonymous 9.1 tester. We can perfectly well
use HAVE_DLOPEN instead to decide which code to use.
Some other cosmetic adjustments to make freebsd.c, netbsd.c, and openbsd.c
exactly alike.
than replication_timeout (a new GUC) milliseconds. The TCP timeout is often
too long, you want the master to notice a dead connection much sooner.
People complained about that in 9.0 too, but with synchronous replication
it's even more important to notice dead connections promptly.
Fujii Masao and Heikki Linnakangas
Fix broken test for pre-existing postmaster, caused by wrong code for
appending lines to the lockfile; don't write a failed listen_address
setting into the lockfile; don't arbitrarily change the location of the
data directory in the lockfile compared to previous releases; provide more
consistent and useful definitions of the socket path and listen_address
entries; avoid assuming that pg_ctl has the same DEFAULT_PGSOCKET_DIR as
the postmaster; assorted code style improvements.
Since we're not multithreaded it only provides marginally useful
information, and it does require a newer version of the Platform SDK
than we target. We may want to reconsider this in the future along
with a fix for MinGW.
Add support for collecting "minidump" style crash dumps on
Windows, by setting up an exception handling filter. Crash
dumps will be generated in PGDATA/crashdumps if the directory
is created (the existance of the directory is used as on/off
switch for the generation of the dumps).
Craig Ringer and Magnus Hagander
dynamic pool of event handles, we can permanently assign one for each
shared latch. Thanks to that, we no longer need a separate shared memory
block for latches, and we don't need to know in advance how many shared
latches there is, so you no longer need to remember to update
NumSharedLatches when you introduce a new latch to the system.
wait until it is set. Latches can be used to reliably wait until a signal
arrives, which is hard otherwise because signals don't interrupt select()
on some platforms, and even when they do, there's race conditions.
On Unix, latches use the so called self-pipe trick under the covers to
implement the sleep until the latch is set, without race conditions. On
Windows, Windows events are used.
Use the new latch abstraction to sleep in walsender, so that as soon as
a transaction finishes, walsender is woken up to immediately send the WAL
to the standby. This reduces the latency between master and standby, which
is good.
Preliminary work by Fujii Masao. The latch implementation is by me, with
helpful comments from many people.
It turns out that some platforms return ENOMEM for a request that violates
SHMALL, whereas we were assuming that ENOSPC would always be used for that.
Apparently the latter is a Linuxism while ENOMEM is the BSD tradition.
Extend the ENOMEM hint to suggest that raising SHMALL might be needed.
Per gripe from A.M.
Backpatch to 9.0, but not further, because this doesn't seem important
enough to warrant creating extra translation work in the stable branches.
(If it were, we'd have figured this out years ago.)
linking both executables and shared libraries, and we add on LDFLAGS_EX when
linking executables or LDFLAGS_SL when linking shared libraries. This
provides a significantly cleaner way of dealing with link-time switches than
the former behavior. Also, make sure that the various platform-specific
%.so: %.o rules incorporate LDFLAGS and LDFLAGS_SL; most of them missed that
before. (I did not add these variables for the platforms that invoke $(LD)
directly, however. It's not clear if we can do that safely, since for the
most part we assume these variables use CC command-line syntax.)
Per gripe from Aaron Swenson and subsequent investigation.
returns EINVAL for an existing shared memory segment. Although it's not
terribly sensible, that behavior does meet the POSIX spec because EINVAL
is the appropriate error code when the existing segment is smaller than the
requested size, and the spec explicitly disclaims any particular ordering of
error checks. Moreover, it does in fact happen on OS X and probably other
BSD-derived kernels. (We were able to talk NetBSD into changing their code,
but purging that behavior from the wild completely seems unlikely to happen.)
We need to distinguish collision with a pre-existing segment from invalid size
request in order to behave sensibly, so it's worth some extra code here to get
it right. Per report from Gavin Kistner and subsequent investigation.
Back-patch to all supported versions, since any of them could get used
with a kernel having the debatable behavior.
and use this in pq_getbyte_if_available.
It's only a limited implementation which swithes the whole emulation layer
no non-blocking mode, but that's enough as long as non-blocking is only
used during a short period of time, and only one socket is accessed during
this time.
There was a race condition where the receiving pipe could be closed by the
child thread if the main thread was pre-empted before it got a chance to
create a new one, and the dispatch thread ran to completion during that time.
One symptom of this is that rows in pg_listener could be dropped under
heavy load.
Analysis and original patch by Radu Ilie, with some small
modifications by Magnus Hagander.
This is more in keeping with modern practice, and is a first step towards
porting to Win64 (which has sizeof(pointer) > sizeof(long)).
Tsutomu Yamada, Magnus Hagander, Tom Lane
that memory allocated by starting third party DLLs doesn't end up
conflicting with it.
Hopefully this solves the long-time issue with "could not reattach
to shared memory" errors on Win32.
Patch from Tsutomu Yamada and me, based on idea from Trevor Talbot.
it fails because the shared memory segment already exists. This
means it can take up to 10 seconds before it reports the error
if it *does* exist, but hopefully it will make the system capable
of restarting even when the server is under high load.
to make sure that the error code is reset, as a precaution in
case the API doesn't properly reset it on success. This could
be necessary, since we check the error value even if the function
doesn't fail for specific success cases.
using the system functions all the time. (These files are now just copies
of the osf.* files.) The homebrew functions were not getting used anyway
on AIX versions that have dlopen(), that is 4.3 and up, so they are not
needed on any AIX that is even remotely supported by the vendor anymore.
We'd have probably left them here anyway, except some questions were
raised about the copyright.
in the Global\ namespace, because it caused permission errors on
a lot of platforms.
We need to come up with something better for 8.4, but for now
revert to the pre-8.3.4 behaviour.