The true explanation for Peter Eisentraut's report of inexact asind results
seems to be that (a) he's compiling into x87 instruction set, which uses
wider-than-double float registers, plus (b) the library function asin() on
his platform returns a result that is wider than double and is not rounded
to double width. To fix, we have to force the function's result to be
rounded comparably to what happened to the scaling constant asin_0_5.
Experimentation suggests that storing it into a volatile local variable is
the least ugly way of making that happen. Although only asin() is known to
exhibit an observable inexact result, we'd better do this in all the places
where we're hoping to get an exact result by scaling.
Change max_parallel_degree default from 0 to 2. It is possible that
this is not a good idea, or that we should go with 1 worker rather
than 2, but we won't find out without trying it. Along the way,
reword the documentation for max_parallel_degree a little bit to
hopefully make it more clear.
Discussion: 20160420174631.3qjjhpwsvvx5bau5@alap3.anarazel.de
Commit 65abaab547 tried to prevent the scaling constants used in
the degree-based trig functions from being precomputed at compile time,
because some compilers do that with functions that don't yield results
identical-to-the-last-bit to what you get at runtime. A report from
Peter Eisentraut suggests that some recent compilers are smart enough
to see through that trick, though. Instead, let's put the inputs to
these calculations into non-const global variables, which should be a
more reliable way of convincing the compiler that it can't assume that
they are compile-time constants. (If we really get desperate, we could
mark these variables "volatile", but I do not believe we should have to.)
Several issues:
1) checkpoint_flush_after doc and code disagreed about the default
2) new GUCs were missing from postgresql.conf.sample
3) Outdated source-code comment about bgwriter_flush_after's default
4) Sub-optimal categories assigned to new GUCs
5) Docs suggested backend_flush_after is PGC_SIGHUP, but it's PGC_USERSET.
6) Spell out int as integer in the docs, as done elsewhere
Reported-By: Magnus Hagander, Fujii Masao
Discussion: CAHGQGwETyTG5VYQQ5C_srwxWX7RXvFcD3dKROhvAWWhoSBdmZw@mail.gmail.com
NetBSD has seen fit to invent a libc function named strtoi(), which
conflicts with the long-established static functions of the same name in
datetime.c and ecpg's interval.c. While muttering darkly about intrusions
on application namespace, we'll rename our functions to avoid the conflict.
Back-patch to all supported branches, since this would affect attempts
to build any of them on recent NetBSD.
Thomas Munro
The implementation of that feature involves injecting nodes into the
raw parsetree where explicit parentheses appear. Various places in
parse_expr.c that test to see "is this child node of type Foo" need to
look through such nodes, else we'll get different behavior when
operator_precedence_warning is on than when it is off. Note that we only
need to handle this when testing untransformed child nodes, since the
AEXPR_PAREN nodes will be gone anyway after transformExprRecurse.
Per report from Scott Ribe and additional code-reading. Back-patch
to 9.5 where this feature was added.
Report: <ED37E303-1B0A-4CD8-8E1E-B9C4C2DD9A17@elevated-dev.com>
Given a left join containing a full join in its righthand side, with
the left join's joinclause referencing only one side of the full join
(in a non-strict fashion, so that the full join doesn't get simplified),
the planner could fail with "failed to build any N-way joins" or related
errors. This happened because the full join was seen as overlapping the
left join's RHS, and then recent changes within join_is_legal() caused
that function to conclude that the full join couldn't validly be formed.
Rather than try to rejigger join_is_legal() yet more to allow this,
I think it's better to fix initsplan.c so that the required join order
is explicit in the SpecialJoinInfo data structure. The previous coding
there essentially ignored full joins, relying on the fact that we don't
flatten them in the joinlist data structure to preserve their ordering.
That's sufficient to prevent a wrong plan from being formed, but as this
example shows, it's not sufficient to ensure that the right plan will
be formed. We need to work a bit harder to ensure that the right plan
looks sane according to the SpecialJoinInfos.
Per bug #14105 from Vojtech Rylko. This was apparently induced by
commit 8703059c6 (though now that I've seen it, I wonder whether there
are related cases that could have failed before that); so back-patch
to all active branches. Unfortunately, that patch also went into 9.0,
so this bug is a regression that won't be fixed in that branch.
The coverage was rather lean for cases that bind() or listen() might
return. Add entries for everything that there's a direct equivalent
for in the set of Unix errnos that elog.c has heard of.
When we shoehorned "x op ANY (array)" into the SQL syntax, we created a
fundamental ambiguity as to the proper treatment of a sub-SELECT on the
righthand side: perhaps what's meant is to compare x against each row of
the sub-SELECT's result, or perhaps the sub-SELECT is meant as a scalar
sub-SELECT that delivers a single array value whose members should be
compared against x. The grammar resolves it as the former case whenever
the RHS is a select_with_parens, making the latter case hard to reach ---
but you can get at it, with tricks such as attaching a no-op cast to the
sub-SELECT. Parse analysis would throw away the no-op cast, leaving a
parsetree with an EXPR_SUBLINK SubLink directly under a ScalarArrayOpExpr.
ruleutils.c was not clued in on this fine point, and would naively emit
"x op ANY ((SELECT ...))", which would be parsed as the first alternative,
typically leading to errors like "operator does not exist: text = text[]"
during dump/reload of a view or rule containing such a construct. To fix,
emit a no-op cast when dumping such a parsetree. This might well be
exactly what the user wrote to get the construct accepted in the first
place; and even if she got there with some other dodge, it is a valid
representation of the parsetree.
Per report from Karl Czajkowski. He mentioned only a case involving
RLS policies, but actually the problem is very old, so back-patch to
all supported branches.
Report: <20160421001832.GB7976@moraine.isi.edu>
Also, avoid reading PGPROC's wait_event field twice, once for the wait
event and again for the wait_event_type, because the value might change
in the middle.
Petr Jelinek and Robert Haas
That commit increased all shared memory allocations to the next higher
multiple of PG_CACHE_LINE_SIZE, but it didn't ensure that allocation
started on a cache line boundary. It also failed to remove a couple
other pieces of now-useless code.
BUFFERALIGN() is perhaps obsolete at this point, and likely should be
removed at some point, too, but that seems like it can be left to a
future cleanup.
Mistakes all pointed out by Andres Freund. The patch is mine, with
a few extra assertions which I adopted from his version of this fix.
Even with old_snapshot_threshold = -1 (which disables the "snapshot
too old" feature), performance regressions were seen at moderate to
high concurrency. For example, a one-socket, four-core system
running 200 connections at saturation could see up to a 2.3%
regression, with larger regressions possible on NUMA machines.
By inlining the early (smaller, faster) tests in the
TestForOldSnapshot() function, the i7 case dropped to a 0.2%
regression, which could easily just be noise, and is clearly an
improvement. Further testing will show whether more is needed.
Commit 36a35c550a turned the interface between ginPlaceToPage and
its subroutines in gindatapage.c and ginentrypage.c into a royal mess:
page-update critical sections were started in one place and finished in
another place not even in the same file, and the very same subroutine
might return having started a critical section or not. Subsequent patches
band-aided over some of the problems with this design by making things
even messier.
One user-visible resulting problem is memory leaks caused by the need for
the subroutines to allocate storage that would survive until ginPlaceToPage
calls XLogInsert (as reported by Julien Rouhaud). This would not typically
be noticeable during retail index updates. It could be visible in a GIN
index build, in the form of memory consumption swelling to several times
the commanded maintenance_work_mem.
Another rather nasty problem is that in the internal-page-splitting code
path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before
entering the critical section that it's supposed to be cleared in; a
failure in between would leave the index in a corrupt state. There were
also assorted coding-rule violations with little immediate consequence but
possible long-term hazards, such as beginning an XLogInsert sequence before
entering a critical section, or calling elog(DEBUG) inside a critical
section.
To fix, redefine the API between ginPlaceToPage() and its subroutines
by splitting the subroutines into two parts. The "beginPlaceToPage"
subroutine does what can be done outside a critical section, including
full computation of the result pages into temporary storage when we're
going to split the target page. The "execPlaceToPage" subroutine is called
within a critical section established by ginPlaceToPage(), and it handles
the actual page update in the non-split code path. The critical section,
as well as the XLOG insertion call sequence, are both now always started
and finished in ginPlaceToPage(). Also, make ginPlaceToPage() create and
work in a short-lived memory context to eliminate the leakage problem.
(Since a short-lived memory context had been getting created in the most
common code path in the subroutines, this shouldn't cause any noticeable
performance penalty; we're just moving the overhead up one call level.)
In passing, fix a bunch of comments that had gone unmaintained throughout
all this klugery.
Report: <571276DD.5050303@dalibo.com>
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature. Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming). The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.
This change should have little or no effect on generated executable
code.
Motivated by the back-patching pain of Tom Lane and Robert Haas
Coverity complained that oldPartitionLock was possibly dereferenced after
having been set to NULL. That actually can't happen, because we'd only use
it if (oldFlags & BM_TAG_VALID) is true. But nonetheless Coverity is
justified in complaining, because at line 1275 we actually overwrite
oldFlags, and then still expect its BM_TAG_VALID bit to be a safe guide to
whether to release the oldPartitionLock. Thus, the code would be incorrect
if someone else had changed the buffer's BM_TAG_VALID flag meanwhile.
That should not happen, since we hold pin on the buffer throughout this
sequence, but it's starting to look like a rather shaky chain of logic.
And there's no need for such assumptions, because we can simply replace
the (oldFlags & BM_TAG_VALID) tests with (oldPartitionLock != NULL),
which has identical results and makes it plain to all comers that we don't
dereference a null pointer. A small side benefit is that the range of
liveness of oldFlags is greatly reduced, possibly allowing the compiler
to save a register.
This is just cleanup, not an actual bug fix, so there seems no need
for a back-patch.
We've had repeated troubles over the years with failures to initialize
spinlocks correctly; see 6b93fcd14 for a recent example. Most of the time,
on most platforms, such oversights can escape notice because all-zeroes is
the expected initial content of an slock_t variable. The only platform
we have where the initialized state of an slock_t isn't zeroes is HPPA,
and that's practically gone in the wild. To make it easier to catch such
errors without needing one of those, adjust the --disable-spinlocks code
so that zero is not a valid value for an slock_t for it.
In passing, remove a bunch of unnecessary #include's from spin.c;
commit daa7527afc removed all the intermodule coupling that
made them necessary.
Although OID acts pretty much like user data, the other system columns do
not, so an index on one would likely misbehave. And it's pretty hard to
see a use-case for one, anyway. Let's just forbid the case rather than
worry about whether it should be supported.
David Rowley
For reasons of sheer brain fade, we (I) was calling systable_endscan()
immediately after systable_getnext() and expecting the tuple returned
by systable_getnext() to still be valid.
That's clearly wrong. Move the systable_endscan() down below the tuple
usage.
Discovered initially by Pavel Stehule and then also by Alvaro.
Add a regression test based on Alvaro's testing.
Careless coding added by commit 07cacba983 could result in a crash
or a bizarre error message if someone tried to select an index on the
OID column as the replica identity index for a table. Back-patch to 9.4
where the feature was introduced.
Discussion: CAKJS1f8TQYgTRDyF1_u9PVCKWRWz+DkieH=U7954HeHVPJKaKg@mail.gmail.com
David Rowley
The previous display was sort of confusing, because it didn't
distinguish between the number of workers that we planned to launch
and the number that actually got launched. This has already confused
several people, so display both numbers and label them clearly.
Julien Rouhaud, reviewed by me.
pg_xlogdump includes bufmgr.h. With a compiler that emits code for
static inline functions even when they're unreferenced, that leads
to unresolved external references in the new static-inline version
of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done
for similar issues elsewhere. Per buildfarm member pademelon.
The code had a query-lifespan memory leak when encountering GIN entries
that have posting lists (rather than posting trees, ie, there are a
relatively small number of heap tuples containing this index key value).
With a suitable data distribution this could add up to a lot of leakage.
Problem seems to have been introduced by commit 36a35c550, so back-patch
to 9.4.
Julien Rouhaud
My previous attempt at doing so, in 80abbeba23, was not sufficient. While that
fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant
expressions in the struct initializer, because the file/line/function
information comes from the caller of s_lock().
Give up on using a macro, and use a static inline instead.
Discussion: 4369.1460435533@sss.pgh.pa.us
When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't. We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe. For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?
The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.
Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.
Discussion: <20160413094117.GC21485@msg.credativ.de>
Commit 314cbfc5da redefined the signature of this hook as
typedef int (*walrcv_receive_type) (char **buffer, int *wait_fd);
But in fact the type of the "wait_fd" variable ought to be pgsocket,
which is what WaitLatchOrSocket expects, and which is necessary if
we want to be able to assign PGINVALID_SOCKET to it on Windows.
So fix that.
It was declared as "pid_t", which would be fine except that none of
the places that printed it in error messages took any thought for the
possibility that it's not equivalent to "int". This leads to warnings
on some buildfarm members, and could possibly lead to actually wrong
error messages on those platforms. There doesn't seem to be any very
good reason not to just make it "int"; it's only ever assigned from
MyProcPid, which is int. If we want to cope with PIDs that are wider
than int, this is not the place to start.
Also, fix the comment, which seems to perhaps be a leftover from a time
when the field was only a bool?
Per buildfarm. Back-patch to 9.5 which has same issue.
For a long time, opclasscmds.c explained that "we do not create a
dependency link to the AM [for an opclass or opfamily], because we don't
currently support DROP ACCESS METHOD". Commit 473b932870 invented
DROP ACCESS METHOD, but it batted only 1 for 2 on adding the dependency
links, and 0 for 2 on updating the comments about the topic.
In passing, undo the same commit's entirely inappropriate decision to
blow away an existing index as a side-effect of create_am.sql.
As part of reserving the pg_* namespace for default roles and in line
with SET ROLE and other previous efforts, disallow settings the role
to a default/reserved role using SET SESSION AUTHORIZATION.
These checks and restrictions on what is allowed regarding default /
reserved roles are under debate, but it seems prudent to ensure that
the existing checks at least cover the intended cases while the
debate rages on. On me to clean it up if the consensus decision is
to remove these checks.
Logical messages, added in 3fe3511d05, during decoding failed to filter
messages emitted in other databases and messages emitted "under" a
replication origin the output plugin isn't interested in.
Add tests to verify that both types of filtering actually work. While
touching message.sql remove hunk obsoleted by d25379e.
Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because
3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump
catversion, but 7a542700d has done so since.
Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de
The current definition of init_spin_delay (introduced recently in
48354581a) wasn't C89 compliant. It's not legal to refer to refer to
non-constant expressions, and the ptr argument was one. This, as
reported by Tom, lead to a failure on buildfarm animal pademelon.
The pointer, especially on system systems with ASLR, isn't super helpful
anyway, though. So instead of making init_spin_delay into an inline
function, make s_lock_stuck() report the function name in addition to
file:line and change init_spin_delay() accordingly. While not a direct
replacement, the function name is likely more useful anyway (line
numbers are often hard to interpret in third party reports).
This also fixes what file/line number is reported for waits via
s_lock().
As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h.
Reported-By: Tom Lane
Discussion: 4369.1460435533@sss.pgh.pa.us
The recent patch to make Pin/UnpinBuffer lockfree in the hot
path (48354581a), accidentally used pg_atomic_fetch_or_u32() in
MarkLocalBufferDirty(). Other code operating on local buffers was
careful to only use pg_atomic_read/write_u32 which just read/write from
memory; to avoid unnecessary overhead.
On its own that'd just make MarkLocalBufferDirty() slightly less
efficient, but in addition InitLocalBuffers() doesn't call
pg_atomic_init_u32() - thus the spinlock fallback for the atomic
operations isn't initialized. That in turn caused, as reported by Tom,
buildfarm animal gaur to fail. As those errors are actually useful
against this type of error, continue to omit - intentionally this time -
initialization of the atomic variable.
In addition, add an explicit note about only using pg_atomic_read/write
on local buffers's state to BufferDesc's description.
Reported-By: Tom Lane
Discussion: 1881.1460431476@sss.pgh.pa.us
It's silly to define these counts as narrower than they might someday
need to be. Also, I believe that the BLCKSZ * nflush calculation in
mdwriteback was capable of overflowing an int.
Commit 428b1d6b29 introduced the use of
msync() for flushing dirty data from the kernel's file buffers. Several
portability issues were overlooked, though:
* Not all implementations of mmap() think that nbytes == 0 means "map
the whole file". To fix, use lseek() to find out the true length.
Fix callers of pg_flush_data to be aware that nbytes == 0 may result
in trashing the file's seek position.
* Not all implementations of mmap() will accept partial-page mmap
requests. To fix, round down the length request to whatever sysconf()
says the page size is. (I think this is OK from a portability standpoint,
because sysconf() is required by SUS v2, and we aren't trying to compile
this part on Windows anyway. Buildfarm should let us know if not.)
* On 32-bit machines, the file size might exceed the available free
address space, or even exceed what will fit in size_t. Check for
the latter explicitly to avoid passing a false request size to mmap().
If mmap fails, silently fall through to the next implementation method,
rather than bleating to the postmaster log and giving up.
* mmap'ing directories fails on some platforms, and even if it works,
msync'ing the directory is quite unlikely to help, as for that matter are
the other flush implementations. In pre_sync_fname(), just skip flush
attempts on directories.
In passing, copy-edit the comments a bit.
Stas Kelvich and myself
I've seen one too many "could not bind IPv4 socket: No error" log entries
from the Windows buildfarm members. Per previous discussion, this is
likely caused by the fact that we're doing nothing to translate
WSAGetLastError() to errno. Put in a wrapper layer to do that.
If this works as expected, it should get back-patched, but let's see what
happens in the buildfarm first.
Discussion: <4065.1452450340@sss.pgh.pa.us>
The original patch kind of ignored the fact that we were doing something
different from a costing point of view, but nobody noticed. This patch
fixes that oversight.
David Rowley
That unused function was introduced as a sample because synchronous
replication or replication monitoring tools might need it in the future.
Recently commit 989be08 added the function SyncRepGetOldestSyncRecPtr
which provides almost the same functionality for multiple synchronous
standbys feature. So it's time to remove that unused sample function.
This commit does that.
Per discussion, this gives potential users of the hook more flexibility,
because they can build custom Paths that implement only one stage of
upper processing atop core-provided Paths for earlier stages.
On a big NUMA machine with 1000 connections in saturation load
there was a performance regression due to spinlock contention, for
acquiring values which were never used. Just fill with dummy
values if we're not going to use them.
This patch has not been benchmarked yet on a big NUMA machine, but
it seems like a good idea on general principle, and it seemed to
prevent an apparent 2.2% regression on a single-socket i7 box
running 200 connections at saturation load.
Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future. Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.
Alexander Korotkov, adjusted slightly by me