The true explanation for Peter Eisentraut's report of inexact asind results
seems to be that (a) he's compiling into x87 instruction set, which uses
wider-than-double float registers, plus (b) the library function asin() on
his platform returns a result that is wider than double and is not rounded
to double width. To fix, we have to force the function's result to be
rounded comparably to what happened to the scaling constant asin_0_5.
Experimentation suggests that storing it into a volatile local variable is
the least ugly way of making that happen. Although only asin() is known to
exhibit an observable inexact result, we'd better do this in all the places
where we're hoping to get an exact result by scaling.
Change max_parallel_degree default from 0 to 2. It is possible that
this is not a good idea, or that we should go with 1 worker rather
than 2, but we won't find out without trying it. Along the way,
reword the documentation for max_parallel_degree a little bit to
hopefully make it more clear.
Discussion: 20160420174631.3qjjhpwsvvx5bau5@alap3.anarazel.de
Commit 65abaab547 tried to prevent the scaling constants used in
the degree-based trig functions from being precomputed at compile time,
because some compilers do that with functions that don't yield results
identical-to-the-last-bit to what you get at runtime. A report from
Peter Eisentraut suggests that some recent compilers are smart enough
to see through that trick, though. Instead, let's put the inputs to
these calculations into non-const global variables, which should be a
more reliable way of convincing the compiler that it can't assume that
they are compile-time constants. (If we really get desperate, we could
mark these variables "volatile", but I do not believe we should have to.)
Several issues:
1) checkpoint_flush_after doc and code disagreed about the default
2) new GUCs were missing from postgresql.conf.sample
3) Outdated source-code comment about bgwriter_flush_after's default
4) Sub-optimal categories assigned to new GUCs
5) Docs suggested backend_flush_after is PGC_SIGHUP, but it's PGC_USERSET.
6) Spell out int as integer in the docs, as done elsewhere
Reported-By: Magnus Hagander, Fujii Masao
Discussion: CAHGQGwETyTG5VYQQ5C_srwxWX7RXvFcD3dKROhvAWWhoSBdmZw@mail.gmail.com
NetBSD has seen fit to invent a libc function named strtoi(), which
conflicts with the long-established static functions of the same name in
datetime.c and ecpg's interval.c. While muttering darkly about intrusions
on application namespace, we'll rename our functions to avoid the conflict.
Back-patch to all supported branches, since this would affect attempts
to build any of them on recent NetBSD.
Thomas Munro
The implementation of that feature involves injecting nodes into the
raw parsetree where explicit parentheses appear. Various places in
parse_expr.c that test to see "is this child node of type Foo" need to
look through such nodes, else we'll get different behavior when
operator_precedence_warning is on than when it is off. Note that we only
need to handle this when testing untransformed child nodes, since the
AEXPR_PAREN nodes will be gone anyway after transformExprRecurse.
Per report from Scott Ribe and additional code-reading. Back-patch
to 9.5 where this feature was added.
Report: <ED37E303-1B0A-4CD8-8E1E-B9C4C2DD9A17@elevated-dev.com>
Given a left join containing a full join in its righthand side, with
the left join's joinclause referencing only one side of the full join
(in a non-strict fashion, so that the full join doesn't get simplified),
the planner could fail with "failed to build any N-way joins" or related
errors. This happened because the full join was seen as overlapping the
left join's RHS, and then recent changes within join_is_legal() caused
that function to conclude that the full join couldn't validly be formed.
Rather than try to rejigger join_is_legal() yet more to allow this,
I think it's better to fix initsplan.c so that the required join order
is explicit in the SpecialJoinInfo data structure. The previous coding
there essentially ignored full joins, relying on the fact that we don't
flatten them in the joinlist data structure to preserve their ordering.
That's sufficient to prevent a wrong plan from being formed, but as this
example shows, it's not sufficient to ensure that the right plan will
be formed. We need to work a bit harder to ensure that the right plan
looks sane according to the SpecialJoinInfos.
Per bug #14105 from Vojtech Rylko. This was apparently induced by
commit 8703059c6 (though now that I've seen it, I wonder whether there
are related cases that could have failed before that); so back-patch
to all active branches. Unfortunately, that patch also went into 9.0,
so this bug is a regression that won't be fixed in that branch.
The coverage was rather lean for cases that bind() or listen() might
return. Add entries for everything that there's a direct equivalent
for in the set of Unix errnos that elog.c has heard of.
When we shoehorned "x op ANY (array)" into the SQL syntax, we created a
fundamental ambiguity as to the proper treatment of a sub-SELECT on the
righthand side: perhaps what's meant is to compare x against each row of
the sub-SELECT's result, or perhaps the sub-SELECT is meant as a scalar
sub-SELECT that delivers a single array value whose members should be
compared against x. The grammar resolves it as the former case whenever
the RHS is a select_with_parens, making the latter case hard to reach ---
but you can get at it, with tricks such as attaching a no-op cast to the
sub-SELECT. Parse analysis would throw away the no-op cast, leaving a
parsetree with an EXPR_SUBLINK SubLink directly under a ScalarArrayOpExpr.
ruleutils.c was not clued in on this fine point, and would naively emit
"x op ANY ((SELECT ...))", which would be parsed as the first alternative,
typically leading to errors like "operator does not exist: text = text[]"
during dump/reload of a view or rule containing such a construct. To fix,
emit a no-op cast when dumping such a parsetree. This might well be
exactly what the user wrote to get the construct accepted in the first
place; and even if she got there with some other dodge, it is a valid
representation of the parsetree.
Per report from Karl Czajkowski. He mentioned only a case involving
RLS policies, but actually the problem is very old, so back-patch to
all supported branches.
Report: <20160421001832.GB7976@moraine.isi.edu>
Also, avoid reading PGPROC's wait_event field twice, once for the wait
event and again for the wait_event_type, because the value might change
in the middle.
Petr Jelinek and Robert Haas
That commit increased all shared memory allocations to the next higher
multiple of PG_CACHE_LINE_SIZE, but it didn't ensure that allocation
started on a cache line boundary. It also failed to remove a couple
other pieces of now-useless code.
BUFFERALIGN() is perhaps obsolete at this point, and likely should be
removed at some point, too, but that seems like it can be left to a
future cleanup.
Mistakes all pointed out by Andres Freund. The patch is mine, with
a few extra assertions which I adopted from his version of this fix.
Even with old_snapshot_threshold = -1 (which disables the "snapshot
too old" feature), performance regressions were seen at moderate to
high concurrency. For example, a one-socket, four-core system
running 200 connections at saturation could see up to a 2.3%
regression, with larger regressions possible on NUMA machines.
By inlining the early (smaller, faster) tests in the
TestForOldSnapshot() function, the i7 case dropped to a 0.2%
regression, which could easily just be noise, and is clearly an
improvement. Further testing will show whether more is needed.
Commit 36a35c550a turned the interface between ginPlaceToPage and
its subroutines in gindatapage.c and ginentrypage.c into a royal mess:
page-update critical sections were started in one place and finished in
another place not even in the same file, and the very same subroutine
might return having started a critical section or not. Subsequent patches
band-aided over some of the problems with this design by making things
even messier.
One user-visible resulting problem is memory leaks caused by the need for
the subroutines to allocate storage that would survive until ginPlaceToPage
calls XLogInsert (as reported by Julien Rouhaud). This would not typically
be noticeable during retail index updates. It could be visible in a GIN
index build, in the form of memory consumption swelling to several times
the commanded maintenance_work_mem.
Another rather nasty problem is that in the internal-page-splitting code
path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before
entering the critical section that it's supposed to be cleared in; a
failure in between would leave the index in a corrupt state. There were
also assorted coding-rule violations with little immediate consequence but
possible long-term hazards, such as beginning an XLogInsert sequence before
entering a critical section, or calling elog(DEBUG) inside a critical
section.
To fix, redefine the API between ginPlaceToPage() and its subroutines
by splitting the subroutines into two parts. The "beginPlaceToPage"
subroutine does what can be done outside a critical section, including
full computation of the result pages into temporary storage when we're
going to split the target page. The "execPlaceToPage" subroutine is called
within a critical section established by ginPlaceToPage(), and it handles
the actual page update in the non-split code path. The critical section,
as well as the XLOG insertion call sequence, are both now always started
and finished in ginPlaceToPage(). Also, make ginPlaceToPage() create and
work in a short-lived memory context to eliminate the leakage problem.
(Since a short-lived memory context had been getting created in the most
common code path in the subroutines, this shouldn't cause any noticeable
performance penalty; we're just moving the overhead up one call level.)
In passing, fix a bunch of comments that had gone unmaintained throughout
all this klugery.
Report: <571276DD.5050303@dalibo.com>
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature. Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming). The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.
This change should have little or no effect on generated executable
code.
Motivated by the back-patching pain of Tom Lane and Robert Haas
Coverity complained that oldPartitionLock was possibly dereferenced after
having been set to NULL. That actually can't happen, because we'd only use
it if (oldFlags & BM_TAG_VALID) is true. But nonetheless Coverity is
justified in complaining, because at line 1275 we actually overwrite
oldFlags, and then still expect its BM_TAG_VALID bit to be a safe guide to
whether to release the oldPartitionLock. Thus, the code would be incorrect
if someone else had changed the buffer's BM_TAG_VALID flag meanwhile.
That should not happen, since we hold pin on the buffer throughout this
sequence, but it's starting to look like a rather shaky chain of logic.
And there's no need for such assumptions, because we can simply replace
the (oldFlags & BM_TAG_VALID) tests with (oldPartitionLock != NULL),
which has identical results and makes it plain to all comers that we don't
dereference a null pointer. A small side benefit is that the range of
liveness of oldFlags is greatly reduced, possibly allowing the compiler
to save a register.
This is just cleanup, not an actual bug fix, so there seems no need
for a back-patch.
We've had repeated troubles over the years with failures to initialize
spinlocks correctly; see 6b93fcd14 for a recent example. Most of the time,
on most platforms, such oversights can escape notice because all-zeroes is
the expected initial content of an slock_t variable. The only platform
we have where the initialized state of an slock_t isn't zeroes is HPPA,
and that's practically gone in the wild. To make it easier to catch such
errors without needing one of those, adjust the --disable-spinlocks code
so that zero is not a valid value for an slock_t for it.
In passing, remove a bunch of unnecessary #include's from spin.c;
commit daa7527afc removed all the intermodule coupling that
made them necessary.
Although OID acts pretty much like user data, the other system columns do
not, so an index on one would likely misbehave. And it's pretty hard to
see a use-case for one, anyway. Let's just forbid the case rather than
worry about whether it should be supported.
David Rowley
For reasons of sheer brain fade, we (I) was calling systable_endscan()
immediately after systable_getnext() and expecting the tuple returned
by systable_getnext() to still be valid.
That's clearly wrong. Move the systable_endscan() down below the tuple
usage.
Discovered initially by Pavel Stehule and then also by Alvaro.
Add a regression test based on Alvaro's testing.
Careless coding added by commit 07cacba983 could result in a crash
or a bizarre error message if someone tried to select an index on the
OID column as the replica identity index for a table. Back-patch to 9.4
where the feature was introduced.
Discussion: CAKJS1f8TQYgTRDyF1_u9PVCKWRWz+DkieH=U7954HeHVPJKaKg@mail.gmail.com
David Rowley
The previous display was sort of confusing, because it didn't
distinguish between the number of workers that we planned to launch
and the number that actually got launched. This has already confused
several people, so display both numbers and label them clearly.
Julien Rouhaud, reviewed by me.
pg_xlogdump includes bufmgr.h. With a compiler that emits code for
static inline functions even when they're unreferenced, that leads
to unresolved external references in the new static-inline version
of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done
for similar issues elsewhere. Per buildfarm member pademelon.
The code had a query-lifespan memory leak when encountering GIN entries
that have posting lists (rather than posting trees, ie, there are a
relatively small number of heap tuples containing this index key value).
With a suitable data distribution this could add up to a lot of leakage.
Problem seems to have been introduced by commit 36a35c550, so back-patch
to 9.4.
Julien Rouhaud
My previous attempt at doing so, in 80abbeba23, was not sufficient. While that
fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant
expressions in the struct initializer, because the file/line/function
information comes from the caller of s_lock().
Give up on using a macro, and use a static inline instead.
Discussion: 4369.1460435533@sss.pgh.pa.us
When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't. We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe. For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?
The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.
Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.
Discussion: <20160413094117.GC21485@msg.credativ.de>
Commit 314cbfc5da redefined the signature of this hook as
typedef int (*walrcv_receive_type) (char **buffer, int *wait_fd);
But in fact the type of the "wait_fd" variable ought to be pgsocket,
which is what WaitLatchOrSocket expects, and which is necessary if
we want to be able to assign PGINVALID_SOCKET to it on Windows.
So fix that.
It was declared as "pid_t", which would be fine except that none of
the places that printed it in error messages took any thought for the
possibility that it's not equivalent to "int". This leads to warnings
on some buildfarm members, and could possibly lead to actually wrong
error messages on those platforms. There doesn't seem to be any very
good reason not to just make it "int"; it's only ever assigned from
MyProcPid, which is int. If we want to cope with PIDs that are wider
than int, this is not the place to start.
Also, fix the comment, which seems to perhaps be a leftover from a time
when the field was only a bool?
Per buildfarm. Back-patch to 9.5 which has same issue.
For a long time, opclasscmds.c explained that "we do not create a
dependency link to the AM [for an opclass or opfamily], because we don't
currently support DROP ACCESS METHOD". Commit 473b932870 invented
DROP ACCESS METHOD, but it batted only 1 for 2 on adding the dependency
links, and 0 for 2 on updating the comments about the topic.
In passing, undo the same commit's entirely inappropriate decision to
blow away an existing index as a side-effect of create_am.sql.
As part of reserving the pg_* namespace for default roles and in line
with SET ROLE and other previous efforts, disallow settings the role
to a default/reserved role using SET SESSION AUTHORIZATION.
These checks and restrictions on what is allowed regarding default /
reserved roles are under debate, but it seems prudent to ensure that
the existing checks at least cover the intended cases while the
debate rages on. On me to clean it up if the consensus decision is
to remove these checks.
Logical messages, added in 3fe3511d05, during decoding failed to filter
messages emitted in other databases and messages emitted "under" a
replication origin the output plugin isn't interested in.
Add tests to verify that both types of filtering actually work. While
touching message.sql remove hunk obsoleted by d25379e.
Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because
3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump
catversion, but 7a542700d has done so since.
Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de
The current definition of init_spin_delay (introduced recently in
48354581a) wasn't C89 compliant. It's not legal to refer to refer to
non-constant expressions, and the ptr argument was one. This, as
reported by Tom, lead to a failure on buildfarm animal pademelon.
The pointer, especially on system systems with ASLR, isn't super helpful
anyway, though. So instead of making init_spin_delay into an inline
function, make s_lock_stuck() report the function name in addition to
file:line and change init_spin_delay() accordingly. While not a direct
replacement, the function name is likely more useful anyway (line
numbers are often hard to interpret in third party reports).
This also fixes what file/line number is reported for waits via
s_lock().
As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h.
Reported-By: Tom Lane
Discussion: 4369.1460435533@sss.pgh.pa.us
The recent patch to make Pin/UnpinBuffer lockfree in the hot
path (48354581a), accidentally used pg_atomic_fetch_or_u32() in
MarkLocalBufferDirty(). Other code operating on local buffers was
careful to only use pg_atomic_read/write_u32 which just read/write from
memory; to avoid unnecessary overhead.
On its own that'd just make MarkLocalBufferDirty() slightly less
efficient, but in addition InitLocalBuffers() doesn't call
pg_atomic_init_u32() - thus the spinlock fallback for the atomic
operations isn't initialized. That in turn caused, as reported by Tom,
buildfarm animal gaur to fail. As those errors are actually useful
against this type of error, continue to omit - intentionally this time -
initialization of the atomic variable.
In addition, add an explicit note about only using pg_atomic_read/write
on local buffers's state to BufferDesc's description.
Reported-By: Tom Lane
Discussion: 1881.1460431476@sss.pgh.pa.us
It's silly to define these counts as narrower than they might someday
need to be. Also, I believe that the BLCKSZ * nflush calculation in
mdwriteback was capable of overflowing an int.
Commit 428b1d6b29 introduced the use of
msync() for flushing dirty data from the kernel's file buffers. Several
portability issues were overlooked, though:
* Not all implementations of mmap() think that nbytes == 0 means "map
the whole file". To fix, use lseek() to find out the true length.
Fix callers of pg_flush_data to be aware that nbytes == 0 may result
in trashing the file's seek position.
* Not all implementations of mmap() will accept partial-page mmap
requests. To fix, round down the length request to whatever sysconf()
says the page size is. (I think this is OK from a portability standpoint,
because sysconf() is required by SUS v2, and we aren't trying to compile
this part on Windows anyway. Buildfarm should let us know if not.)
* On 32-bit machines, the file size might exceed the available free
address space, or even exceed what will fit in size_t. Check for
the latter explicitly to avoid passing a false request size to mmap().
If mmap fails, silently fall through to the next implementation method,
rather than bleating to the postmaster log and giving up.
* mmap'ing directories fails on some platforms, and even if it works,
msync'ing the directory is quite unlikely to help, as for that matter are
the other flush implementations. In pre_sync_fname(), just skip flush
attempts on directories.
In passing, copy-edit the comments a bit.
Stas Kelvich and myself
I've seen one too many "could not bind IPv4 socket: No error" log entries
from the Windows buildfarm members. Per previous discussion, this is
likely caused by the fact that we're doing nothing to translate
WSAGetLastError() to errno. Put in a wrapper layer to do that.
If this works as expected, it should get back-patched, but let's see what
happens in the buildfarm first.
Discussion: <4065.1452450340@sss.pgh.pa.us>
The original patch kind of ignored the fact that we were doing something
different from a costing point of view, but nobody noticed. This patch
fixes that oversight.
David Rowley
That unused function was introduced as a sample because synchronous
replication or replication monitoring tools might need it in the future.
Recently commit 989be08 added the function SyncRepGetOldestSyncRecPtr
which provides almost the same functionality for multiple synchronous
standbys feature. So it's time to remove that unused sample function.
This commit does that.
Per discussion, this gives potential users of the hook more flexibility,
because they can build custom Paths that implement only one stage of
upper processing atop core-provided Paths for earlier stages.
On a big NUMA machine with 1000 connections in saturation load
there was a performance regression due to spinlock contention, for
acquiring values which were never used. Just fill with dummy
values if we're not going to use them.
This patch has not been benchmarked yet on a big NUMA machine, but
it seems like a good idea on general principle, and it seemed to
prevent an apparent 2.2% regression on a single-socket i7 box
running 200 connections at saturation load.
Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future. Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.
Alexander Korotkov, adjusted slightly by me
The previous coding could allow the contents of the "hole" between pd_lower
and pd_upper to diverge during replay from what it had been when the update
was originally applied. This would pose a problem if checksums were in
use, and in any case would complicate forensic comparisons between master
and slave servers. So force the "hole" to contain zeroes, both at initial
application of a generically-logged action, and at replay.
Alexander Korotkov, adjusted slightly by me
When IF NOT EXISTS was added to CREATE TABLE AS, this logic didn't get
the memo, possibly resulting in an Assert failure. It looks like there
would have been no ill effects in a non-Assert build, though. Back-patch
to 9.5 where the IF NOT EXISTS option was added.
Stas Kelvich
I was initially concerned that the some of the hundreds of
references to BufferGetPage() where the literal
BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a
macro, leading to some hard-to-find performance regressions in
corner cases. Inspection of disassembled code has shown identical
code at all inspected locations, and the size difference doesn't
amount to even one byte per such call. So make it readable.
Per gripes from Álvaro Herrera and Tom Lane
It was incorrectly declared as a volatile pointer to a non-volatile
structure. Eliminate the OldSnapshotControl struct definition; it
is really not needed. Pointed out by Tom Lane.
While at it, add OldSnapshotControlData to pgindent's list of
structures.
The existing code would either Assert or generate an invalid
SyncRepConfig variable, neither of which is desirable. A regular
error should be thrown instead.
This commit silences compiler warning in non assertion-enabled builds.
Per report from Jeff Janes.
Suggested fix by Tom Lane.
Coverity complained about an apparent missing "break" in a switch
added by bb140506df. The human-readable comments are pretty
clear that this is intentional, but add a standard /* FALL THRU */
comment to make it clear to tools too.
Coverity complained that the code added by 015e88942a lacked an
error check for SearchSysCache1 failures, which it should have. But
the code was pretty duff in other ways too, including failure to think
about whether it could really cope with arrays of different lengths.
Previously we used a spinlock, in adition to the atomically manipulated
->state field, to protect the wait queue. But it's pretty simple to
instead perform the locking using a flag in state.
Due to 6150a1b0 BufferDescs, on platforms (like PPC) with > 1 byte
spinlocks, increased their size above 64byte. As 64 bytes are the size
we pad allocated BufferDescs to, this can increase false sharing;
causing performance problems in turn. Together with the previous commit
this reduces the size to <= 64 bytes on all common platforms.
Author: Andres Freund
Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de
Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).
To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable;
that allows to manipulate them together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could
be lifted by using a 64bit field, but it's not a realistic configuration
atm).
As not all operations can easily implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot-paths. There's some additional operations which can be done without
the lock, but aren't in this patch; but the most important places are
covered.
As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It now has already two
users, and more are coming up; there's a follupw patch for lwlock.c at
least.
This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me. Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.
On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.
Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell
I used the wrong variable here. Doesn't make a difference today because
the only plausible caller passes a non-NULL variable, but someday it
will be wrong, and even today's correctness is subtle: the caller that
does pass a NULL is never invoked because of object type constraints.
Surely not a condition to rely on.
Noted by Coverity
Since we're requiring pages handled by generic_xlog.c to be standard
format, specify REGBUF_STANDARD when doing a full-page image, so that
xloginsert.c can compress out the "hole" between pd_lower and pd_upper.
Given the current API in which this path will be taken only for a newly
initialized page, the hole is likely to be particularly large in such
cases, so that this oversight could easily be performance-significant.
I don't notice any particular change in the runtime of contrib/bloom's
regression test, though.
Make the inner comparison loops of computeDelta() as tight as possible by
pulling considerations of valid and invalid ranges out of the inner loops,
and extending a match or non-match detection as far as possible before
deciding what to do next. To keep this tractable, give up the possibility
of merging fragments across the pd_lower to pd_upper gap. The fraction of
pages where that could happen (ie, there are 4 or fewer bytes in the gap,
*and* data changes immediately adjacent to it on both sides) is too small
to be worth spending cycles on.
Also, avoid two BLCKSZ-length memcpy()s by computing the delta before
moving data into the target buffer, instead of after. This doesn't save
nearly as many cycles as being tenser about computeDelta(), but it still
seems worth doing.
On my machine, this patch cuts a full 40% off the runtime of
contrib/bloom's regression test.
This routine is unsafe as implemented, because it invalidates the page
image pointers returned by previous GenericXLogRegister() calls.
Rather than complicate the API or the implementation to avoid that,
let's just get rid of it; the use-case for having it seems much
too thin to justify a lot of work here.
While at it, do some wordsmithing on the SGML docs for generic WAL.
Improve commentary, use more specific names for the delta fields,
const-ify pointer arguments where possible, avoid assuming that
initializing only the first element of a local array will guarantee
that the remaining elements end up as we need them. (I think that
code in generic_redo actually worked, but only because InvalidBuffer
is zero; this is a particularly ugly way of depending on that ...)
Previously bcac23d exposed a subset of support functions, namely the
ones Kaigai found useful. In
20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de I mentioned that
there's some functions missing to use the facility in an external
project.
To avoid having to add functions piecemeal, add all the functions which
are used to define READ_* and WRITE_* macros; users of the extensible
node functionality are likely to need these. Additionally expose
outDatum(), which doesn't have it's own WRITE_ macro, as it needs
information from the embedding struct.
Discussion: 20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de
This creates an initial set of default roles which administrators may
use to grant access to, historically, superuser-only functions. Using
these roles instead of granting superuser access reduces the number of
superuser roles required for a system. Documention for each of the
default roles has been added to user-manag.sgml.
Bump catversion to 201604082, as we had a commit that bumped it to
201604081 and another that set it back to 201604071...
Reviews by José Luis Tallón and Robert Haas
This will prevent users from creating roles which begin with "pg_" and
will check for those roles before allowing an upgrade using pg_upgrade.
This will allow for default roles to be provided at initdb time.
Reviews by José Luis Tallón and Robert Haas
This feature is controlled by a new old_snapshot_threshold GUC. A
value of -1 disables the feature, and that is the default. The
value of 0 is just intended for testing. Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect. The xmin associated with a transaction ID does
still protect dead tuples. A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.
This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in. It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point. This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third. The follow-on patch will change the places where the test
needs to be made.
These parameters are available for SSPI authentication only, to make
it possible to make it behave more like "normal gssapi", while
making it possible to maintain compatibility.
compat_realm is on by default, but can be turned off to make the
authentication use the full Kerberos realm instead of the NetBIOS name.
upn_username is off by default, and can be turned on to return the users
Kerberos UPN rather than the SAM-compatible name (a user in Active
Directory can have both a legacy SAM-compatible username and a new
Kerberos one. Normally they are the same, but not always)
Author: Christian Ullrich
Reviewed by: Robbie Harwood, Alvaro Herrera, me
OpenSSL has an unfortunate tendency to mix per-session state error
handling with per-thread error handling. This can cause problems when
programs that link to libpq with OpenSSL enabled have some other use of
OpenSSL; without care, one caller of OpenSSL may cause problems for the
other caller. Backend code might similarly be affected, for example
when a third party extension independently uses OpenSSL without taking
the appropriate precautions.
To fix, don't trust other users of OpenSSL to clear the per-thread error
queue. Instead, clear the entire per-thread queue ahead of certain I/O
operations when it appears that there might be trouble (these I/O
operations mostly need to call SSL_get_error() to check for success,
which relies on the queue being empty). This is slightly aggressive,
but it's pretty clear that the other callers have a very dubious claim
to ownership of the per-thread queue. Do this is both frontend and
backend code.
Finally, be more careful about clearing our own error queue, so as to
not cause these problems ourself. It's possibly that control previously
did not always reach SSLerrmessage(), where ERR_get_error() was supposed
to be called to clear the queue's earliest code. Make sure
ERR_get_error() is always called, so as to spare other users of OpenSSL
the possibility of similar problems caused by libpq (as opposed to
problems caused by a third party OpenSSL library like PHP's OpenSSL
extension). Again, do this is both frontend and backend code.
See bug #12799 and https://bugs.php.net/bug.php?id=68276
Based on patches by Dave Vitek and Peter Eisentraut.
From: Peter Geoghegan <pg@bowt.ie>
Create a "bsd" auth method that works the same as "password" so far as
clients are concerned, but calls the BSD Authentication service to
check the password. This is currently only available on OpenBSD.
Marisa Emerson, reviewed by Thomas Munro
This allows parallel aggregation to use them. It may seem surprising
that we use float8_combine for both float4_accum and float8_accum
transition functions, but that's because those functions differ only
in the type of the non-transition-state argument.
Haribabu Kommi, reviewed by David Rowley and Tomas Vondra
As noticed by Tom Lane changing operation's number in commit
bb140506df causes on-disk format incompatibility.
Revert to previous numbering, that is reason to add special array to store
priorities of operation. Also it reverts order of tsquery to previous.
Author: Dmitry Ivanov
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.
Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
Don't try to examine S_ISLNK(st.st_mode) after a failed lstat().
It's undefined.
Also, if the lstat() reported ENOENT, we do not wish that to be a hard
error, but the code might nonetheless treat it as one (giving an entirely
misleading error message, too) depending on luck-of-the-draw as to what
S_ISLNK() returned.
Don't throw error for ENOENT from rmdir(), either. (We're not really
expecting ENOENT because we just stat'd the file successfully; but
if we're going to allow ENOENT in the symlink code path, surely the
directory code path should too.)
Generate an appropriate errcode for its-the-wrong-type-of-file complaints.
(ERRCODE_SYSTEM_ERROR doesn't seem appropriate, and failing to write
errcode() around it certainly doesn't work, and not writing an errcode
at all is not per project policy.)
Valgrind noticed the undefined S_ISLNK result; the other problems emerged
while reading the code in the area.
All of this appears to have been introduced in 8f15f74a44.
Back-patch to 9.5 where that commit appeared.
Benchmarking has shown that the current number of clog buffers limits
scalability. We've previously increased the number in 33aaa139, but
that's not sufficient with a large number of clients.
We've benchmarked the cost of increasing the limit by benchmarking worst
case scenarios; testing showed that 128 buffers don't cause a
regression, even in contrived scenarios, whereas 256 does
There are a number of more complex patches flying around to address
various clog scalability problems, but this is simple enough that we can
get it into 9.6; and is beneficial even after those patches have been
applied.
It is a bit unsatisfactory to increase this in small steps every few
releases, but a better solution seems to require a rewrite of slru.c;
not something done quickly.
Author: Amit Kapila and Andres Freund
Discussion: CAA4eK1+-=18HOrdqtLXqOMwZDbC_15WTyHiFruz7BvVArZPaAw@mail.gmail.com
The code that estimates what parallel degree should be uesd for the
scan of a relation is currently rather stupid, so add a parallel_degree
reloption that can be used to override the planner's rather limited
judgement.
Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila,
and me. Some further hacking by me.
The PAM_RHOST item is set to the remote IP address or host name and can
be used by PAM modules. A pg_hba.conf option is provided to choose
between IP address and resolved host name.
From: Grzegorz Sampolski <grzsmp@gmail.com>
Reviewed-by: Haribabu Kommi <kommi.haribabu@gmail.com>
We still use replacement selection for the first run of the sort only
and only when the number of tuples is relatively small. Otherwise,
the first run, and subsequent runs in all cases, are produced using
quicksort. This tends to be faster except perhaps for very small
amounts of working memory.
Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy,
Greg Stark, and me.
Contention on the relation extension lock can become quite fierce when
multiple processes are inserting data into the same relation at the same
time at a high rate. Experimentation shows the extending the relation
multiple blocks at a time improves scalability.
Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
In cases where joins use multiple columns we currently assess each join
separately causing gross mis-estimates for join cardinality.
This patch adds use of FK information for the first time into the
planner. When FKs are present and we have multi-column join information,
plan estimates will be drastically improved. Cases with multiple FKs
are handled, though partial matches are ignored currently.
Net effect is substantial performance improvements for joins in many
common cases. Additional planning time is isolated to cases that are
currently performing poorly, measured at 0.08 - 0.15 ms.
Please watch for planner performance regressions; circumstances seem
unlikely but the law of unintended consequences may apply somewhen.
Additional complex tests welcome to prove this before release.
Tests can be performed using SET enable_fkey_estimates = on | off
using scripts provided during Hackers discussions, message id:
552335D9.3090707@2ndquadrant.com
Authors: Tomas Vondra and David Rowley
Reviewed and tested by Simon Riggs, adding comments only
Extracted from pending unique-join patch, since this is a rather large
delta but it's simply moving code out into separately-accessible
subroutines.
I (tgl) did choose to add a bit more logic to rel_supports_distinctness,
so that it verifies that there's at least one potentially usable unique
index rather than just checking indexlist != NIL. Otherwise there's
no functional change here.
David Rowley
While prior to this patch the user-visible effect on the database
of any set of successfully committed serializable transactions was
always consistent with some one-at-a-time order of execution of
those transactions, the presence of declarative constraints could
allow errors to occur which were not possible in any such ordering,
and developers had no good workarounds to prevent user-facing
errors where they were not necessary or desired. This patch adds
a check for serialization failure ahead of duplicate key checking
so that if a developer explicitly (redundantly) checks for the
pre-existing value they will get the desired serialization failure
where the problem is caused by a concurrent serializable
transaction; otherwise they will get a duplicate key error.
While it would be better if the reads performed by the constraints
could count as part of the work of the transaction for
serialization failure checking, and we will hopefully get there
some day, this patch allows a clean and reliable way for developers
to work around the issue. In many cases existing code will already
be doing the right thing for this to "just work".
Author: Thomas Munro, with minor editing of docs by me
Reviewed-by: Marko Tiikkaja, Kevin Grittner
Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery.
On-disk and binary in/out format of tsquery are backward compatible.
It has two side effect:
- change order for tsquery, so, users, who has a btree index over tsquery,
should reindex it
- less number of parenthesis in tsquery output, and tsquery becomes more
readable
Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov
Commit c22650cd64 sparked a discussion
about diverse interpretations of "token user" in error messages. Expel
old and new specimens of that phrase by making all GetTokenInformation()
callers report errors the way GetTokenUser() has been reporting them.
These error conditions almost can't happen, so users are unlikely to
observe this change.
Reviewed by Tom Lane and Stephen Frost.
Now that pg_dump will properly dump out any ACL changes made to
functions which exist in pg_catalog, switch to using the GRANT system
to manage access to those functions.
This means removing 'if (!superuser()) ereport()' checks from the
functions themselves and then REVOKEing EXECUTE right from 'public' for
these functions in system_views.sql.
Reviews by Alexander Korotkov, Jose Luis Tallon
Now that all of the infrastructure exists, add in the ability to
dump out the ACLs of the objects inside of pg_catalog or the ACLs
for objects which are members of extensions, but only if they have
been changed from their original values.
The original values are tracked in pg_init_privs. When pg_dump'ing
9.6-and-above databases, we will dump out the ACLs for all objects
in pg_catalog and the ACLs for all extension members, where the ACL
has been changed from the original value which was set during either
initdb or CREATE EXTENSION.
This should not change dumps against pre-9.6 databases.
Reviews by Alexander Korotkov, Jose Luis Tallon
This new catalog holds the privileges which the system was
initialized with at initdb time, along with any permissions set
by extensions at CREATE EXTENSION time. This allows pg_dump
(and any other similar use-cases) to detect when the privileges
set on initdb-created or extension-created objects have been
changed from what they were set to at initdb/extension-creation
time and handle those changes appropriately.
Reviews by Alexander Korotkov, Jose Luis Tallon
It inserts a new value into an jsonb array at arbitrary position or
a new key to jsonb object.
Author: Dmitry Dolgov
Reviewers: Petr Jelinek, Vitaly Burovoy, Andrew Dunstan
API and mechanism to allow generic messages to be inserted into WAL that are
intended to be read by logical decoding plugins. This commit adds an optional
new callback to the logical decoding API.
Messages are either text or bytea. Messages can be transactional, or not, and
are identified by a prefix to allow multiple concurrent decoding plugins.
(Not to be confused with Generic WAL records, which are intended to allow crash
recovery of extensible objects.)
Author: Petr Jelinek and Andres Freund
Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs
Discussion: 5685F999.6010202@2ndquadrant.com
Previously synchronous replication offered only the ability to confirm
that all changes made by a transaction had been transferred to at most
one synchronous standby server.
This commit extends synchronous replication so that it supports multiple
synchronous standby servers. It enables users to consider one or more
standby servers as synchronous, and increase the level of transaction
durability by ensuring that transaction commits wait for replies from
all of those synchronous standbys.
Multiple synchronous standby servers are configured in
synchronous_standby_names which is extended to support new syntax of
'num_sync ( standby_name [ , ... ] )', where num_sync specifies
the number of synchronous standbys that transaction commits need to
wait for replies from and standby_name is the name of a standby
server.
The syntax of 'standby_name [ , ... ]' which was used in 9.5 or before
is also still supported. It's the same as new syntax with num_sync=1.
This commit doesn't include "quorum commit" feature which was discussed
in pgsql-hackers. Synchronous standbys are chosen based on their priorities.
synchronous_standby_names determines the priority of each standby for
being chosen as a synchronous standby. The standbys whose names appear
earlier in the list are given higher priority and will be considered as
synchronous. Other standby servers appearing later in this list
represent potential synchronous standbys.
The regression test for multiple synchronous standbys is not included
in this commit. It should come later.
Authors: Sawada Masahiko, Beena Emerson, Michael Paquier, Fujii Masao
Reviewed-By: Kyotaro Horiguchi, Amit Kapila, Robert Haas, Simon Riggs,
Amit Langote, Thomas Munro, Sameer Thakur, Suraj Kharage, Abhijit Menon-Sen,
Rajeev Rastogi
Many thanks to the various individuals who were involved in
discussing and developing this feature.
This introduces a new dependency type which marks an object as depending
on an extension, such that if the extension is dropped, the object
automatically goes away; and also, if the database is dumped, the object
is included in the dump output. Currently the grammar supports this for
indexes, triggers, materialized views and functions only, although the
utility code is generic so adding support for more object types is a
matter of touching the parser rules only.
Author: Abhijit Menon-Sen
Reviewed-by: Alexander Korotkov, Álvaro Herrera
Discussion: http://www.postgresql.org/message-id/20160115062649.GA5068@toroid.org
has_parallel_hazard() was ignoring the proparallel markings for
aggregates, which is no good. Fix that. There was no way to mark
an aggregate as actually being parallel-safe, either, so add a
PARALLEL option to CREATE AGGREGATE.
Patch by me, reviewed by David Rowley.
Experimentation shows this only costs about 6kB, which seems well
worth it given the major performance effects that can be caused
by insufficient alignment, especially on larger systems.
Discussion: 14166.1458924422@sss.pgh.pa.us
This lets us use parallel aggregate for a variety of useful cases
that didn't work before, like sum(int8), sum(numeric), several
versions of avg(), and various other functions.
Add some regression tests, as well, testing the general sanity of
these and future catalog entries.
David Rowley, reviewed by Tomas Vondra, with a few further changes
by me.
Previously non-exclusive backups had to be done using the replication protocol
and pg_basebackup. With this commit it's now possible to make them using
pg_start_backup/pg_stop_backup as well, as long as the backup program can
maintain a persistent connection to the database.
Doing this, backup_label and tablespace_map are returned as results from
pg_stop_backup() instead of being written to the data directory. This makes
the server safe from a crash during an ongoing backup, which can be a problem
with exclusive backups.
The old syntax of the functions remain and work exactly as before, but since the
new syntax is safer this should eventually be deprecated and removed.
Only reference documentation is included. The main section on backup still needs
to be rewritten to cover this, but since that is already scheduled for a separate
large rewrite, it's not included in this patch.
Reviewed by David Steele and Amit Kapila
As noted by Julian Schauder in bug #14063, the configuration-file parser
doesn't support embedded newlines in string literals. While there might
someday be a good reason to remove that restriction, there doesn't seem
to be one right now. However, ALTER SYSTEM SET could accept strings
containing newlines, since many of the variable-specific value-checking
routines would just see a newline as whitespace. This led to writing a
postgresql.auto.conf file that was broken and had to be removed manually.
Pending a reason to work harder, just throw an error if someone tries this.
In passing, fix several places in the ALTER SYSTEM logic that failed to
provide an errcode() for an ereport(), and thus would falsely log the
failure as an internal XX000 error.
Back-patch to 9.4 where ALTER SYSTEM was introduced.
Alex Shulgin complained that the underlying strategy wasn't all that
apparent, particularly not the fact that we intentionally have two
code paths depending on whether we think the column has a limited set
of possible values or not. Try to make it clearer.
On reflection, the pre-existing logic in ANALYZE is specifically meant to
compare the frequency of a candidate MCV against the estimated frequency of
a random distinct value across the whole table. The change to compare it
against the average frequency of values actually seen in the sample doesn't
seem very principled, and if anything it would make us less likely not more
likely to consider a value an MCV. So revert that, but keep the aspect of
considering only nonnull values, which definitely is correct.
In passing, rename the local variables in these stanzas to
"ndistinct_table", to avoid confusion with the "ndistinct" that appears at
an outer scope in compute_scalar_stats.
This elevel is useful for logging audit messages and similar information
that should not be passed to the client. It's equivalent to LOG in terms
of decisions about logging priority in the postmaster log, but messages
with this elevel will never be sent to the client.
In the current implementation, it's just an alias for the longstanding
COMMERROR elevel (or more accurately, we've made COMMERROR an alias for
this). At some point it might be interesting to allow a LOG_ONLY flag to
be attached to any elevel, but that would be considerably more complicated,
and it's not clear there's enough use-cases to justify the extra work.
For now, let's just take the easy 90% solution.
David Steele, reviewed by Fabien Coelho, Petr Jelínek, and myself
The first iteration of the signal-checking loop would compute sigmask(0)
which expands to 1<<(-1) which is undefined behavior according to the
C standard. The lack of field reports of trouble suggest that it
evaluates to 0 on all existing Windows compilers, but that's hardly
something to rely on. Since signal 0 isn't a queueable signal anyway,
we can just make the loop iterate from 1 instead, and save a few cycles
as well as avoiding the undefined behavior.
In passing, avoid evaluating the volatile expression UNBLOCKED_SIGNAL_QUEUE
twice in a row; there's no reason to waste cycles like that.
Noted by Aleksander Alekseev, though this isn't his proposed fix.
Back-patch to all supported branches.
When adjusting the estimate for the number of distinct values from a
rel in a grouped query to take into account the selectivity of the
rel's restrictions, use a formula that is less likely to produce
under-estimates.
The old formula simply multiplied the number of distinct values in the
rel by the restriction selectivity, which would be correct if the
restrictions were fully correlated with the grouping expressions, but
can produce significant under-estimates in cases where they are not
well correlated.
The new formula is based on the random selection probability, and so
assumes that the restrictions are not correlated with the grouping
expressions. This is guaranteed to produce larger estimates, and of
course risks over-estimating in cases where the restrictions are
correlated, but that has less severe consequences than
under-estimating, which might lead to a HashAgg that consumes an
excessive amount of memory.
This could possibly be improved upon in the future by identifying
correlated restrictions and using a hybrid of the old and new
formulae.
Author: Tomas Vondra, with some hacking be me
Reviewed-by: Mark Dilger, Alexander Korotkov, Dean Rasheed and Tom Lane
Discussion: http://www.postgresql.org/message-id/flat/56CD0381.5060502@2ndquadrant.com
If archive_timeout > 0 we should avoid logging XLOG_RUNNING_XACTS if idle.
Bug 13685 reported by Laurence Rowe, investigated in detail by Michael Paquier,
though this is not his proposed fix.
20151016203031.3019.72930@wrigleys.postgresql.org
Simple non-invasive patch to allow later backpatch to 9.4 and 9.5
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.
This commit skips the pin scan that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin
scan is skipped only in replay; the VACUUM code path on master is not
touched here and WAL is identical.
The current commit applies in all cases, effectively replacing commit
687f2cd7a0.
As with the previous patch, large numbers of null rows could skew this
calculation unfavorably, causing us to discard values that have a
legitimate claim to be MCVs, since our definition of MCV is that it's
most common among the non-null population of the column. Hence, make
the numerator of avgcount be the number of non-null sample values not
the number of sample rows; likewise for maxmincount in the
compute_scalar_stats variant.
Also, make the denominator be the number of distinct values actually
observed in the sample, rather than reversing it back out of the computed
stadistinct. This avoids depending on the accuracy of the Haas-Stokes
approximation, and really it's what we want anyway; the threshold should
depend only on what we see in the sample, not on what we extrapolate
about the contents of the whole column.
Alex Shulgin, reviewed by Tomas Vondra and myself
Previously, we included null rows in the values of n and N that went
into the formula, which amounts to considering null as a value in its
own right; but the d and f1 values do not include nulls. This is
inconsistent, and it contributes to significant underestimation of
ndistinct when the column is mostly nulls. In any case stadistinct
is defined as the number of distinct non-null values, so we should
exclude nulls when doing this computation.
This is an aboriginal bug in our application of the Haas-Stokes formula,
but we'll refrain from back-patching for fear of destabilizing plan
choices in released branches.
While at it, make the code a bit more readable by omitting unnecessary
casts and intermediate variables.
Observation and original patch by Tomas Vondra, adjusted to fix both
uses of the formula by Alex Shulgin, cosmetic improvements by me
Our actual convention, contrary to what I said in 59a2111b23, is not to
quote type names, as evidenced by unquoted use of format_type_be()
result value in error messages. Remove quotes from recently tweaked
messages accordingly.
Per note from Tom Lane
This interface is designed to give an access to WAL for extensions which
could implement new access method, for example. Previously it was
impossible because restoring from custom WAL would need to access system
catalog to find a redo custom function. This patch suggests generic way
to describe changes on page with standart layout.
Bump XLOG_PAGE_MAGIC because of new record type.
Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and
minor editorization by my
Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby,
Michael Paquier
Previously, the planner would reject an index-only scan if any restriction
clause for its table used a column not available from the index, even
if that restriction clause would later be dropped from the plan entirely
because it's implied by the index's predicate. This is a fairly common
situation for partial indexes because predicates using columns not included
in the index are often the most useful kind of predicate, and we have to
duplicate (or at least imply) the predicate in the WHERE clause in order
to get the index to be considered at all. So index-only scans were
essentially unavailable with such partial indexes.
To fix, we have to do detection of implied-by-predicate clauses much
earlier in the planner. This patch puts it in check_index_predicates
(nee check_partial_indexes), meaning it gets done for every partial index,
whereas we previously only considered this issue at createplan time,
so that the work was only done for an index actually selected for use.
That could result in a noticeable planning slowdown for queries against
tables with many partial indexes. However, testing suggested that there
isn't really a significant cost, especially not with reasonable numbers
of partial indexes. We do get a small additional benefit, which is that
cost_index is more accurate since it correctly discounts the evaluation
cost of clauses that will be removed. We can also avoid considering such
clauses as potential indexquals, which saves useless matching cycles in
the case where the predicate columns aren't in the index, and prevents
generating bogus plans that double-count the clause's selectivity when
the columns are in the index.
Tomas Vondra and Kyotaro Horiguchi, reviewed by Kevin Grittner and
Konstantin Knizhnik, and whacked around a little by me
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline. This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.
This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.
Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
Some minor tweaks and comment additions, for cleanliness sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.
Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
Formerly, the geometric I/O routines such as box_in and point_out relied
directly on strtod() and sprintf() for conversion of the float8 component
values of their data types. However, the behavior of those functions is
pretty platform-dependent, especially for edge-case values such as
infinities and NaNs. This was exposed by commit acdf2a8b37, which
added test cases involving boxes with infinity endpoints, and immediately
failed on Windows and AIX buildfarm members. We solved these problems
years ago in the main float8in and float8out functions, so let's fix it
by making the geometric types use that code instead of depending directly
on the platform-supplied functions.
To do this, refactor the float8in code so that it can be used to parse
just part of a string, and as a convenience make the guts of float8out
usable without going through DirectFunctionCall.
While at it, get rid of geo_ops.c's fairly shaky assumptions about the
maximum output string length for a double, by having it build results in
StringInfo buffers instead of fixed-length strings.
In passing, convert all the "invalid input syntax for type foo" messages
in this area of the code into "invalid input syntax for type %s" to reduce
the number of distinct translatable strings, per recent discussion.
We would have needed a fair number of the latter anyway for code-sharing
reasons, so we might as well just go whole hog.
Note: this patch is by no means intended to guarantee that the geometric
types uniformly behave sanely for infinity or NaN component values.
But any bugs we have in that line were there all along, they were just
harder to reach in a platform-independent way.
My compiler doesn't like the lack of initialization of "flag", and
I think it's right: if there were zero keys we'd have an undefined
result. The AND of zero items is TRUE, so initialize to TRUE.
Patch implements quad-tree over boxes, naive approach of 2D quad tree will not
work for any non-point objects because splitting space on node is not
efficient. The idea of pathc is treating 2D boxes as 4D points, so,
object will not overlap (in 4D space).
The performance tests reveal that this technique especially beneficial
with too much overlapping objects, so called "spaghetti data".
Author: Alexander Lebedev with editorization by Emre Hasegeli and me
During scan sometimes it would be very helpful to know some information about
parent node or all ancestor nodes. Right now reconstructedValue could be used
but it's not a right usage of it (range opclass uses that).
traversalValue is arbitrary piece of memory in separate MemoryContext while
reconstructedVale should have the same type as indexed column.
Subsequent patches for range opclass and quad4d tree will use it.
Author: Alexander Lebedev, Teodor Sigaev
In this mode, the master waits for the transaction to be applied on
the remote side, not just written to disk. That means that you can
count on a transaction started on the standby to see all commits
previously acknowledged by the master.
To make this work, the standby sends a reply after replaying each
commit record generated with synchronous_commit >= 'remote_apply'.
This introduces a small inefficiency: the extra replies will be sent
even by standbys that aren't the current synchronous standby. But
previously-existing synchronous_commit levels make no attempt at all
to optimize which replies are sent based on what the primary cares
about, so this is no worse, and at least avoids any extra replies for
people not using the feature at all.
Thomas Munro, reviewed by Michael Paquier and by me. Some additional
tweaks by me.
interval_mul() attempts to prevent its calculations from producing silly
results, but it forgot that zero times infinity yields NaN in IEEE
arithmetic. Hence, a case like '1 second'::interval * 'infinity'::float8
produced a NaN for the months product, which didn't trigger the range
check, resulting in bogus and possibly platform-dependent output.
This isn't terribly obvious to the naked eye because if you try that
exact case, you get "interval out of range" which is what you expect
--- but if you look closer, the error is coming from interval_out not
interval_mul. interval_mul has allowed a bogus value into the system.
Fix by adding isnan tests.
Noted while testing Vitaly Burovoy's fix for infinity input to
to_timestamp(). Given the lack of field complaints, I doubt this
is worth a back-patch.
With the original SQL-function implementation, such cases failed because
we don't support infinite intervals. Converting the function to C lets
us bypass the interval representation, which should be a bit faster as
well as more flexible.
Vitaly Burovoy, reviewed by Anastasia Lubennikova
resulttypeLen and resulttypeByVal must be set correctly when serializing
aggregates, not just when finalizing them. This was in David's final
patch but I downloaded the wrong version by mistake and failed to spot
the error.
David Rowley
This is necessary infrastructure for supporting parallel aggregation
for aggregates whose transition type is "internal". Such values
can't be passed between cooperating processes, because they are
just pointers.
David Rowley, reviewed by Tomas Vondra and by me.
Whenever this function is used with the FORMAT_MESSAGE_FROM_SYSTEM flag,
it's good practice to include FORMAT_MESSAGE_IGNORE_INSERTS as well.
Otherwise, if the message contains any %n insertion markers, the function
will try to fetch argument strings to substitute --- which we are not
passing, possibly leading to a crash. This is exactly analogous to the
rule about not giving printf() a format string you're not in control of.
Noted and patched by Christian Ullrich.
Back-patch to all supported branches.
When tsearch was implemented I did several mistakes in hostname/email
definition rules:
1) allow underscore in hostname what prohibited by RFC
2) forget to allow leading digits separated by hyphen (like 123-x.com)
in hostname
3) do no allow underscore/hyphen after leading digits in localpart of email
Artur's patch resolves two last issues, but by the way allows hosts name like
123_x.com together with 123-x.com. RFC forbids underscore usage in hostname
but pg allows that since initial tsearch version in core, although only
for non-digits. Patch syncs support digits and nondigits in both hostname and
email.
Forbidding underscore in hostname may break existsing usage of tsearch and,
anyhow, it should be done by separate patch.
Author: Artur Zakirov
BUG: #13964
Per discussion, the new extensible node framework is thought to be
better designed than the custom path/scan/scanstate stuff we added
in PostgreSQL 9.5. Rework the latter to be more like the former.
This is not backward-compatible, but we generally don't promise that
for C APIs, and there probably aren't many people using this yet
anyway.
KaiGai Kohei, reviewed by Petr Jelinek and me. Some further
cosmetic changes by me.
Commit fbe5a3fb73 accidentally changed
this behavior; put things back the way they were, and add some
regression tests.
Report by Andres Freund; patch by Ashutosh Bapat, with a bit of
kibitzing by me.
Change the slightly different variations of the message
function FOO must return type BAR
to a single wording, removing the variability in type name so that they
all create a single translation entry; since the type name is not to be
translated, there's no point in it being part of the message anyway.
Also, change them all to use the same quoting convention, namely that
the function name is not to be quoted but the type name is. (I'm not
quite sure why this is so, but it's the clear majority.)
Some similar messages such as "encoding conversion function FOO must ..."
are also changed.