I applied the previous-to-last revision of Michaël's submitted patch
instead of the last; these two tweaks pointed out by Craig were left out
of the previous commit by accident.
Some code in the RewindTest test suite is more generally useful than
just for that suite, so put it where other test suites can reach it.
Some postgresql.conf parameters change their default values when a
cluster is initialized with 'allows_streaming' than the previous
behavior; most notably, autovacuum is no longer turned off.
(Also, we no longer call pg_ctl promote with -w, but that flag doesn't
actually do anything in promote so there's no behavior change.)
Author: Michael Paquier
CREATE TABLE .. AS EXECUTE can turn an apparently read-only query into
a write operation, which parallel query can't handle. It's a bit of a
shame that requires us to avoid parallel query for queries prepared via
PREPARE in all cases, but for right now it does.
This is basically a bug fix; the old code assumes that a ForeignScan
is always parallel-safe, but for postgres_fdw, for example, this is
definitely false. It should be true for file_fdw, though, since a
worker can read a file from the filesystem just as well as any other
backend process.
Original patch by Thomas Munro. Documentation, and changes to the
comments, by me.
Also, the dump_info method got split into another method that returns
the stuff as a string instead of just printing it to stdout.
Add a new README in src/test/perl too.
Author: Craig Ringer
Reviewed by: Michaël Paquier
Parallel query can't handle running a query only partially rather than
to completion. However, there seems to be no way to run a statement
prepared via SQL PREPARE other than to completion, so we can enable it
there without a problem.
The situation is more complicated for the extend query protocol.
libpq seems to provide no way to send an Execute message with a
non-zero rowcount, but some other client might. If that happens, and
a parallel plan was chosen, we'll execute the parallel plan without
using any workers, which may be somewhat inefficient but should still
work. Hopefully this won't be a problem; users can always set
max_parallel_degree=0 to avoid choosing parallel plans in the first
place.
Amit Kapila, reviewed by me.
This patch introduces "pg_blocking_pids(int) returns int[]", which returns
the PIDs of any sessions that are blocking the session with the given PID.
Historically people have obtained such information using a self-join on
the pg_locks view, but it's unreasonably tedious to do it that way with any
modicum of correctness, and the addition of parallel queries has pretty
much broken that approach altogether. (Given some more columns in the view
than there are today, you could imagine handling parallel-query cases with
a 4-way join; but ugh.)
The new function has the following behaviors that are painful or impossible
to get right via pg_locks:
1. Correctly understands which lock modes block which other ones.
2. In soft-block situations (two processes both waiting for conflicting lock
modes), only the one that's in front in the wait queue is reported to
block the other.
3. In parallel-query cases, reports all sessions blocking any member of
the given PID's lock group, and reports a session by naming its leader
process's PID, which will be the pg_backend_pid() value visible to
clients.
The motivation for doing this right now is mostly to fix the isolation
tests. Commit 38f8bdcac4 lobotomized
isolationtester's is-it-waiting query by removing its ability to recognize
nonconflicting lock modes, as a crude workaround for the inability to
handle soft-block situations properly. But even without the lock mode
tests, the old query was excessively slow, particularly in
CLOBBER_CACHE_ALWAYS builds; some of our buildfarm animals fail the new
deadlock-hard test because the deadlock timeout elapses before they can
probe the waiting status of all eight sessions. Replacing the pg_locks
self-join with use of pg_blocking_pids() is not only much more correct, but
a lot faster: I measure it at about 9X faster in a typical dev build with
Asserts, and 3X faster in CLOBBER_CACHE_ALWAYS builds. That should provide
enough headroom for the slower CLOBBER_CACHE_ALWAYS animals to pass the
test, without having to lengthen deadlock_timeout yet more and thus slow
down the test for everyone else.
We don't really need this field, because it's either zero or redundant with
PGPROC.pid. The use of zero to mark "not a group leader" is not necessary
since we can just as well test whether lockGroupLeader is NULL. This does
not save very much, either as to code or data, but the simplification seems
worthwhile anyway.
In 4b4b680c3 I accidentally used sizeof(PrivateRefCountArray) instead of
sizeof(PrivateRefCountEntry) when creating the refcount overflow
hashtable. As the former is bigger than the latter, this luckily only
resulted in a slightly increased memory usage when many buffers are
pinned in a backend.
Reported-By: Takashi Horikawa
Discussion: 73FA3881462C614096F815F75628AFCD035A48C3@BPXM01GP.gisp.nec.co.jp
Backpatch: 9.5, where thew new ref count infrastructure was introduced
Coverity griped about use of unchecked strcpy() into a local variable.
There's unlikely to be any actual bug there, since no caller would be
passing a path longer than MAXPGPATH, but nonetheless use of strlcpy()
seems preferable.
While at it, get rid of unmaintainable separation between list of
field names and list of field values in favor of initializing them
in parallel. And we might as well declare get_configdata()'s path
argument as const char *, even though no current caller needs that.
Some over-eager copy-and-pasting on my part resulted in a nonsense
result being returned in this case. I have adopted the same pattern for
handling this case as is used in the one argument form of the function,
i.e. we just skip over the code that adds values to the object.
Diagnosis and patch from Michael Paquier, although not quite his
solution.
Fixes bug #13936.
Backpatch to 9.5 where jsonb_object was introduced.
Reflow text in lock manager README so that it fits within 80 columns.
Correct some mistakes. Expand the README to explain not only why group
locking exists but also the data structures that support it. Improve
comments related to group locking several files. Change the name of a
macro argument for improved clarity.
Most of these problems were reported by Tom Lane, but I found a few
of them myself.
Robert Haas and Tom Lane
Commit 53874c5228 broke various 32-bit
buildfarm machines because it incorrectly used an 'L' suffix for what
needed to be a 64-bit literal. Thanks to Michael Paquier for helping
to diagnose this.
This will parse strings in the format produced by pg_size_pretty() and
return sizes in bytes. This allows queries to be written with clauses
like "pg_total_relation_size(oid) > pg_size_bytes('10 GB')".
Author: Pavel Stehule with various improvements by Vitaly Burovoy
Discussion: http://www.postgresql.org/message-id/CAFj8pRD-tGoDKnxdYgECzA4On01_uRqPrwF-8LdkSE-6bDHp0w@mail.gmail.com
Reviewed-by: Vitaly Burovoy, Oleksandr Shulgin, Kyotaro Horiguchi,
Michael Paquier and Robert Haas
Architecture reference material specifies this order, and s_lock.h
inline assembly agrees. The former order failed to provide mutual
exclusion to lwlock.c and perhaps to other clients. The two xlc
buildfarm members, hornet and mandrill, have failed sixteen times with
duplicate key errors involving pg_class_oid_index or pg_type_oid_index.
Back-patch to 9.5, where commit b64d92f1a5
introduced atomics.
Reviewed by Andres Freund and Tom Lane.
StartupSUBTRANS() incorrectly handled cases near the max pageid in the subtrans
data structure, which in some cases could lead to errors in startup for Hot
Standby.
This patch wraps the pageids correctly, avoiding any such errors.
Identified by exhaustive crash testing by Jeff Janes.
Jeff Janes
It was using %u to read a string that was earlier produced by snprintf with %d
into a signed integer variable. This seems to work in practice but is
incorrect.
found by cppcheck
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
Suppress creation of the pg_upgrade delete script when the new data
directory is inside the old data directory.
Reported-by: IRC
Backpatch-through: 9.3, where delete script tests were added
In commit a5c43b88 the behavior of command line pg_config was
inadvertantly changed to include the config name when specific
configs are requested, similar to when none are requested and
all are emitted. This breaks scripts that expect to use
pg_config for e.g. PGXS. Revert the behavior to the previous.
Move and refactor the underlying code for the pg_config client
application to src/common in support of sharing it with a new
system information SRF called pg_config() which makes the same
information available via SQL. Additionally wrap the SRF with a
new system view, as called pg_config.
Patch by me with extensive input and review by Michael Paquier
and additional review by Alvaro Herrera.
When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.
Peter Geoghegan, reviewd by Andreas Karlsson and by me.
A function name that's double-quoted in SQL can contain almost any
characters, but we were using that name directly as part of the name
generated for the Python-level function, and Python doesn't like
anything that isn't pretty much a standard identifier. To fix,
replace anything that isn't an ASCII letter or digit with an underscore
in the generated name. This doesn't create any risk of duplicate Python
function names because we were already appending the function OID to
the generated name to ensure uniqueness. Per bug #13960 from Jim Nasby.
Patch by Jim Nasby, modified a bit by me. Back-patch to all
supported branches.
In commit 7b4bfc87 the DATA and DESCR entries for the new
row_security_active() function were inadvertantly put after
the PROVOLATILE defines, rather than before as they should
have been placed. Move them up where they belong.
Backpatch to 9.5 where the new entries were introduced.
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al. But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4. The reason for this slowdown is
that if there are independent writes to the underlying devices, for
example because shared buffers is a lot smaller than the hot data set,
or because a checkpoint is ongoing, the fdatasync() calls force cache
flushes to be emitted to the storage.
This is achieved by flushing WAL only if the last flush was longer than
wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC)
unflushed blocks are pending. Based on some tests the default for
wal_writer_delay is 1MB, which seems to work well both on SSD and
rotational media.
To avoid negative performance impact due to 4de82f7d7 an earlier
commit (db76b1e) made SetHintBits() more likely to succeed; preventing
performance regressions in the pgbench tests I performed.
Discussion: 20160118163908.GW10941@awork2.anarazel.de
The original code wasn't careful to test the file descriptor returned by
PQsocket() for an invalid socket. If an invalid socket did turn up,
that would amount to calling FD_ISSET with fd = -1, whereby undefined
behavior can be invoked.
To fix, test file descriptor for validity and stop further processing if
that fails.
Problem noticed by Coverity.
There is an existing FD_ISSET callsite that does check for invalid
sockets beforehand, but the error message reported by it was
strerror(errno); in testing the aforementioned change, that turns out to
result in "bad socket: Success" which isn't terribly helpful. Instead
use PQerrorMessage() in both places which is more likely to contain an
useful error message.
Backpatch-through: 9.1.
Reportedly, some compilers warn about tests like "c < 0" if c is unsigned,
and hence complain about the character range checks I added in commit
3bb3f42f37. This is a bit of a pain since
the regex library doesn't really want to assume that chr is unsigned.
However, since any such reconfiguration would involve manual edits of
regcustom.h anyway, we can put it on the shoulders of whoever wants to
do that to adjust this new range-checking macro correctly.
Per gripes from Coverity and Andres.
Previously we only allowed SetHintBits() to succeed if the commit LSN of
the last transaction touching the page has already been flushed to
disk. We can't generally change the LSN of the page, because we don't
necessarily have the required locks on the page. But the required LSN
interlock does not mean the commit record has to be flushed immediately,
it just requires that the commit record will be flushed before the page is
written out. Therefore if the buffer LSN is newer than the commit LSN,
the hint bit can be safely set.
In a number of scenarios (e.g. pgbench) this noticeably increases the
number of hint bits are set. But more importantly it also keeps the
success rate up when flushing WAL less frequently. That was the original
reason for commit 4de82f7d7, which has negative performance consequences
in a number of scenarios. This will allow a followup commit to reduce
the flush rate.
Discussion: 20160118163908.GW10941@awork2.anarazel.de
In REFRESH MATERIALIZED VIEW command, CONCURRENTLY option is only
allowed if there is at least one unique index with no WHERE clause on
one or more columns of the matview. Previously, concurrent refresh
checked the existence of a unique index on the matview after filling
the data to new snapshot, i.e., after calling refresh_matview_datafill().
So, when there was no unique index, we could need to wait a long time
before we detected that and got the error. It was a waste of time.
To eliminate such wasting time, this commit changes concurrent refresh
so that it checks the existence of a unique index at the beginning of
the refresh operation, i.e., before starting any time-consuming jobs.
If CONCURRENTLY option is not allowed due to lack of a unique index,
concurrent refresh can immediately detect it and emit an error.
Author: Masahiko Sawada
Reviewed-by: Michael Paquier, Fujii Masao
The API spec for this function was changed completely (and for the better)
by commit 3cba8999b3, but it didn't bother
with anything as mundane as updating the comments.
NextXID has been rendered in the form of a pg_lsn even though it
really is not. This can cause confusion, so change the format from
%u/%u to %u:%u, per discussion on hackers.
Complaint by me, patch by me and Bruce, reviewed by Michael Paquier
and Alvaro. Applied to HEAD only.
Author: Joe Conway, Bruce Momjian
Reviewed-by: Michael Paquier, Alvaro Herrera
Backpatch-through: master
The previous value of 5s is inadequate for the buildfarm's
CLOBBER_CACHE_ALWAYS animals: they take long enough to do the is-it-waiting
queries that the timeout expires, allowing the database state to change,
before isolationtester is done looking. Perhaps 10s will be enough.
(If it isn't, I'm inclined to reduce the number of sessions involved.)
This mostly reverts commit 9c9782f066.
I left in the parts that rearranged removal of completed waiting steps;
but the idea of not rechecking a step's blocked-ness isn't working.
There is no reason to have the per-thread logfile file pointer as a
separate parameter in various functions: it's much simpler to put it in
the per-thread state struct instead, which is already being passed to
all functions that need the log file anyway. Change the callsites in
which it was used as a boolean to test whether logging is active, so
that they use the use_log global variable instead.
No backpatch, even though this exists since commit a887c486d5 of March
2010, because this is just for cleanliness' sake and the surrounding
code has been modified a lot recently anyway.
It turns out that there is a second race condition in the new deadlock-hard
test: once the deadlock detector fires, it's uncertain whether step s7a8 or
step s8a1 will report first, because killing s8's transaction unblocks s7.
So far, s7 has only been seen to report first in CLOBBER_CACHE_ALWAYS
builds, but it's pretty reproducible there, and in theory it should
sometimes occur in normal builds too. If s7 were a bit slower than usual,
that could also break the test, since the existing expected-file assumes
that we'll see s7a8 report the first time we check it after s8a1 completes.
To fix, add a post-lock delay to s7a8.
If we're retrying a step, then we already decided it was blocked on a lock,
and there's no need to recheck that. The original coding of commit
38f8bdcac4 resulted in a large number of
is-it-waiting queries when dealing with multiple concurrently-blocked
sessions, which is fairly pointless and also results in test failures in
CLOBBER_CACHE_ALWAYS builds, where the is-it-waiting query is quite slow.
This definition also permits appending pg_sleep() calls to steps where it's
needed to control the order of finish of concurrent steps. Before, that
did not work nicely because we'd decide that a step performing a sleep was
not blocked and hang up waiting for it to finish, rather than noticing the
completion of the concurrent step we're supposed to notice first.
In passing, revise handling of removal of completed waiting steps
to make it a bit less messy.
An extensible node is always tagged T_Extensible, but the extnodename
field identifies it more specifically; it may also include arbitrary
private data. Extensible nodes can be copied, tested for equality,
serialized, and deserialized, but the core system doesn't know
anything about them otherwise. Some extensions may find it useful to
include these nodes in fdw_private or custom_private lists in lieu of
arm-wrestling their data into a format that the core code can
understand.
Along the way, so as not to burden the authors of such extensible
node types too much, expose the functions for writing serialized
tokens, and for serializing and deserializing bitmapsets.
KaiGai Kohei, per a design suggested by me. Reviewed by Andres Freund
and by me, and further edited by me.
The new deadlock-soft-2 test has a timing dependency too: it supposes
that isolationtester will detect step s1b as waiting before the deadlock
detector runs and grants it the lock. Adjust deadlock_timeout to ensure
that that's true even in CLOBBER_CACHE_ALWAYS builds, where the wait
detection query is quite slow. Per buildfarm member jaguarundi.
If we ever get around to allowing functional dependency to be proven
from other things besides simple primary keys, this code will need to
be rethought, but that was true anyway. In the meantime, we might as
well not have two very-similar routines for scanning pg_constraint.
David Rowley, reviewed by Julien Rouhaud
If a GROUP BY clause includes all columns of a non-deferred primary key,
as well as other columns of the same relation, those other columns are
redundant and can be dropped from the grouping; the pkey is enough to
ensure that each row of the table corresponds to a separate group.
Getting rid of the excess columns will reduce the cost of the sorting or
hashing needed to implement GROUP BY, and can indeed remove the need for
a sort step altogether.
This seems worth testing for since many query authors are not aware of
the GROUP-BY-primary-key exception to the rule about queries not being
allowed to reference non-grouped-by columns in their targetlists or
HAVING clauses. Thus, redundant GROUP BY items are not uncommon. Also,
we can make the test pretty cheap in most queries where it won't help
by not looking up a rel's primary key until we've found that at least
two of its columns are in GROUP BY.
David Rowley, reviewed by Julien Rouhaud
A pending patch requires exporting a function returning Bitmapset from
catalog/pg_constraint.c. As things stand, that would mean including
nodes/bitmapset.h in pg_constraint.h, which might be hazardous for the
client-side includability of that header. It's not entirely clear whether
any client-side code needs to include pg_constraint.h, but it seems prudent
to assume that there is some such code somewhere. Therefore, split off the
function definitions into a new file pg_constraint_fn.h, similarly to what
we've done for some other catalog header files.
Historically this message has been emitted at the end of ShutdownXLOG().
That's not an insane place for it in a standalone backend, but in the
postmaster environment we've grown a fair amount of stuff that happens
later, including archiver/walsender shutdown, stats collector shutdown,
etc. Recent buildfarm experimentation showed that on slower machines
there could be many seconds' delay between finishing ShutdownXLOG() and
actual postmaster exit. That's fairly confusing, both for testing
purposes and for DBAs. Hence, move the code that prints this message
into UnlinkLockFiles(), so that it comes out just after we remove the
postmaster's pidfile. That is a more appropriate definition of "is shut
down" from the point of view of "pg_ctl stop", for example. In general,
removing the pidfile should be the last externally-visible action of
either a postmaster or a standalone backend; compare commit
d73d14c271 for instance. So this seems
like a reasonably future-proof approach.
This finishes the work - spread across many commits over the last
several months - of putting each type of lock other than the named
individual locks into a separate tranche.
Amit Kapila
The original formulation of 4c9864b9b4
was extremely timing-sensitive, because it arranged for the deadlock
detector to be running (and possibly unblocking the current query)
at almost exactly the same time as isolationtester would be probing
to see if the query is blocked. The committed expected-file assumed
that the deadlock detection would finish first, but we see the opposite
on both fast and slow buildfarm animals. Adjust the deadlock timeout
settings to make it predictable that isolationtester *will* see the
query as waiting before deadlock detection unblocks it.
I used a 5s timeout for the same reasons mentioned in
a7921f71a3.
Fix a few oversights in 38f8bdcac4982215beb9f65a19debecaf22fd470:
don't leak memory in run_permutation(), remember when we've issued
a cancel rather than issuing another one every 10ms,
fix some typos in comments.
Commit 0e141c0fbb introduced a new
facility to reduce ProcArrayLock contention by clearing several XIDs
from the ProcArray under a single lock acquisition. The names
initially chosen were deemed not to be very good choices, so commit
4aec49899e renamed them. But now it
seems like we still didn't get it right. A pending patch wants to
add similar infrastructure for batching CLOG updates, so the names
need to be clear enough to allow a new set of structure members with
a related purpose.
Amit Kapila
This allows testing of deadlock scenarios. Scenarios that would
previously have been considered invalid are now simply taken as a
scenario in which more than one backend will wait.
This is a necessary prerequisite for forthcoming changes to allow deadlock
scenarios to be tested by the isolation tester. It is also a good idea on
general principle, since these scenarios add no useful test coverage not
provided by other scenarios, but do to take time to execute.
Thirty seconds was not consistently enough for promotion to complete on
buildfarm members sungazer and tern. Experiments suggest 43s would have
been enough. Back-patch to 9.5, where pg_rewind was introduced.
Many automated test suites call pg_ctl. Buildfarm members axolotl,
hornet, mandrill, shearwater, sungazer and tern have failed when server
shutdown took longer than the pg_ctl default 60s timeout. This addition
permits slow hosts to easily raise the timeout without us editing a
--timeout argument into every test suite pg_ctl call. Back-patch to 9.1
(all supported versions) for the sake of automated testing.
Reviewed by Tom Lane.
It turns out that on FreeBSD-derived platforms (including OS X), the
*scanf() family of functions is pretty much brain-dead about multibyte
characters. In particular it will apply isspace() to individual bytes
of input even when those bytes are part of a multibyte character, thus
allowing false recognition of a field-terminating space.
We appear to have little alternative other than instituting a coding
rule that *scanf() is not to be used if the input string might contain
multibyte characters. (There was some discussion of relying on "%ls",
but that probably just moves the portability problem somewhere else,
and besides it doesn't fully prevent BSD *scanf() from using isspace().)
This patch is a down payment on that: it gets rid of use of sscanf()
to parse ispell dictionary files, which are certainly at great risk
of having a problem. The code is cleaner this way anyway, though
a bit longer.
In passing, improve a few comments.
Report and patch by Artur Zakirov, reviewed and somewhat tweaked by me.
Back-patch to all supported branches.
As of commit c1772ad922, there's no
longer any way of requesting additional LWLocks in the main tranche,
so we don't need NumLWLocks() or LWLockAssign() any more. Also,
some of the allocation counters that we had previously aren't needed
any more either.
Amit Kapila
Further investigation says that there may be some slow operations after
we've finished ShutdownXLOG(), so add some more log messages to try to
isolate that. This is all temporary code too.
Early returns from the buildfarm show that there's a bit of a gap in the
logging I added in 3971f64843b02e4a: the portion of CreateCheckPoint()
after CheckPointGuts() can take a fair amount of time. Add a few more
log messages in that section of code. This too shall be reverted later.
This is a quick hack, due to be reverted when its purpose has been served,
to try to gather information about why some of the buildfarm critters
regularly fail with "postmaster does not shut down" complaints. Maybe they
are just really overloaded, but maybe something else is going on. Hence,
instrument pg_ctl to print the current time when it starts waiting for
postmaster shutdown and when it gives up, and add a lot of logging of the
current time in the server's checkpoint and shutdown code paths.
No attempt has been made to make this pretty. I'm not even totally sure
if it will build on Windows, but we'll soon find out.
Since pgindent treats typedef names as global, the original coding of
b47b4dbf68 would have had rather nasty effects on the formatting
of other files in which "string" is used as a variable or field name.
Use a less generic name for this typedef, and rename some other
identifiers to match.
Peter Geoghegan, per gripe from me
Previously, our regex code defined CHR_MAX as 0xfffffffe, which is a
bad choice because it is outside the range of type "celt" (int32).
Characters approaching that limit could lead to infinite loops in logic
such as "for (c = a; c <= b; c++)" where c is of type celt but the
range bounds are chr. Such loops will work safely only if CHR_MAX+1
is representable in celt, since c must advance to beyond b before the
loop will exit.
Fortunately, there seems no reason not to restrict CHR_MAX to 0x7ffffffe.
It's highly unlikely that Unicode will ever assign codes that high, and
none of our other backend encodings need characters beyond that either.
In addition to modifying the macro, we have to explicitly enforce character
range restrictions on the values of \u, \U, and \x escape sequences, else
the limit is trivially bypassed.
Also, the code for expanding case-independent character ranges in bracket
expressions had a potential integer overflow in its calculation of the
number of characters it could generate, which could lead to allocating too
small a character vector and then overwriting memory. An attacker with the
ability to supply arbitrary regex patterns could easily cause transient DOS
via server crashes, and the possibility for privilege escalation has not
been ruled out.
Quite aside from the integer-overflow problem, the range expansion code was
unnecessarily inefficient in that it always produced a result consisting of
individual characters, abandoning the knowledge that we had a range to
start with. If the input range is large, this requires excessive memory.
Change it so that the original range is reported as-is, and then we add on
any case-equivalent characters that are outside that range. With this
approach, we can bound the number of individual characters allowed without
sacrificing much. This patch allows at most 100000 individual characters,
which I believe to be more than the number of case pairs existing in
Unicode, so that the restriction will never be hit in practice.
It's still possible for range() to take awhile given a large character code
range, so also add statement-cancel detection to its loop. The downstream
function dovec() also lacked cancel detection, and could take a long time
given a large output from range().
Per fuzz testing by Greg Stark. Back-patch to all supported branches.
Security: CVE-2016-0773
Commit 7f46eaf added the regression test which checks that
gin_clean_pending_list() cleans up the GIN pending list and returns >0.
This usually works fine. But if autovacuum comes along and cleans
the list before gin_clean_pending_list() starts, the function may
return 0, and then the regression test may fail.
To fix the problem, this commit disables autovacuum on the target
index of gin_clean_pending_list() by setting autovacuum_enabled
reloption to off when creating the table.
Also this commit sets gin_pending_list_limit reloption to 4MB on
the target index. Otherwise when running "make installcheck" with
small gin_pending_list_limit GUC, insertions of data may trigger
the cleanup of pending list before gin_clean_pending_list() starts
and the function may return 0. This could cause the regression test
to fail.
Per buildfarm member spoonbill.
Reported-By: Tom Lane
In 61444bfb we started to allow HAVING clauses to be fully pushed down
into WHERE, even when grouping sets are in use. That turns out not to
work correctly, because grouping sets can "produce" NULLs, meaning that
filtering in WHERE and HAVING can have different results, even when no
aggregates or volatile functions are involved.
Instead only allow pushdown of empty grouping sets.
It'd be nice to do better, but the exact mechanics of deciding which
cases are safe are still being debated. It's important to give correct
results till we find a good solution, and such a solution might not be
appropriate for backpatching anyway.
Bug: #13863
Reported-By: 'wrb'
Diagnosed-By: Dean Rasheed
Author: Andrew Gierth
Reviewed-By: Dean Rasheed and Andres Freund
Discussion: 20160113183558.12989.56904@wrigleys.postgresql.org
Backpatch: 9.5, where grouping sets were introduced
The parser doesn't allow qualification of column names appearing in
these clauses, but ruleutils.c would sometimes qualify them, leading
to dump/reload failures. Per bug #13891 from Onder Kalaci.
(In passing, make stanzas in ruleutils.c that save/restore varprefix
more consistent.)
Peter Geoghegan
Commit 45f6240a8f added an assumption in ExecHashIncreaseNumBatches
and ExecHashIncreaseNumBuckets that they could find all tuples in the main
hash table by iterating over the "dense storage" introduced by that patch.
However, ExecHashRemoveNextSkewBucket continued its old practice of simply
re-linking deleted skew tuples into the main table's hashchains. Hence,
such tuples got lost during any subsequent increase in nbatch or nbuckets,
and would never get joined, as reported in bug #13908 from Seth P.
I (tgl) think that the aforesaid commit has got multiple design issues
and should be reworked rather completely; but there is no time for that
right now, so band-aid the problem by making ExecHashRemoveNextSkewBucket
physically copy deleted skew tuples into the "dense storage" arena.
The added test case is able to exhibit the problem by means of fooling the
planner with a WHERE condition that it will underestimate the selectivity
of, causing the initial nbatch estimate to be too small.
Tomas Vondra and Tom Lane. Thanks to David Johnston for initial
investigation into the bug report.
When force_parallel_mode = true, we enable the parallel mode restrictions
for all queries for which this is believed to be safe. For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process. When force_parallel_mode = regress, make additional
changes to allow the regression tests to run cleanly even though parallel
workers have been injected under the hood.
Taken together, this facilitates both better user testing and better
regression testing of the parallelism code.
Robert Haas, with help from Amit Kapila and Rushabh Lathia.
For locking purposes, we now regard heavyweight locks as mutually
non-conflicting between cooperating parallel processes. There are some
possible pitfalls to this approach that are not to be taken lightly,
but it works OK for now and can be changed later if we find a better
approach. Without this, it's very easy for parallel queries to
silently self-deadlock if the user backend holds strong relation locks.
Robert Haas, with help from Amit Kapila. Thanks to Noah Misch and
Andres Freund for extensive discussion of possible issues with this
approach.
It seems that sprintf(), at least in glibc's version, is unreasonably slow
compared to hand-rolled code for printing integers. Replacing most uses of
sprintf() in the datetime.c output functions with special-purpose code
turns out to give more than a 2X speedup in COPY of a table with a single
timestamp column; which is pretty impressive considering all the other
logic in that code path.
David Rowley and Andres Freund, reviewed by Peter Geoghegan and myself
Commit 30d7ae3c76 introduced an HJDEBUG
stanza that probably didn't compile at the time, and definitely doesn't
compile now, because it refers to a nonexistent variable. It doesn't seem
terribly useful anyway, so just get rid of it.
While I'm fooling with it, use %z modifier instead of the obsolete hack of
casting size_t to unsigned long, and include the HashJoinTable's address in
each printout so that it's possible to distinguish the activities of
multiple hashjoins occurring in one query.
Noted while trying to use HJDEBUG to investigate bug #13908. Back-patch
to 9.5, because code that doesn't compile is certainly not very helpful.
Future PL/Java versions will close CVE-2016-0766 by making these GUCs
PGC_SUSET. This PostgreSQL change independently mitigates that PL/Java
vulnerability, helping sites that update PostgreSQL more frequently than
PL/Java. Back-patch to 9.1 (all supported versions).
Commit a104a017fc has this check because
I added it to the submitted patch before commit, but that was entirely
wrongheaded, as explained to me by Ashutosh Bapat, who also wrote this
patch.
An example use-case is "CHECK(num_nonnulls(a,b,c) = 1)" to assert that
exactly one of a,b,c isn't NULL. The functions are variadic, so they
can also be pressed into service to count the number of null or nonnull
elements in an array.
Marko Tiikkaja, reviewed by Pavel Stehule
GetExistingLocalJoinPath() is useful for handling EvalPlanQual rechecks
properly, and GetUserMappingById() is needed to make sure you're using
the right credentials.
Shigeru Hanada, Etsuro Fujita, Ashutosh Bapat, Robert Haas
The previous RequestAddinLWLocks() method had several disadvantages.
First, the locks would be in the main tranche; we've recently decided
that it's useful for LWLocks used for separate purposes to have
separate tranche IDs. Second, there wasn't any correlation between
what code called RequestAddinLWLocks() and what code called
LWLockAssign(); when multiple modules are in use, it could become
quite difficult to troubleshoot problems where LWLockAssign() ran out
of locks. To fix, create a concept of named LWLock tranches which
can be used either by extension or by core code.
Amit Kapila and Robert Haas
If a view is split into CREATE TABLE + CREATE RULE to break a circular
dependency, then any triggers on the view must be dumped/reloaded after
the CREATE RULE; else the backend may reject the CREATE TRIGGER because
it's the wrong type of trigger for a plain table. This works all right
in plain dump/restore because of pg_dump's sorting heuristic that places
triggers after rules. However, when using parallel restore, the ordering
must be enforced by a dependency --- and we didn't have one.
Fixing this is a mere matter of adding an addObjectDependency() call,
except that we need to be able to find all the triggers belonging to the
view relation, and there was no easy way to do that. Add fields to
pg_dump's TableInfo struct to remember where the associated TriggerInfo
struct(s) are.
Per bug report from Dennis Kögel. The failure can be exhibited at least
as far back as 9.1, so back-patch to all supported branches.
Have varlena.c expose an interface that allows the char(n), bytea, and
bpchar types to piggyback on a now-generalized SortSupport for text.
This pushes a little more knowledge of the bpchar/char(n) type into
varlena.c than might be preferred, but that seems like the approach
that creates least friction. Also speed things up for index builds
that use text_pattern_ops or varchar_pattern_ops.
This patch does quite a bit of renaming, but it seems likely to be
worth it, so as to avoid future confusion about the fact that this code
is now more generally used than the old names might have suggested.
Peter Geoghegan, reviewed by Álvaro Herrera and Andreas Karlsson,
with small tweaks by me.
This patch doesn't put the new infrastructure to use anywhere, and
indeed it's not clear how it could ever be used for something like
postgres_fdw which has to send an SQL query and wait for a reply,
but there might be FDWs or custom scan providers that are CPU-bound,
so let's give them a way to join club parallel.
KaiGai Kohei, reviewed by me.
You can't really do anything useful with this in the form it currently
exists; among other problems, there's no way to reread whatever
information might be produced when the path is output. Work is
underway to replace this with a more useful and more general system of
extensible nodes, but let's start by getting rid of this bit.
Extracted from a larger patch by KaiGai Kohei.
Commit e09996ff8d was one brick shy of a load: it didn't insist
that the detected JSON number be the whole of the supplied string.
This allowed inputs such as "2016-01-01" to be misdetected as valid JSON
numbers. Per bug #13906 from Dmitry Ryabov.
In passing, be more wary of zero-length input (I'm not sure this can
happen given current callers, but better safe than sorry), and do some
minor cosmetic cleanup.
Insert sd_notify() calls at server start and stop for integration with
systemd. This allows the use of systemd service units of type "notify",
which greatly simplifies the systemd configuration.
Reviewed-by: Pavel Stěhule <pavel.stehule@gmail.com>
Previously, the first error seen would be that postgresql.conf does not
exist. But for the case where the whole directory does not exist, give
an error message about that, together with a hint for how to create one.
This field was included in the original definition of the printQueryOpt
struct in commit a45195a191, but it was not used anywhere in that
commit, nor since then. Spotted by Dickson S. Guedes.
create_foreignscan_plan needs to know whether any system columns are
requested from a relation (this flag is needed by ForeignNext during
execution). However, for join relations this is a pointless test,
because it's not possible to request system columns from them, so
remove the check.
Author: Etsuro Fujita
Discussion: http://www.postgresql.org/message-id/56AA0FC5.9000207@lab.ntt.co.jp
Reviewed-by: David Rowley, Robert Haas
Apparently at least one committer hasn't gotten the word that these do not
need to be maintained by hand, since initdb will create them automatically.
Noted while fixing bug #13905.
No catversion bump since the post-initdb state is exactly the same either
way. I don't see a need for back-patch, either.
All the other jsonb function descriptions refer to the arguments as being
"jsonb", but these two said "json". Make it consistent. Per bug #13905
from Petru Florin Mihancea.
No catversion bump --- we can't force one in the back branches, and this
isn't very critical anyway.
KNN GiST with recheck flag should return to executor the same type as ordering
operator, GiST detects this type by looking to return type of function which
implements ordering operator. But occasionally detecting code works after
replacing ordering operator function to distance support function.
Distance support function always returns float8, so, detecting code get float8
instead of actual return type of ordering operator.
Built-in opclasses don't have ordering operator which doesn't return
non-float8 value, so, tests are impossible here, at least now.
Backpatch to 9.5 where lozzy KNN was introduced.
Author: Alexander Korotkov
Report by: Artur Zakirov
Provide per-script statistical info (count of transactions executed
under that script, average latency for the whole script) after a
multi-script run, adding an intermediate level of detail to existing
global stats and per-command stats.
Author: Fabien Coelho
Reviewer: Michaël Paquier, Álvaro Herrera
Dividing INT_MIN by -1 or taking INT_MIN modulo -1 can sometimes
cause floating-point exceptions or otherwise misbehave.
Fabien Coelho and Michael Paquier
Add
- ALTER SYSTEM SET/RESET ... -> GUC variables
- ALTER TABLE ... SET WITH -> OIDS
- ALTER DATABASE/FUNCTION/ROLE/USER ... SET/RESET -> GUC variables
- ALTER DATABASE/FUNCTION/ROLE/USER ... SET ... -> FROM CURRENT/TO
- ALTER DATABASE/FUNCTION/ROLE/USER ... SET ... TO/= -> possible values
Author: Fujii Masao
Reviewed-by: Michael Paquier, Masahiko Sawada
This is following in a long train of similar changes and for the same
reasons - see b319356f0e and
fe702a7b3f inter alia.
Author: Amit Kapila
Reviewed-by: Alexander Korotkov, Robert Haas
Previously, each PGPROC's backendLock was part of the main tranche,
and the PGPROC just contained a pointer. Now, the actual LWLock is
part of the PGPROC.
As with previous, similar patches, this makes it significantly easier
to identify these lwlocks in LWLOCK_STATS or Trace_lwlocks output
and improves modularity.
Author: Ildus Kurbangaliev
Reviewed-by: Amit Kapila, Robert Haas
This doesn't add any functionality but just shuffles things around so
that it can be reused and improved later.
Author: Fabien Coelho
Reviewed-by: Michael Paquier, Álvaro Herrera
listForeignTables' invocation of processSQLNamePattern did not match up
with the other ones that handle potentially-schema-qualified names; it
failed to make use of pg_table_is_visible() and also passed the name
arguments in the wrong order. Bug seems to have been aboriginal in commit
0d692a0dc9. It accidentally sort of worked as long as you didn't
inquire too closely into the behavior, although the silliness was later
exposed by inconsistencies in the test queries added by 59efda3e50
(which I probably should have questioned at the time, but didn't).
Per bug #13899 from Reece Hart. Patch by Reece Hart and Tom Lane.
Back-patch to all affected branches.
Previously, the foreign join pushdown infrastructure left the question
of security entirely up to individual FDWs, but it would be easy for
a foreign data wrapper to inadvertently open up subtle security holes
that way. So, make it the core code's job to determine which user
mapping OID is relevant, and don't attempt join pushdown unless it's
the same for all relevant relations.
Per a suggestion from Tom Lane. Shigeru Hanada and Ashutosh Bapat,
reviewed by Etsuro Fujita and KaiGai Kohei, with some further
changes by me.
Previously, postgres_fdw's connection cache was keyed by user OID and
server OID, but this can lead to multiple connections when it's not
really necessary. In particular, if all relevant users are mapped to
the public user mapping, then their connection options are certainly
the same, so one connection can be used for all of them.
While we're cleaning things up here, drop the "server" argument to
GetConnection(), which isn't really needed. This saves a few cycles
because callers no longer have to look this up; the function itself
does, but only when establishing a new connection, not when reusing
an existing one.
Ashutosh Bapat, with a few small changes by me.
This function cleans up the pending list of the GIN index by
moving entries in it to the main GIN data structure in bulk.
It returns the number of pages cleaned up from the pending list.
This function is useful, for example, when the pending list
needs to be cleaned up *quickly* to improve the performance of
the search using GIN index. VACUUM can do the same thing, too,
but it may take days to run on a large table.
Jeff Janes,
reviewed by Julien Rouhaud, Jaime Casanova, Alvaro Herrera and me.
Discussion: CAMkU=1x8zFkpfnozXyt40zmR3Ub_kHu58LtRmwHUKRgQss7=iQ@mail.gmail.com
pg_size_pretty function should be marked immutable rather than volatile
because it always returns the same result given the same argument.
Pavel Stehule
Previously, it was possible to specify one or several custom scripts to
run, or only one of the builtin scripts. With this patch it is also
possible to specify to run the builtin scripts multiple times, using the
new -b option. Also, unify the code for both cases; this eases future
pgbench improvements.
Author: Fabien Coelho
Review: Michaël Paquier, Álvaro Herrera
pgindent for recent commits; also change some variables from int to
boolean, which is how they are really used.
Mostly submitted by Fabien Coelho; this is in preparation to commit
further patches to the file.
We entirely randomly chose to initialize port->remote_host just after
printing the log_connections message, when we could perfectly well do it
just before, allowing %h and %r to work for that message. Per gripe from
Artem Tomyuk.
The original coding was quite fast so long as objects were always
released in reverse order of addition; otherwise, it degenerated into
O(N^2) behavior due to searching for the array element to delete.
Improve matters by switching to hashed storage when the number of
objects of a given type exceeds 64. (The cutover point is open to
discussion, of course, but some simple performance testing suggests
that hashing has enough overhead to be a loser below there.)
Also, refactor resowner.c so that we don't need N copies of the array
management code. Since all the resource IDs the code currently needs
to deal with are either pointers or integers, it seems sufficient to
create a one-size-fits-all infrastructure in which everything is
converted to a Datum for storage.
Aleksander Alekseev, reviewed by Stas Kelvich, further fixes by me
Buildfarm member cockatiel is still saying that cosd(60) isn't 0.5.
What seems likely is that the subexpression (1.0 - cos(x)) isn't being
rounded to double width before more arithmetic is done on it, so force
that by storing it into a variable.
The last round didn't do it. Per Noah Misch, the problem on at least
some machines is that the compiler pre-evaluates trig functions having
constant arguments using code slightly different from what will be used
at runtime. Therefore, we must prevent the compiler from seeing constant
arguments to any of the libm trig functions used in this code.
The method used here might still fail if init_degree_constants() gets
inlined into the call sites. That probably won't happen given the large
number of call sites; but if it does, we could probably fix it by making
init_degree_constants() non-static. I'll avoid that till proven
necessary, though.
The buildfarm isn't very happy with the results of commit e1bd684a34.
To try to get the expected exact results everywhere:
* Replace M_PI / 180 subexpressions with a precomputed constant, so that
the compiler can't decide to rearrange that division with an adjacent
operation. Hopefully this will fix failures to get exactly 0.5 from
sind(30) and cosd(60).
* Add scaling to ensure that tand(45) and cotd(45) give exactly 1; there
was nothing particularly guaranteeing that before.
* Replace minus zero by zero when tand() or cotd() would output that;
many machines did so for tand(180) and cotd(270), but not all. We could
alternatively deem both results valid, but that doesn't seem likely to
be what users will want.
Add
- ALTER FOREIGN DATA WRAPPER -> RENAME TO
- ALTER SERVER -> RENAME TO
- ALTER SERVER ... VERSION ... -> OPTIONS
- CREATE FOREIGN DATA WRAPPER -> OPTIONS
- CREATE SERVER -> OPTIONS
- CREATE|ALTER USER MAPPING -> OPTIONS
From: Andreas Karlsson <andreas@proxel.se>
The original code was adding double quotes to an already-quoted
identifier, leading to nonsensical results. Remove the quoting call.
I introduced the broken code in 7eca575d1c of 9.5 era, so backpatch to
9.5.
Report and patch by Elvis Pranskevichus
Reviewed by Michael Paquier
The implementations go to some lengths to deliver exact results for values
where an exact result can be expected, such as sind(30) = 0.5 exactly.
Dean Rasheed, reviewed by Michael Paquier
Ensure that the trig functions return NaN for NaN input regardless of what
the underlying C library functions might do. Also ensure that an error
is thrown for Inf (or otherwise out-of-range) input, except for atan/atan2
which should accept it.
All these behaviors should now conform to the POSIX spec; previously, all
our popular platforms deviated from that in one case or another.
The main remaining platform dependency here is whether the C library might
choose to throw a domain error for sin/cos/tan inputs that are large but
less than infinity. (Doing so is not unreasonable, since once a single
unit-in-the-last-place exceeds PI, there can be no significance at all in
the result; however there doesn't seem to be any suggestion in POSIX that
such an error is allowed.) We will report such errors if they are reported
via "errno", but not if they are reported via "fetestexcept" which is the
other mechanism sanctioned by POSIX. Some preliminary experiments with
fetestexcept indicated that it might also report errors we could do
without, such as complaining about underflow at an unreasonably large
threshold. So let's skip that complexity for now.
Dean Rasheed, reviewed by Michael Paquier
Commit e529cd4ffa introduced an Assert requiring NAMEDATALEN to be
less than MAX_LEVENSHTEIN_STRLEN, which has been 255 for a long time.
Since up to that instant we had always allowed NAMEDATALEN to be
substantially more than that, this was ill-advised.
It's debatable whether we need MAX_LEVENSHTEIN_STRLEN at all (versus
putting a CHECK_FOR_INTERRUPTS into the loop), or whether it has to be
so tight; but this patch takes the narrower approach of just not applying
the MAX_LEVENSHTEIN_STRLEN limit to calls from the parser.
Trusting the parser for this seems reasonable, first because the strings
are limited to NAMEDATALEN which is unlikely to be hugely more than 256,
and second because the maximum distance is tightly constrained by
MAX_FUZZY_DISTANCE (though we'd forgotten to make use of that limit in one
place). That means the cost is not really O(mn) but more like O(max(m,n)).
Relaxing the limit for user-supplied calls is left for future research;
given the lack of complaints to date, it doesn't seem very high priority.
In passing, fix confusion between lengths-in-bytes and lengths-in-chars
in comments and error messages.
Per gripe from Kevin Day; solution suggested by Robert Haas. Back-patch
to 9.5 where the unwanted restriction was introduced.