Patch by David Rowley. Backpatch to 9.5, as some of the calls were new in
9.5, and keeping the code in sync with master makes future backpatching
easier.
It's called "missing_ok" in the docs and in the C code.
I refrained from doing a catversion bump for this, because the name of an
input argument is just documentation; it has no effect on any callers.
Michael Paquier
Commit 179cdd09 added macros to check if a filename is a WAL segment
or other such file. However there were still some instances of the
strlen + strspn combination to check for that in WAL-related utilities
like pg_archivecleanup. Those checks can be replaced with the macros.
This patch makes use of the macros in those utilities, which makes the
code a bit easier to read.
Back-patch to 9.5.
Michael Paquier
Free the contexts holding this data after we're done using it, by the
expedient of attaching them to the PostmasterContext which we were
already taking care to delete (and where, indeed, this data used to live
before commits e5e2fc842c and 7c45e3a3c6). This saves a
probably-usually-negligible amount of space per running backend. It also
avoids leaving potentially-security-sensitive data lying around in memory
in processes that don't need it. You'd have to be unusually paranoid to
think that that amounts to a live security bug, so I've not gone so far as
to forcibly zero the memory; but there surely isn't a good reason to keep
this data around.
Arguably this is a memory management bug in the aforementioned commits,
but it doesn't seem important enough to back-patch.
This function is documented to return a value in the range (0,1),
which is what its predecessor anl_random_fract() did. However, the
new version depends on pg_erand48() which returns a value in [0,1).
The possibility of returning zero creates hazards of division by zero
or trying to compute log(0) at some call sites, and it might well
break third-party modules using anl_random_fract() too. So let's
change it to never return zero. Spotted by Coverity.
Michael Paquier, cosmetically adjusted by me
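For illustration, a minimal standalone sketch of the retry approach, using
drand48() as a stand-in for pg_erand48() (which takes the caller's random
state in the real code):

    #include <stdlib.h>

    /*
     * Return a random value strictly in (0,1), given a generator that
     * returns values in [0,1): simply retry on zero.  The zero case is
     * rare, so the loop almost never iterates more than once.
     */
    static double
    random_fract(void)
    {
        double      r;

        do
            r = drand48();      /* stand-in for pg_erand48() */
        while (r == 0.0);

        return r;
    }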
XLogFileCopy() was changed heavily in commit de76884. However, it was
partially reverted in commit 7abc685 and most of those changes to
XLogFileCopy() were no longer needed. Then commit 7cbee7c removed
that unnecessary code, but XLogFileCopy() still looked different in master
and 9.4 even though the contents are almost the same.
This patch makes XLogFileCopy() look the same in master and back-branches,
which makes back-patching easier, per discussion on pgsql-hackers.
Back-patch to 9.5.
Discussion: 55760844.7090703@iki.fi
Michael Paquier
After calling XLogInitBufferForRedo(), the page might be all-zeros if it was
not in page cache already. btree_xlog_unlink_page initialized the page
correctly, but it called PageGetSpecialPointer before initializing it, which
would lead to a corrupt page at WAL replay, if the unlinked page is not in
page cache.
Backpatch to 9.4; the bug came with the rewrite of B-tree page deletion.
I broke this with my WAL format refactoring patch. Before that, the metapage
was read from disk, and modified in-place regardless of the LSN. That was
always a bit silly, as there's no need to read the old page version from
disk when we're overwriting it anyway. So that was changed in 9.5, but
I failed to add a GinInitPage call to initialize the page-headers correctly.
Usually you wouldn't notice, because the metapage is already in the page
cache and is not zeroed.
One way to reproduce this is to perform a VACUUM on an already vacuumed
table (so that the vacuum has no real work to do), immediately after a
checkpoint, and then perform an immediate shutdown. After recovery, the
page headers of the metapage will be incorrectly all-zeroes.
Reported by Jeff Janes
Avoid memory leak from incorrect choice of how to free a StringInfo
(resetStringInfo doesn't do it). Now that pg_split_opts doesn't scribble
on the optstr, mark that as "const" for clarity. Attach the commentary in
protocol.sgml to the right place, and add documentation about the
user-visible effects of this change on postgres' -o option and libpq's
PGOPTIONS option.
_Asm_sched_fence() is just a compiler barrier, not a memory barrier. But
spinlock release on IA64 needs, at the very least, release
semantics. Use a full barrier instead.
This might be the cause for the occasional failures on buildfarm member
anole.
Discussion: 20150629101108.GB17640@alap3.anarazel.de
As first committed, this view reported on the file contents as they were
at the last SIGHUP event. That's not as useful as reporting on the current
contents, and what's more, it didn't work right on Windows unless the
current session had serviced at least one SIGHUP. Therefore, arrange to
re-read the files when pg_show_all_settings() is called. This requires
only minor refactoring so that we can pass changeVal = false to
set_config_option() so that it won't actually apply any changes locally.
In addition, add error reporting so that errors that would prevent the
configuration files from being loaded, or would prevent individual settings
from being applied, are visible directly in the view. This makes the view
usable for pre-testing whether edits made in the config files will have the
desired effect, before one actually issues a SIGHUP.
I also added an "applied" column so that it's easy to identify entries that
are superseded by later entries; this was the main use-case for the original
design, but it seemed unnecessarily hard to use for that.
Also fix a 9.4.1 regression that allowed multiple entries for a
PGC_POSTMASTER variable to cause bogus complaints in the postmaster log.
(The issue here was that commit bf007a27ac unintentionally reverted
3e3f65973a, which suppressed any duplicate entries within
ParseConfigFp. However, since the original coding of the pg_file_settings
view depended on such suppression *not* happening, we couldn't have fixed
this issue now without first doing something with pg_file_settings.
Now we suppress duplicates by marking them "ignored" within
ProcessConfigFileInternal, which doesn't hide them in the view.)
Lesser changes include:
Drive the view directly off the ConfigVariable list, instead of making a
basically-equivalent second copy of the data. There's no longer any need
to hang onto the data permanently, anyway.
Convert show_all_file_settings() to do its work in one call and return a
tuplestore; this avoids risks associated with assuming that the GUC state
will hold still over the course of query execution. (I think there were
probably latent bugs here, though you might need something like a cursor
on the view to expose them.)
Arrange to run SIGHUP processing in a short-lived memory context, to
forestall process-lifespan memory leaks. (There is one known leak in this
code, in ProcessConfigDirectory; it seems minor enough to not be worth
back-patching a specific fix for.)
Remove mistaken assignment to ConfigFileLineno that caused line counting
after an include_dir directive to be completely wrong.
Add missed failure check in AlterSystemSetConfigFile(). We don't really
expect ParseConfigFp() to fail, but that's not an excuse for not checking.
When archive recovery and restartpoints were initially introduced,
checkpoint_segments was ignored on the grounds that the files restored from
archive don't consume any space in the recovery server. That was changed in
later releases, but even then it was arguably a feature rather than a bug,
as performing restartpoints as often as checkpoints during normal operation
might be excessive, but you might nevertheless not want to waste a lot of
space for pre-allocated WAL by setting checkpoint_segments to a high value.
But now that we have separate min_wal_size and max_wal_size settings, you
can bound WAL usage with max_wal_size, and still avoid consuming excessive
space by setting min_wal_size to a lower value, so that argument is
moot.
There are still some issues with actually limiting the space usage to
max_wal_size: restartpoints in recovery can only start after seeing the
checkpoint record, while a checkpoint starts flushing buffers as soon as
the redo-pointer is set. Restartpoints are paced to happen at the same
leisurely speed, determined by checkpoint_completion_target, as checkpoints,
but because they are started later, max_wal_size can be exceeded by up to
one checkpoint cycle's worth of WAL, depending on
checkpoint_completion_target. But that seems better than not trying at all,
and max_wal_size is a soft limit anyway.
The documentation already claimed that max_wal_size is obeyed in recovery,
so this just fixes the behaviour to match the docs. However, add some
weasel-words there to mention that max_wal_size may well be exceeded by
some amount in recovery.
Seems like cheap insurance for WAL bugs. A spurious call to
XLogBeginInsert() in itself would be fairly harmless, but if there is any
data registered and the insertion is not completed/cancelled properly, there
is a risk that the data ends up in a wrong WAL record.
Per Jeff Janes's suggestion.
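For illustration, a simplified sketch of the kind of sanity check involved
(the real state lives in xloginsert.c):

    #include <assert.h>
    #include <stdbool.h>

    static bool begininsert_called = false;

    /* A second begin without an intervening finish/cancel trips the check. */
    static void
    xlog_begin_insert(void)
    {
        assert(!begininsert_called);
        begininsert_called = true;
    }

    /* Called when the record is inserted or the insertion is cancelled. */
    static void
    xlog_end_insert(void)
    {
        begininsert_called = false;
    }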
If data checksums or wal_log_hints is on, and a GIN page is split, the code
to find a new, empty, block was called after having already called
XLogBeginInsert(). That causes an assertion failure or PANIC, if finding the
new block involves updating a FSM page that had not been modified since last
checkpoint, because that update is WAL-logged, which calls XLogBeginInsert
again. Nested XLogBeginInsert calls are not supported.
To fix, rearrange GIN code so that XLogBeginInsert is called later, after
finding the victim buffers.
Reported by Jeff Janes.
If a file is removed from the source server, while pg_rewind is running, the
invocation of pg_read_binary_file() will fail. Use the just-added missing_ok
option to that function, to have it return NULL instead, and handle that
gracefully. And similarly for pg_ls_dir and pg_stat_file.
Reported by Fujii Masao, fix by Michael Paquier.
This makes it possible to use the functions without getting errors, if there
is a chance that the file might be removed or renamed concurrently.
pg_rewind needs to do just that, although this could be useful for other
purposes too. (The changes to pg_rewind to use these functions will come in
a separate commit.)
The read_binary_file() function isn't very well-suited for extensions.c's
purposes anymore, if it ever was. So bite the bullet and make a copy of it
in extension.c, tailored for that use case. This seems better than the
accidental code reuse, even if it means a few more lines of code.
Michael Paquier, with plenty of kibitzing by me.
A few places assumed they could pass NULL for the argtypes array when
looking up functions known to have zero arguments. At first glance
it seems that this should be safe enough, since memcmp() is surely not
allowed to fetch any bytes if its count argument is zero. However,
close reading of the C standard says that such calls have undefined
behavior, so we'd probably best avoid it.
Since the number of places doing this is quite small, and some other
places looking up zero-argument functions were already passing dummy
arrays, let's standardize on the latter solution rather than hacking
the function lookup code to avoid calling memcmp() in these cases.
I also added Asserts to catch any future violations of the new rule.
Given the utter lack of any evidence that this actually causes any
problems in the field, I don't feel a need to back-patch this change.
Per report from Piotr Stefaniak, though this is not his patch.
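A standalone illustration of the convention (names here are illustrative,
not the actual lookup code):

    #include <string.h>

    typedef unsigned int Oid;

    /* Compare two argument-type lists, as a signature lookup might. */
    static int
    args_equal(const Oid *a, const Oid *b, int nargs)
    {
        return memcmp(a, b, nargs * sizeof(Oid)) == 0;
    }

    int
    main(void)
    {
        Oid         dummy[1];   /* valid pointer, contents never read */

        /* args_equal(NULL, NULL, 0) would be undefined behavior per the
         * C standard; passing a dummy array is always well-defined. */
        return args_equal(dummy, dummy, 0) ? 0 : 1;
    }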
Commit b89e151054 added the
ResolveCminCmaxDuringDecoding declaration to tqual.h, which uses an
HTAB parameter, without declaring HTAB. It happens to build anyway
with current sources, because a suitable declaration happens to be
included, directly or indirectly, before tqual.h is first included in
every source file that currently uses it; but we shouldn't count on
that. Since an opaque declaration is enough here, just use that, as
was done in snapmgr.h.
Backpatch to 9.4, where the HTAB reference was added to tqual.h.
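The fix amounts to an opaque struct declaration, roughly like this
(simplified prototype; the real function takes more parameters):

    /* Opaque declaration: enough for pointer parameters, with no
     * definition of struct HTAB required in this header. */
    struct HTAB;

    extern void ResolveCminCmaxDuringDecoding(struct HTAB *tuplecid_data);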
VACUUM FREEZE generated false cancellations of standby queries on an
otherwise idle master. Caused by an off-by-one error on cutoff_xid
that goes back to the original commit.
Backpatch to all versions 9.0+
Analysis and report by Marco Nenciarini
Bug fix by Simon Riggs
Commit b488c580ae, which added the DDL command collection feature,
neglected to update the code that commit cac7658205 had previously
added two weeks earlier for the TRANSFORM feature.
Reported by Michael Paquier.
There was a confusion about which block number to use when storing an
item's pointer in the revmap -- the revmap page's blkno was being used,
not the data page's blkno.
Spotted-by: Jeff Janes
Don't apply rmtree(), which will gleefully remove an entire subtree,
and don't even apply unlink() unless it's a symlink or a directory,
the only things that we expect to find.
Amit Kapila, with minor tweaks by me, per extensive discussions
involving Andrew Dunstan, Fujii Masao, and Heikki Linnakangas,
at least some of whom also reviewed the code.
Spotted by Coverity and reported by Michael Paquier. Per discussion,
we don't necessarily care about making Coverity happy in all such
instances, but we can go ahead and change them where it otherwise
seems to improve the code.
This was essentially "broken" since 0c8eda62; but until more
recently (14e8803f) barriers usage in signal handlers was infrequent.
The failure to be reentrant was noticed because test_shm_mq, which
uses memory barriers at a high frequency, occasionally got stuck on some
Solaris buildfarm animals. Turns out, those machines use Sun Studio
12.1, which doesn't yet have efficient memory barrier support. A machine
with a newer Sun Studio did not fail. Forcing the barrier fallback to
be used on x86 made the problem reproducible.
The new fallback is to use kill(PostmasterPid, 0), based on the theory
that that will always imply a barrier, due to checking the liveness of
PostmasterPid, on systems old enough to need fallback support. It's hard
to come up with a good and performant fallback.
I'm not backpatching this for now - the problem isn't active in the back
branches, and we haven't backpatched barrier changes for
now. Additionally master looks entirely different than the back branches
due to the new atomics abstraction. It seems better to let this rest in
master, where the non-reentrancy actively causes a problem, and then
consider backpatching.
Found-By: Robert Haas
Discussion: 55626265.3060800@dunslane.net
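A sketch of the fallback's shape, assuming PostmasterPid has been set at
startup as usual:

    #include <signal.h>
    #include <sys/types.h>

    extern pid_t PostmasterPid;     /* set during process startup */

    /*
     * Fallback memory barrier: issue a cheap syscall, on the theory that
     * the kernel fully serializes around it.  Signal 0 performs only an
     * existence/permission check; no signal is actually delivered.
     */
    static void
    fallback_memory_barrier(void)
    {
        kill(PostmasterPid, 0);
    }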
Allow CustomPath to have a list of paths, CustomPlan a list of plans,
and CustomPlanState a list of planstates known to the core system, so
that custom path/plan providers can more reasonably use this
infrastructure for nodes with multiple children.
KaiGai Kohei, per a design suggestion from Tom Lane, with some
further kibitzing by me.
1. Replay of the WAL record for setting a bit in the visibility map
contained an assertion that a full-page image of that record type can only
occur with checksums enabled. But it can also happen with wal_log_hints, so
remove the assertion. Unlike checksums, wal_log_hints can be changed on the
fly, so it would be complicated to figure out if it was enabled at the time
that the WAL record was generated.
2. wal_log_hints has the same effect on the locking needed to read the LSN
of a page as data checksums. BufferGetLSNAtomic() didn't get the memo.
Backpatch to 9.4, where wal_log_hints was added.
Commit f3b5565dd4 was a couple of bricks shy
of a load; specifically, it missed putting pg_trigger_tgrelid_tgname_index
into the relcache init file, because that index is not used by any
syscache. However, we have historically nailed that index into cache for
performance reasons. The upshot was that load_relcache_init_file always
decided that the init file was busted and silently ignored it, resulting
in a significant hit to backend startup speed.
To fix, reinstantiate RelationIdIsInInitFile() as a wrapper around
RelationSupportsSysCache(), which can know about additional relations
that should be in the init file despite being unknown to syscache.c.
Also install some guards against future mistakes of this type: make
write_relcache_init_file Assert that all nailed relations get written to
the init file, and make load_relcache_init_file emit a WARNING if it takes
the "wrong number of nailed relations" exit path. Now that we remove the
init files during postmaster startup, that case should never occur in the
field, even if we are starting a minor-version update that added or removed
rels from the nailed set. So the warning shouldn't ever be seen by end
users, but it will show up in the regression tests if somebody breaks this
logic.
Back-patch to all supported branches, like the previous commit.
Commit c03ad5602f introduced a planner
performance regression for UPDATE/DELETE on large inheritance sets.
It required copying the append_rel_list (which is of size proportional to
the number of inherited tables) once for each inherited table, thus
resulting in O(N^2) time and memory consumption. While it's difficult to
avoid that in general, the extra work only has to be done for
append_rel_list entries that actually reference subquery RTEs, which
inheritance-set entries will not. So we can buy back essentially all of
the loss in cases without subqueries in FROM; and even for those, the added
work is mainly proportional to the number of UNION ALL subqueries.
Back-patch to 9.2, like the previous commit.
Tom Lane and Dean Rasheed, per a complaint from Thomas Munro.
This supplements the GNU libc bug #6530 workarounds introduced in commit
54cd4f0457. On affected systems, a
tar-format pg_basebackup failed when some filename beneath the data
directory was not valid character data in the postmaster/walsender
locale. Back-patch to 9.1, where pg_basebackup was introduced. Extant,
bug-prone conversion specifications receive only ASCII bytes or involve
low-importance messages.
Previously autovacuum was not necessarily triggered if space in the
members slru got tight. The first problem was that the signalling was
tied to values in the offsets slru, but members can advance much
faster. That's especially a problem if old sessions have been around that
previously prevented the multixact horizon from increasing. Secondly, the
skipping logic doesn't work if the database was restarted after
autovacuum was triggered - that knowledge is not preserved across
restarts. This is especially a problem because it's a common panic
reaction to restart the database if it gets slow due to anti-wraparound
vacuums.
Fix the first problem by separating the logic for members from
offsets. Trigger autovacuum whenever a multixact crosses a segment
boundary; since the current member offset increases in irregular steps,
we can't use simple modulo logic as we do for offsets. Add a stopgap for
the second problem, by signalling autovacuum whenever ERRORing out
because of boundaries.
Discussion: 20150608163707.GD20772@alap3.anarazel.de
Backpatch into 9.3, where it became more likely that multixacts wrap
around.
9a20a9b2 added a new elog(), enabled when WAL_DEBUG is defined. The
other WAL_DEBUG-dependent messages check the wal_debug GUC, but this
one did not. While at it, replace 'upto' with 'up to'.
Discussion: 20150610110253.GF3832@alap3.anarazel.de
Backpatch to 9.4, the first release containing 9a20a9b2.
POSIX permits setlocale() calls to invalidate any previous setlocale()
return values, but commit 5f538ad004
neglected to account for setlocale(LC_CTYPE, NULL) doing so. The effect
was to set the LC_CTYPE environment variable to an unintended value.
pg_perm_setlocale() sets this variable to assist PL/Perl; without it,
Perl would undo PostgreSQL's locale settings. The known-affected
configurations are 32-bit, release builds using Visual Studio 2012 or
Visual Studio 2013. Visual Studio 2010 is unaffected, as were all
buildfarm-attested configurations. In principle, this bug could leave
the wrong LC_CTYPE in effect after PL/Perl use, which could in turn
facilitate problems like corrupt tsvector datums. No known platform
experiences that consequence, because PL/Perl on Windows does not use
this environment variable.
The bug has been user-visible, as early postmaster failure, on systems
with Windows ANSI code page set to CP936 for "Chinese (Simplified, PRC)"
and probably on systems using other multibyte code pages.
(SetEnvironmentVariable() rejects values containing character data not
valid under the Windows ANSI code page.) Back-patch to 9.4, where the
faulty commit first appeared.
Reported by Didi Hu and 林鹏程. Reviewed by Tom Lane, though this fix
strategy was not his first choice.
This adjusts commit 82233ce7ea so that the
postmaster does not exit until all its child processes have exited, even
if the 5-second timeout elapses and we have to send SIGKILL. There is no
great value in having the postmaster process quit sooner, and doing so can
mislead onlookers into thinking that the cluster is fully terminated when
actually some child processes still survive.
This effect might explain recent test failures on buildfarm member hamster,
wherein we failed to restart a cluster just after shutting it down with
"pg_ctl stop -m immediate".
I also did a bit of code review/beautification, including fixing a faulty
use of the Max() macro on a volatile expression.
Back-patch to 9.4. In older branches, the postmaster never waited for
children to exit during immediate shutdowns, and changing that would be
too much of a behavioral change.
This avoids the problem that it might go to sleep for an unreasonable
amount of time in unusual conditions, like the server clock moving
backwards by a large amount.
(Simply moving the server clock forward again doesn't solve the problem
unless you wake up the autovacuum launcher manually, say by sending it
SIGHUP).
Per trouble report from Prakash Itnal in
https://www.postgresql.org/message-id/CAHC5u79-UqbapAABH2t4Rh2eYdyge0Zid-X=Xz-ZWZCBK42S0Q@mail.gmail.com
Analyzed independently by Haribabu Kommi and Tom Lane.
Since find_multixact_start() relies on SimpleLruDoesPhysicalPageExist(),
and that function looks only at the on-disk state, it's possible for it
to fail to find a page that exists in the in-memory SLRU that has not
been written yet. If that happens, SetOffsetVacuumLimit() will
erroneously decide to force emergency autovacuuming immediately.
We should probably fix find_multixact_start() to consider the data
cached in memory as well as the on-disk state, but that's no excuse
for SetOffsetVacuumLimit() to be stupid about the case where it can
no longer read the value after having previously succeeded in doing so.
Report by Andres Freund.
POSIX permits setlocale() calls to invalidate any previous setlocale()
return values. Commit 5f538ad004
neglected to account for that. In advance of fixing that bug, switch to
failing hard on affected configurations. This is a planned temporary
commit to assay buildfarm-represented configurations.
jsonb_set() and other clients of the setPathArray() utility function
could get spurious results when an array integer subscript is provided
that is not within the range of int.
To fix, ensure that the value returned by strtol() within setPathArray()
is within the range of int; when it isn't, assume an invalid input in
line with existing, similar cases. The path-orientated operators that
appeared in PostgreSQL 9.3 and 9.4 do not call setPathArray(), and
already independently take this precaution, so no change there.
Peter Geoghegan
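A minimal sketch of the added range check (names are illustrative, not the
actual ones in setPathArray()):

    #include <errno.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /*
     * Parse an array subscript: accept it only if strtol() consumed the
     * whole string and the result also fits in an int.
     */
    static bool
    parse_subscript(const char *str, int *idx)
    {
        char       *endptr;
        long        lindex;

        errno = 0;
        lindex = strtol(str, &endptr, 10);
        if (endptr == str || *endptr != '\0' || errno == ERANGE ||
            lindex > INT_MAX || lindex < INT_MIN)
            return false;       /* treat as invalid input */

        *idx = (int) lindex;
        return true;
    }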
In commit 9e3ad1aac5 I modified plpgsql
to use exec_stmt_return's simple-variables fast path in more cases.
However, I overlooked that there are really two different return
conventions in use here, depending on whether estate->retistuple is true,
and the existing fast-path code had only bothered to handle one of them.
So trying to return a scalar in a function returning composite, or vice
versa, could lead to unexpected error messages (typically "cache lookup
failed for type 0") or to a null-pointer-dereference crash.
In the DTYPE_VAR case, we can just throw error if retistuple is true,
corresponding to what happens in the general-expression code path that was
being used previously. (Perhaps someday both of these code paths should
attempt a coercion, but today is not that day.)
In the REC and ROW cases, just hand the problem to exec_eval_datum()
when not retistuple. Also clean up the ROW coding slightly so it looks
more like exec_eval_datum().
The previous commit also caused exec_stmt_return_next() to be used in
more cases, but that code seems to be OK as-is.
Per off-list report from Serge Rielau. This bug is new in 9.5 so no need
to back-patch.
We already tried to improve this once, but the "improved" text was rather
off-target if you had provided a USING clause. Also, it seems helpful
to provide the exact text of a suggested USING clause, so users can just
copy-and-paste it when needed. Per complaint from Keith Rarick and a
suggestion from Merlin Moncure.
Back-patch to 9.2 where the current wording was adopted.
After the archiver dies, postmaster tries to start a new one immediately.
But previously this could happen only while the server was running
normally, even though archiving could be enabled at all times (i.e.,
archive_mode was set to always). So an archiver running during recovery
could not restart soon after it died. This is an oversight in commit
ffd3774.
This commit changes reaper(), postmaster's signal handler for cleaning
up after a child process dies, so that it tries to start a new archiver
even during recovery if necessary.
Patch by me. Review by Alvaro Herrera.
System catalogs and views should be listed alphabetically
in catalog.sgml, but the pg_file_settings view was not.
This patch also fixes typos in the pg_file_settings comments.
RMGRDESCSOURCES is defined and used only in the pg_xlogdump Makefile,
but the pg_rewind Makefile listed it among the extra files to remove in
"make clean". This patch removes that useless mention from the pg_rewind
Makefile.
Michael Paquier
Following recent discussion on -hackers. The underlying function is
also renamed to jsonb_delete_path. The regression tests now don't need
ugly type casts to avoid the ambiguity, so they are also removed.
Catalog version bumped.
* Remove invalid option character "N" from the third argument (valid option
string) of getopt_long().
* Use pg_free() or pfree() to free the memory allocated by pg_malloc() or
palloc() instead of always using free().
* Assume problem is no disk space if write() fails but doesn't set errno.
* Fix several typos.
Patch by me. Review by Michael Paquier.
We don't know why a few Windows users have seen this fail, but the
taciturnity of the error message certainly isn't helping debug it.
Let's at least find out which LC category isn't working.
* Remove unused argument "dstfname" and related code from XLogFileCopy().
* Previously XLogFileCopy() returned a pstrdup'd string so that
InstallXLogFileSegment() could use it later. Since the pstrdup'd string
was never free'd, there was a risk of a memory leak. It was almost
harmless because the startup process exited just after calling
XLogFileCopy(), but still. This commit changes XLogFileCopy() so that it
directly calls InstallXLogFileSegment() and doesn't call pstrdup() at
all, which fixes the memory leak.
* Extend InstallXLogFileSegment() so that the caller can specify the log
level, which allows us to emit an error when InstallXLogFileSegment()
fails a disk file access like link() or rename(). Previously such
failures were always logged at LOG level, and additionally had to be
logged at ERROR level when we wanted to treat them as errors.
Michael Paquier
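The log-level parameter follows a familiar pattern; a standalone sketch,
with fprintf()/exit() standing in for ereport():

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define LOG_LEVEL    0      /* illustrative stand-ins, not the */
    #define ERROR_LEVEL  1      /* real elog level values          */

    /*
     * Install a file segment; the caller chooses whether a failed link()
     * is merely logged (LOG) or aborts (ERROR).
     */
    static int
    install_segment(const char *tmppath, const char *path, int elevel)
    {
        if (link(tmppath, path) < 0)
        {
            fprintf(stderr, "could not link \"%s\" to \"%s\": %s\n",
                    tmppath, path, strerror(errno));
            if (elevel >= ERROR_LEVEL)
                exit(1);        /* ereport(ERROR) throws in the real code */
            return 0;
        }
        return 1;
    }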
HotStandbyActiveInReplay, introduced in 061b079f, only allowed WAL
replay to happen in the startup process, missing the single user case.
This buglet is fairly harmless as it only causes problems when single
user mode in an assertion enabled build is used to replay a btree vacuum
record.
Backpatch to 9.2. 061b079f was backpatched further, but the assertion
was not.
Supporting deletion of JSON pairs within jsonb objects using an
array-style integer subscript allowed for surprising outcomes. This was
mostly due to the implementation-defined ordering of pairs within
objects for jsonb.
It also seems desirable to make jsonb integer subscript deletion
consistent with the 9.4 era general purpose integer subscripting
operator for jsonb (although that operator returns NULL when an object
is encountered, while we prefer here to throw an error).
Peter Geoghegan, following discussion on -hackers.
When we invalidate the relcache entry for a system catalog or index, we
must also delete the relcache "init file" if the init file contains a copy
of that rel's entry. The old way of doing this relied on a specially
maintained list of the OIDs of relations present in the init file: we made
the list either when reading the file in, or when writing the file out.
The problem is that when writing the file out, we included only rels
present in our local relcache, which might have already suffered some
deletions due to relcache inval events. In such cases we correctly decided
not to overwrite the real init file with incomplete data --- but we still
used the incomplete initFileRelationIds list for the rest of the current
session. This could result in wrong decisions about whether the session's
own actions require deletion of the init file, potentially allowing an init
file created by some other concurrent session to be left around even though
it's been made stale.
Since we don't support changing the schema of a system catalog at runtime,
the only likely scenario in which this would cause a problem in the field
involves a "vacuum full" on a catalog concurrently with other activity, and
even then it's far from easy to provoke. Remarkably, this has been broken
since 2002 (in commit 7863404417), but we had
never seen a reproducible test case until recently. If it did happen in
the field, the symptoms would probably involve unexpected "cache lookup
failed" errors to begin with, then "could not open file" failures after the
next checkpoint, as all accesses to the affected catalog stopped working.
Recovery would require manually removing the stale "pg_internal.init" file.
To fix, get rid of the initFileRelationIds list, and instead consult
syscache.c's list of relations used in catalog caches to decide whether a
relation is included in the init file. This should be a tad more efficient
anyway, since we're replacing linear search of a list with ~100 entries
with a binary search. It's a bit ugly that the init file contents are now
so directly tied to the catalog caches, but in practice that won't make
much difference.
Back-patch to all supported branches.
Not sure how "//XXX" got into a committed patch in the first place,
as it's both content-free and against project style. pgindent made a
bit of a hash of it, too.
Going forward, we should have at least one buildfarm member using
"gcc -ansi" to catch such things, at least till such time as we
decide the project target language isn't C90 any more. I've turned
this option on on dromedary.
We should set MyProc->databaseId after acquiring the per-database lock,
not beforehand. The old way risked deadlock against processes trying to
copy or delete the target database, since they would first acquire the lock
and then wait for processes with matching databaseId to exit; that left a
window wherein an incoming process could set its databaseId and then block
on the lock, while the other process had the lock and waited in vain for
the incoming process to exit.
CountOtherDBBackends() would time out and fail after 5 seconds, so this
just resulted in an unexpected failure not a permanent lockup, but it's
still annoying when it happens. A real-world example of a use-case is that
short-duration connections to a template database should not cause CREATE
DATABASE to fail.
Doing it in the other order should be fine since the contract has always
been that processes searching the ProcArray for a database ID must hold the
relevant per-database lock while searching. Thus, this actually removes
the former race condition that required an assumption that storing to
MyProc->databaseId is atomic.
It's been like this for a long time, so back-patch to all active branches.
Recent commits, mainly b69bf30b9b and
53bb309d2d, introduced mechanisms to
protect against wraparound of the MultiXact member space: the number
of multixacts that can exist at one time is limited to 2^32, but the
total number of members in those multixacts is also limited to 2^32,
and older code did not take care to enforce the second limit,
potentially allowing old data to be overwritten while it was still
needed.
Unfortunately, these new mechanisms failed to account for the fact
that the code paths in which they run might be executed during
recovery or while the cluster was in an inconsistent state. Also,
they failed to account for the fact that users who used pg_upgrade
to upgrade a PostgreSQL version between 9.3.0 and 9.3.4 might have
oldestMultiXid = 1 in the control file despite the true value
being larger.
To fix these problems, first, avoid unnecessarily examining the
members of MultiXacts when the cluster is not known to be consistent.
TruncateMultiXact has done this for a long time, and this patch does
not fix that. But the new calls used to prevent member wraparound
are not needed until we reach normal running, so avoid calling them
earlier. (SetMultiXactIdLimit is actually called before InRecovery
is set, so we can't rely on that; we invent our own multixact-specific
flag instead.)
Second, make failure to look up the members of a MultiXact a non-fatal
error. Instead, if we're unable to determine the member offset at
which wraparound would occur, postpone arming the member wraparound
defenses until we are able to do so. If we're unable to determine the
member offset that should force autovacuum, force it continuously
until we are able to do so. If we're unable to determine the member
offset at which we should truncate the members SLRU, log a message and
skip truncation.
An important consequence of these changes is that anyone who does have
a bogus oldestMultiXid = 1 value in pg_control will experience
immediate emergency autovacuuming when upgrading to a release that
contains this fix. The release notes should highlight this fact. If
a user has no pg_multixact/offsets/0000 file, but has oldestMultiXid = 1
in the control file, they may wish to vacuum any tables with
relminmxid = 1 prior to upgrading in order to avoid an immediate
emergency autovacuum after the upgrade. This must be done with a
PostgreSQL version 9.3.5 or newer and with vacuum_multixact_freeze_min_age
and vacuum_multixact_freeze_table_age set to 0.
This patch also adds an additional log message at each database server
startup, indicating either that protections against member wraparound
have been engaged, or that they have not. In the latter case, once
autovacuum has advanced oldestMultiXid to a sane value, the message
indicating that the guards have been engaged will appear at the next
checkpoint. A few additional messages have also been added at the DEBUG1
level so that the correct operation of this code can be properly audited.
Along the way, this patch fixes another, related bug in TruncateMultiXact
that has existed since PostgreSQL 9.3.0: when no MultiXacts exist at
all, the truncation code looks up NextMultiXactId, which doesn't exist
yet. This can lead to TruncateMultiXact removing every file in
pg_multixact/offsets instead of keeping one around, as it should.
This in turn will cause the database server to refuse to start
afterwards.
Patch by me. Review by Álvaro Herrera, Andres Freund, Noah Misch, and
Thomas Munro.
This reverts commit 5cdf25e168,
which was almost immediately proven insufficient by the buildfarm.
On second thought, the tables involved are not large enough that
autovacuum or autoanalyze would notice them; what seems far more
likely to be the culprit is the database-wide "vacuum analyze"
in the concurrent gist test. That thing has given us one headache
too many, so get rid of it in favor of targeted vacuuming of that
test's own tables only.
Verify that the number of matches is exactly what it should be, not just
that it not be zero. This should help us detect any environment-dependent
issues.
Also, verify that we're getting the expected type of scan plan (either
bitmap or seqscan as appropriate). Right now, this is failing on the
cidrcol test cases, as shown in the output file. I'll look into that
in a bit, but it seems good to commit this as-is temporarily to verify
that it behaves as expected on the buildfarm.
Casting to char, without quotes, does not give the same results as casting
to "char". That meant we were not testing the brin "char" paths at all,
since we ended up with a text operator not a "char" operator.
This test used seqscans on tenk1, with LIMIT, to build test data.
That works most of the time, but if the synchronized-seqscan logic
kicks in, we get varying test data. This seems likely to explain
the erratic test failures on buildfarm member chipmunk, which uses
smaller-than-default shared_buffers. To fix, add ORDER BY clauses to
force the ordering to be what it was implicitly being assumed to be.
Peter Geoghegan had noticed this with respect to one of the trouble
spots, though not the ones actually causing the chipmunk issue.
Some recent buildfarm failures can be explained by supposing that
autovacuum or autoanalyze fired on the tables created by this test,
resulting in plan changes. Do a proactive VACUUM ANALYZE on the
test's principal tables to try to forestall such changes.
Commit c22ed3d523 turned
the -i/--ignore-version options into no-ops and marked them as deprecated.
Considering we shipped that in 8.4, it's time to remove all trace of
those switches, per discussion. We'd still have to wait a couple releases
before it'd be safe to use -i for something else, but it'd be a start.
add_path_precheck was doing exact comparisons of path costs, but it really
needs to do them fuzzily to be sure it won't reject paths that could
survive add_path's comparisons. (This can only matter if the initial cost
estimate is very close to the final one, but that turns out to often be
true.)
Also, it should ignore startup cost for this purpose if and only if
compare_path_costs_fuzzily would do so. The previous coding always ignored
startup cost for parameterized paths, which is wrong as of commit
3f59be836c555fa6; it could result in improper early rejection of paths that
we care about for SEMI/ANTI joins. It also always considered startup cost
for unparameterized paths, which is just as wrong though the only effect is
to waste planner cycles on paths that can't survive. Instead, it should
consider startup cost only when directed to by the consider_startup/
consider_param_startup relation flags.
Likewise, compare_path_costs_fuzzily should have symmetrical behavior
for parameterized and unparameterized paths. In this case, the best
answer seems to be that after establishing that total costs are fuzzily
equal, we should compare startup costs whether or not the consider_xxx
flags are on. That is what it's always done for unparameterized paths,
so let's make the behavior for parameterized paths match.
These issues were noted while developing the SEMI/ANTI join costing fix
of commit 3f59be836c, but we chose not to back-patch these fixes,
because they can cause changes in the planner's choices among
nearly-same-cost plans. (There is in fact one minor change in plan choice
within the core regression tests.) Destabilizing plan choices in back
branches without very clear improvements is frowned on, so we'll just fix
this in HEAD.
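For reference, a simplified sketch of what "fuzzily" means here, assuming
the planner's usual 1% fuzz factor:

    #include <stdbool.h>

    #define FUZZ_FACTOR 1.01    /* treat costs within 1% as equal */

    /* path1 is considered cheaper only if it beats path2 by more than
     * the fuzz factor; near-ties must not cause early rejection. */
    static bool
    clearly_cheaper(double cost1, double cost2)
    {
        return cost1 < cost2 / FUZZ_FACTOR;
    }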
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
The function is given a fourth parameter, which defaults to true. When
this parameter is true, if the last element of the path is missing
in the original json, jsonb_set creates it in the result and assigns it
the new value. If it is false then the function does nothing unless all
elements of the path are present, including the last.
Based on some original code from Dmitry Dolgov, heavily modified by me.
Catalog version bumped.
The fsync code from the backend essentially assumes that somebody's already
validated PGDATA, at least to the extent of it being a readable directory.
That's safe enough for initdb's normal code path too, but "initdb -S"
doesn't have any other processing at all that touches the target directory.
To have reasonable error-case behavior, add a pg_check_dir call.
Per gripe from Peter E.
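A minimal sketch of such a guard, assuming pg_check_dir()'s convention
that 0 means the directory does not exist and -1 means it could not be
examined:

    #include <stdio.h>
    #include <stdlib.h>

    extern int pg_check_dir(const char *dir);   /* src/port/pgcheckdir.c */

    /* Make sure the target is at least a readable directory before
     * trying to fsync its contents. */
    static void
    check_data_dir(const char *pg_data)
    {
        switch (pg_check_dir(pg_data))
        {
            case 0:
                fprintf(stderr, "directory \"%s\" does not exist\n",
                        pg_data);
                exit(1);
            case -1:
                fprintf(stderr, "could not access directory \"%s\"\n",
                        pg_data);
                exit(1);
            default:
                break;          /* exists and is readable */
        }
    }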
The argument that this is a sufficiently-expected case to be silently
ignored seems pretty thin. Andres had brought it up back when we were
still considering that most fsync failures should be hard errors, and it
probably would be legit not to fail hard for ETXTBSY --- but the same is
true for EROFS and other cases, which is why we gave up on hard failures.
ETXTBSY is surely not a normal case, so logging the failure seems fine
from here.
opr_sanity.sql has a test checking that relevant properties of built-in
functions match when the same C function is referenced by multiple pg_proc
entries. The test neglected to check proleakproof, though, and when
I added that condition it exposed that xideqint4 hadn't been updated to
match xideq. So fix that as well, and in consequence bump catversion.
This isn't very critical, so no need to worry about fixing back branches.
Make initdb's version of this logic look as much like the backend's
as possible. This is much less critical than in the backend since not
so many people use "initdb -S", but we want the same corner-case error
handling in both cases.
Back-patch to 9.3 where initdb -S option was introduced. Before that,
initdb only had to deal with freshly-created data directories, wherein
no failures should be expected.
Abhijit Menon-Sen
This undoes a poorly-thought-out choice in commit 970a18687f, namely
to export guc.c's internal variable data_directory. The authoritative
variable so far as C code is concerned is DataDir; there is no reason for
anything except specific bits of GUC code to look at the GUC variable.
After yesterday's commits fixing the fsync-on-restart patch, the only
remaining misuse of data_directory was in AlterSystemSetConfigFile(),
which would be much better off just using a relative path anyhow: it's
less code and it doesn't break if the DBA moves the data directory of a
running system, which is a case we've taken some pains over in the past.
This is mostly cosmetic, so no need for a back-patch (and I'd be hesitant
to remove a global variable in stable branches anyway).
Commit 2ce439f337 introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
The previous coding suffered a null-pointer dereference if it found any
symlink at the top level of $PGDATA. Fix that, and teach it to recurse
into a symlink for pg_xlog, but not anything else.
Per note from Abhijit Menon-Sen.
Ensure that we null-terminate the result string (one place in pg_rewind).
Be paranoid about out-of-range results from readlink() (should not happen,
but there is no good reason for some call sites to be careful about it and
others not). Consistently use the whole buffer, not sometimes one byte
less. Ensure we emit an appropriate errcode() in all cases. Spell the
error messages the same way.
The only serious bug here is the missing null-termination in pg_rewind,
which is new code, so no need for a back-patch.
Abhijit Menon-Sen and Tom Lane
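The defensive pattern, as a standalone sketch:

    #include <errno.h>
    #include <unistd.h>

    /*
     * Read a symlink defensively: fail on error, treat a result that
     * (nearly) fills the buffer as out-of-range, and always add the
     * null terminator that readlink() omits.
     */
    static int
    read_link_checked(const char *path, char *buf, size_t bufsize)
    {
        ssize_t     len = readlink(path, buf, bufsize - 1);

        if (len < 0)
            return -1;              /* caller reports errno */
        if ((size_t) len >= bufsize - 1)
        {
            errno = ENAMETOOLONG;   /* possibly truncated */
            return -1;
        }
        buf[len] = '\0';
        return 0;
    }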
Seems to have been an oversight in the original leakproofness patch.
Per report and patch from Jeevan Chalke.
In passing, prettify some awkward leakproof-related code in AlterFunction.
specparse.y and specscanner.l used "string" as a token name. Now, bison
likes to define each token name as a macro for the token code it assigns,
which means those names are basically off-limits for any other use within
the grammar file or included headers. So names as generic as "string" are
dangerous. This is what was causing the recent failures on protosciurus:
some versions of Solaris' sys/kstat.h use "string" as a field name.
With late-model bison we don't see this problem because the token macros
aren't defined till later (that is why castoroides didn't show the problem
even though it's on the same machine). But protosciurus uses bison 1.875
which defines the token macros up front.
This land mine has been there from day one; we'd have found it sooner
except that protosciurus wasn't trying to run the isolation tests till
recently.
To fix, rename the token to "string_literal" which is hopefully less
likely to collide with names used by system headers. Back-patch to
all branches containing the isolation tests.
brin.sql included a call of brin_summarize_new_values(), and expected
it to always report exactly 5 summarization events. This failed sometimes
during parallel regression tests, as a consequence of the database-wide
VACUUM in gist.sql getting there first. The most future-proof way
to avoid variation in the test results is to forget about using
brin_summarize_new_values() and just do a plain "VACUUM brintest",
which will exercise the same code anyway.
Having done that, there's no need for preventing autovacuum on brintest;
doing so just reduces the scope of test coverage, so let's not.
Commit 9b74f32cdb did this for objects of
type jbvBinary, but in trying further to simplify some of the new jsonb
code I discovered that objects of type jbvObject or jbvArray passed as
WJB_ELEM or WJB_VALUE also caused problems. These too are now added
component by component.
Backpatch to 9.4.
brin_form_tuple calculated an exact tuple size, then palloc'd and
filled just that much. Later, brin_doinsert or brin_doupdate would
MAXALIGN the tuple size and tell PageAddItem that that was the size
of the tuple to insert. If the original tuple size wasn't a multiple
of MAXALIGN, the net result would be that PageAddItem would memcpy
a few more bytes than the palloc request had been for.
AFAICS, this is totally harmless in the real world: the error is a
read overrun not a write overrun, and palloc would certainly have
rounded the request up to a MAXALIGN multiple internally, so there's
no chance of the memcpy fetching off the end of memory. Valgrind,
however, is picky to the byte level not the MAXALIGN level.
Fix it by pushing the MAXALIGN step back to brin_form_tuple. (The other
possible source of tuples in this code, brin_form_placeholder_tuple,
was already producing a MAXALIGN'd result.)
In passing, be a bit more paranoid about internal allocations in
brin_form_tuple.
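A sketch of the fix's shape; MAXIMUM_ALIGNOF is configure-detected in the
real tree, 8 is assumed here:

    #include <stdint.h>
    #include <stdlib.h>

    #define MAXIMUM_ALIGNOF 8   /* assumed; configure-detected in reality */
    #define MAXALIGN(LEN) \
        (((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & \
         ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

    /*
     * Allocate the MAXALIGN'd size up front (zero-filled), so that later
     * code may safely copy MAXALIGN(len) bytes of the tuple.
     */
    static void *
    form_tuple(size_t len, size_t *padded_len)
    {
        *padded_len = MAXALIGN(len);
        return calloc(1, *padded_len);  /* palloc0 in the real code */
    }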
Multixact truncation is now handled differently, and this file hadn't
gotten the memo.
Per note from Amit Langote. I didn't use his patch, though.
Also update the description of infomask bits, which weren't completely up
to date either. This commit also propagates b01a4f6838 back to 9.3 and
9.4, which apparently I failed to do back then.
Some of this is made possible by commit
9b74f32cdb which lets pushJsonbValue
handle binary Jsonb values, meaning that clients no longer have to, and
some is just doing things in simpler and more straightforward ways.
Fix some places where pgindent did silly stuff, often because project
style wasn't followed to begin with. (I've not touched the atomics
headers, though.)
The name objectType is widely used as a field name, and it's pure luck that
this conflict has not caused pgindent to go crazy before. It messed up
pg_audit.c pretty badly, though. Since pg_shdepend.c doesn't export this
typedef and only uses it in three places, changing that seems saner than
changing the field usages.
Back-patch because we're contemplating using the union of all branch
typedefs for future pgindent runs, so this won't fix anything if it
stays the same in back branches.
Remove a bunch of "extern Datum foo(PG_FUNCTION_ARGS);" declarations that
are no longer needed now that PG_FUNCTION_INFO_V1(foo) provides that.
Some of these were evidently missed in commit e7128e8dbb, but others
were cargo-culted into code added since then. Possibly that can be blamed
in part on the fact that we'd not fixed relevant documentation examples,
which I've now done.
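For reference, the now-sufficient boilerplate for a V1 function, sketched
with a trivial example function:

    #include "postgres.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    /* PG_FUNCTION_INFO_V1() supplies the extern Datum declaration itself,
     * so a separate "extern Datum add_one(PG_FUNCTION_ARGS);" line is
     * redundant. */
    PG_FUNCTION_INFO_V1(add_one);

    Datum
    add_one(PG_FUNCTION_ARGS)
    {
        int32       arg = PG_GETARG_INT32(0);

        PG_RETURN_INT32(arg + 1);
    }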
Typo in commit 7cbee7c0a. No practical effect since the buffer should
never actually be overrun, but various compilers and static analyzers will
whine about it.
Petr Jelinek
Fix confusion in documentation, substantial memory leakage if float8 or
float4 are pass-by-reference, and assorted comments that were obsoleted
by commit 98edd617f3.
Previously, INSERT with ON CONFLICT DO UPDATE specified used a new
command tag -- UPSERT. It was introduced out of concern that INSERT as
a command tag would be a misrepresentation for ON CONFLICT DO UPDATE, as
some affected rows may actually have been updated.
Alvaro Herrera noticed that the implementation of that new command tag
was incomplete; in subsequent discussion we concluded that having it
doesn't provide benefits that are in line with the compatibility breaks
it requires.
Catversion bump due to the removal of PlannedStmt->isUpsert.
Author: Peter Geoghegan
Discussion: 20150520215816.GI5885@postgresql.org
Silly oversight in commit 1dc5ebc9077ab742079ce5dac9a6664248d42916:
when array2 is an expanded array, it might have array2->xpn.dnulls equal
to NULL, indicating the array is known null-free. The code wasn't
expecting that, because it formerly always used deconstruct_array() which
always delivers a nulls array.
Per bug #13334 from Regina Obe.
pushJsonbValue was accepting jbvBinary objects passed as WJB_ELEM or
WJB_VALUE data. While this succeeded, when those objects were later
encountered in attempting to convert the result to Jsonb, errors
occurred. With this change we guarantee that a JsonbValue constructed
from calls to pushJsonbValue does not contain any jbvBinary objects.
This cures a problem observed with jsonb_delete.
This means callers of pushJsonbValue no longer need to perform this
unpacking themselves. A subsequent patch will perform some cleanup in
that area.
The error was not triggered by any 9.4 code, but this is a publicly
visible routine, and so the error could be exercised by third party
code, therefore backpatch to 9.4.
Bug report from Peter Geoghegan, fix by me.
With commit de768844, a copy of the partial segment was archived with the
.partial suffix, but the original file was still left in pg_xlog, so it
didn't actually solve the problems with archiving the partial segment that
it was supposed to solve. With this patch, the partial segment is renamed
rather than copied, so we only archive it with the .partial suffix.
Also be more robust in detecting if the last segment is already being
archived. Previously I used XLogArchiveIsBusy() for that, but that's not
quite right. With archive_mode='always', there might be a .ready file for
it, and we don't want to rename it to .partial in that case.
The old segment is needed until we're fully committed to the new timeline,
i.e. until we've written the end-of-recovery WAL record and updated the
min recovery point and timeline in the control file. So move the renaming
later in the startup sequence, after all that's been done.
Paul Ramsey reported that commit 35fcb1b3d0
induced a core dump on commuted ORDER BY expressions, because it was
assuming that the indexorderby expression could be found verbatim in the
relevant equivalence class, but it wasn't there. We really don't need
anything that complicated anyway; for the data types likely to be used for
index ORDER BY operators in the foreseeable future, the exprType() of the
ORDER BY expression will serve fine. (The case where we'd have to work
harder is where the ORDER BY expression's result is only binary-compatible
with the declared input type of the ordering operator; long before worrying
about that, one would need to get rid of GiST's hard-wired assumption that
said datatype is float8.)
Aside from fixing that crash and adding a regression test for the case,
I did some desultory code review:
nodeIndexscan.c was likewise overthinking how hard it ought to work to
identify the datatype of the ORDER BY expressions.
Add comments explaining how come nodeIndexscan.c can get away with
simplifying assumptions about NULLS LAST ordering and no backward scan.
Revert no-longer-needed changes of find_ec_member_for_tle(); while the
new definition was no worse than the old, it wasn't better either, and
it might cause back-patching pain.
Revert entirely bogus additions to genam.h.
We want this struct to be exactly a series of 3 int16 words, no more
and no less. Historically, at least, some ARM compilers preferred to
pad it to 8 bytes unless coerced. Our old way of doing that was just
to use __attribute__((packed)), but as pointed out by Piotr Stefaniak,
that does too much: it also licenses the compiler to give the struct
only byte-alignment. We don't want that because it adds access overhead,
possibly quite significant overhead. According to the GCC manual, what
we want requires also specifying __attribute__((align(2))). It's not
entirely clear if all the relevant compilers accept this pragma as well,
but we can hope the buildfarm will tell us if not. We can also add a
static assertion that should fire if the compiler padded the struct.
Since the combination of these pragmas should define exactly what we
want on any compiler that accepts them, let's try using them wherever
we think they exist, not only for __arm__. (This is likely to expose
that the conditional definitions in c.h are inadequate, but finding
that out would be a good thing.)
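A minimal, self-contained sketch of the combination (simplified from
what c.h and itemptr.h actually do; the _Static_assert form requires
C11):

    #include <stdint.h>

    /* Exactly three 16-bit words: a block id split in two, plus an offset. */
    typedef struct ItemPointerData
    {
        uint16_t    ip_blkid_hi;
        uint16_t    ip_blkid_lo;
        uint16_t    ip_posid;
    }
    #if defined(__GNUC__) || defined(__clang__)
    __attribute__((packed))         /* no padding anywhere ...          */
    __attribute__((aligned(2)))     /* ... but keep 2-byte alignment    */
    #endif
    ItemPointerData;

    /* The static assertion: fires if the compiler padded the struct anyway. */
    _Static_assert(sizeof(ItemPointerData) == 3 * sizeof(uint16_t),
                   "ItemPointerData must be exactly 6 bytes");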
The immediate motivation for this is that the current definition of
ExecRowMark allows its curCtid field to be misaligned. It is not clear
whether there are any other uses of ItemPointerData with a similar hazard.
We could change the definition of ExecRowMark if this doesn't work, but
it would be far better to have a future-proof fix.
Piotr Stefaniak, some further hacking by me
Previously, even if recovery_target_action was set to pause and
the recovery target was reached, recovery could never be paused,
because the pause setting was *always* unexpectedly overridden by
the shutdown setting. This override is valid and intentional when
hot_standby is not enabled, since in that case there is no way to
resume the paused recovery and the pause setting is completely
useless. But it is not valid when hot_standby is enabled.
This patch changes the code so that the pause setting is overridden
by the shutdown setting only when hot_standby is not enabled.
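The corrected check amounts to something like this (a sketch; the
identifier names are from memory and may not match the tree exactly):

    /* Only force a shutdown when a paused recovery could never be resumed. */
    if (recoveryTargetAction == RECOVERY_TARGET_ACTION_PAUSE &&
        !EnableHotStandby)
        recoveryTargetAction = RECOVERY_TARGET_ACTION_SHUTDOWN;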
Bug reported by Andres Freund
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.
Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.
Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
This reverts commit 16304a0134, except
for its changes in src/port/snprintf.c; as well as commit
cac18a76bb which is no longer needed.
Fujii Masao reported that the previous commit caused failures in psql on
OS X, since if one exits the pager program early while viewing a query
result, psql sees an EPIPE error from fprintf --- and the wrapper function
thought that was reason to panic. (It's a bit surprising that the same
does not happen on Linux.) Further discussion among the security list
concluded that the risk of other such failures was far too great, and
that the one-size-fits-all approach to error handling embodied in the
previous patch is unlikely to be workable.
This leaves us again exposed to the possibility of the type of failure
envisioned in CVE-2015-3166. However, that failure mode is strictly
hypothetical at this point: there is no concrete reason to believe that
an attacker could trigger information disclosure through the supposed
mechanism. In the first place, the attack surface is fairly limited,
since so much of what the backend does with format strings goes through
stringinfo.c or psprintf(), and those already had adequate defenses.
In the second place, even granting that an unprivileged attacker could
control the occurrence of ENOMEM with some precision, it's a stretch to
believe that he could induce it just where the target buffer contains some
valuable information. So we concluded that the risk of non-hypothetical
problems induced by the patch greatly outweighs the security risks.
We will therefore revert, and instead undertake closer analysis to
identify specific calls that may need hardening, rather than attempt a
universal solution.
We have kept the portion of the previous patch that improved snprintf.c's
handling of errors when it calls the platform's sprintf(). That seems to
be an unalloyed improvement.
Security: CVE-2015-3166
Neither the deparsing of the new alias for INSERT's target table, nor
that of the inference clause, was supported. Also fix a typo in an
error message.
Add regression tests to test those code paths.
Author: Peter Geoghegan
Defer lookup of the opfamily and input type of a user-specified opclass
until the optimizer selects among available unique indexes; and store
the opclass in the parse analyzed tree instead. The primary reason for
doing this is that for rule deparsing it's easier to use the opclass
than the previous representation.
While at it also rename a variable in the inference code to better fit
its purpose.
This is separate from the actual fixes for deparsing to make review
easier.
The point of the assertion is to ensure that the arrays allocated on the
stack are large enough, but the check was one item short.
This won't matter in practice because MaxIndexTuplesPerPage is an
overestimate, so you can't have that many items on a page in reality.
But let's be tidy.
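In sketch form (collect_deletable_items is hypothetical; the point is
the comparison operator):

    OffsetNumber deletable[MaxIndexTuplesPerPage];
    int         ndeletable;

    ndeletable = collect_deletable_items(page, deletable);  /* hypothetical */

    /* A full array is still in range, so the bound must be <=, not <. */
    Assert(ndeletable <= MaxIndexTuplesPerPage);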
Spotted by Anastasia Lubennikova. Backpatch to all supported versions, like
the patch that added the assertion.
No index in template0 should have collation-dependent ordering, especially
not indexes on shared catalogs. For most textual columns we avoid this
issue by using type "name" (which sorts per strcmp()). However there are a
few indexed columns that we'd prefer to use "text" for, and for that, the
default opclass text_ops is unsafe. Fortunately, text_pattern_ops is safe
(it sorts per memcmp()), and it has no real functional disadvantage for our
purposes. So change the indexes on pg_seclabel.provider and
pg_shseclabel.provider to use text_pattern_ops.
In passing, also mark pg_replication_origin.roname as using
text_pattern_ops --- for some reason it was labeled varchar_pattern_ops
which is just wrong, even though it accidentally worked.
Add regression test queries to catch future errors of these kinds.
We still can't do anything about the misdeclared pg_seclabel and
pg_shseclabel indexes in back branches :-(
The plain C string language name needs to be wrapped in makeString() so
that the parse tree is copyable. This is detectable by
-DCOPY_PARSE_PLAN_TREES. Add a test case for the COMMENT case.
Also make the quoting in the error messages more consistent.
discovered by Tom Lane
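The shape of the fix, as a hedged sketch (the field and variable names
here are illustrative, not the committed code):

    /* before: a bare C string is not a Node, so copyObject() mangles it */
    stmt->object = (Node *) language_name;

    /* after: wrap it in a String node, making the parse tree copyable */
    stmt->object = (Node *) makeString(pstrdup(language_name));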
These were "text", but that's a bad idea because it has collation-dependent
ordering. No index in template0 should have collation-dependent ordering,
especially not indexes on shared catalogs. There was general agreement
that provider names don't need to be longer than other identifiers, so we
can fix this at a small waste of table space by changing from text to name.
There's no way to fix the problem in the back branches, but we can hope
that security labels don't yet have widespread-enough usage to make it
urgent to fix.
There needs to be a regression sanity test to prevent us from making this
same mistake again; but before putting that in, we'll need to get rid of
similar brain fade in the recently-added pg_replication_origin catalog.
Note: for lack of a suitable testing environment, I've not really exercised
this change. I trust the buildfarm will show up any mistakes.
The previous coding was a leftover from attempting to hang all the
ON CONFLICT logic onto ModifyTable's child nodes. It appears not to
have actually caused problems except for EXPLAIN.
Add a test exercising the broken code path and some other code paths.
Author: Peter Geoghegan and Andres Freund
Commit 83e176ec18 removed the longstanding
support functions for block sampling without any consideration of the
impact this would have on third-party FDWs. The new API is not notably
more functional for FDWs than the old, so forcing them to change doesn't
seem like a good thing. We can provide the old API as a wrapper (more
or less) around the new one for a minimal amount of extra code.
PostgreSQL already checked the vast majority of these, missing this
handful that nearly cannot fail. If putenv() failed with ENOMEM in
pg_GSS_recvauth(), authentication would proceed with the wrong keytab
file. If strftime() returned zero in cache_locale_time(), using the
unspecified buffer contents could lead to information exposure or a
crash. Back-patch to 9.0 (all supported versions).
Other unchecked calls to these functions, especially those in frontend
code, pose negligible security concern. This patch does not address
them. Nonetheless, it is always better to check return values whose
specification provides for indicating an error.
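The pattern being enforced looks like this (a standalone sketch, not
the patched code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void
    format_now(char *buf, size_t buflen)
    {
        time_t      now = time(NULL);
        struct tm  *tm = localtime(&now);

        /*
         * strftime() returns 0 when the result does not fit; the buffer
         * contents are then unspecified and must not be used.
         */
        if (tm == NULL || strftime(buf, buflen, "%x %X", tm) == 0)
        {
            fprintf(stderr, "strftime failed\n");
            exit(EXIT_FAILURE);
        }
    }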
In passing, fix an off-by-one error in strftime_win32()'s invocation of
WideCharToMultiByte(). Upon retrieving a value of exactly MAX_L10N_DATA
bytes, strftime_win32() would overrun the caller's buffer by one byte.
MAX_L10N_DATA is chosen to exceed the length of every possible value, so
the vulnerable scenario probably does not arise.
Security: CVE-2015-3166
All known standard library implementations of these functions can fail
with ENOMEM. A caller neglecting to check for failure would experience
missing output, information exposure, or a crash. Check return values
within wrappers and code, currently just snprintf.c, that bypasses the
wrappers. The wrappers do not return after an error, so their callers
need not check. Back-patch to 9.0 (all supported versions).
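The wrapper idea, in sketch form (illustrative, not the committed code):

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Report and die on failure, so that callers need not check. */
    static int
    fprintf_checked(FILE *stream, const char *fmt, ...)
    {
        va_list     args;
        int         rc;

        va_start(args, fmt);
        rc = vfprintf(stream, fmt, args);
        va_end(args);

        if (rc < 0)             /* e.g. ENOMEM inside the platform printf */
        {
            perror("fprintf");
            abort();
        }
        return rc;
    }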
Popular free software standard library implementations do take pains to
bypass malloc() in simple cases, but they risk ENOMEM for floating point
numbers, positional arguments, large field widths, and large precisions.
No specification demands such caution, so this commit regards every call
to a printf family function as a potential threat.
Injecting the wrappers implicitly is a compromise between patch scope
and design goals. I would prefer to edit each call site to name a
wrapper explicitly. libpq and the ECPG libraries would, ideally, convey
errors to the caller rather than abort(). All that would be painfully
invasive for a back-patched security fix, hence this compromise.
Security: CVE-2015-3166
Reentering this function with the right timing caused a double free,
typically crashing the backend. By synchronizing a disconnection with
the authentication timeout, an unauthenticated attacker could achieve
this somewhat consistently. Call be_tls_close() solely from within
proc_exit_prepare(). Back-patch to 9.0 (all supported versions).
Benkocs Norbert Attila
Security: CVE-2015-3165
This oversight results in a crash at executor startup if the plan has
been copied. outfuncs.c was missed as well.
While we could probably have taught both those files to cope with the
originally chosen representation of an Oid array, it would have been
painful, not least because there'd be no easy way to verify the array
length. An Oid List is far easier to work with. And AFAICS, there is
no particular notational benefit to using an array rather than a list
in the existing parts of the patch either. So just change it to a list.
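For instance (a hedged sketch; the list name is illustrative):

    List       *opOids = NIL;   /* illustrative name */
    ListCell   *lc;

    /* building: the length is implicit, no separate count to keep in sync */
    opOids = lappend_oid(opOids, opOid);

    /* and the usual iteration idiom */
    foreach(lc, opOids)
    {
        Oid     thisOp = lfirst_oid(lc);

        /* ... */
    }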
Error in commit 35fcb1b3d0, which is new,
so no need for back-patch.
Previously, this prevented promoted standby servers from being upgraded
because of a missing WAL history file. (Timeline 1 doesn't need a
history file, and we don't copy WAL files anyway.)
Report by Christian Echerer(?), Alexey Klyukin
Backpatch through 9.0
This patch causes pg_upgrade to error out during its check phase if:
(1) template0 is marked connectable
or
(2) any other database is marked non-connectable
This is done because, in the first case, pg_upgrade would fail because
the pg_dumpall --globals restore would fail, and in the second case, the
database would not be restored, leading to data loss.
Report by Matt Landry (1), Stephen Frost (2)
Backpatch through 9.0
This SQL-standard functionality allows data to be aggregated by several
GROUP BY clauses at once. Each grouping set returns rows in which the
columns it does not group by are set to NULL.
This could previously be achieved by doing each grouping as a separate
query, conjoined by UNION ALLs. Besides being considerably more concise,
grouping sets will in many cases be faster, requiring only one scan over
the underlying data.
The current implementation of grouping sets only supports using sorting
for input. Individual sets that share a sort order are computed in one
pass. If there are sets that don't share a sort order, additional sort &
aggregation steps are performed. These additional passes are sourced
from the previous sort step, thus avoiding repeated scans of the source
data.
The code is structured in a way that adding support for purely using
hash aggregation or a mix of hashing and sorting is possible. Sorting
was chosen to be supported first, as it is the most generic method of
implementation.
Instead of representing the chain of sort and aggregation steps as
full-blown planner and executor nodes, as earlier versions of the patch
did, all but the first sort are performed inside the aggregation node
itself. This avoids the need for unusual gymnastics to handle returning
aggregated and non-aggregated tuples from underlying nodes, as well as
having to shut down underlying nodes early to limit memory usage. The
optimizer still builds a Sort/Agg node to describe each phase, but
they're not part of the plan tree; instead they are additional data for
the aggregation node. They're a convenient and preexisting way to
describe aggregation and sorting. The first (and possibly only) sort
step is still performed as a separate execution step. That retains
similarity with existing GROUP BY plans, makes rescans fairly simple,
avoids very deep plans (leading to slow EXPLAINs), and easily allows
the sorting step to be skipped when the underlying data is already
sorted by other means.
A somewhat ugly side of this patch is having to deal with a grammar
ambiguity between the new CUBE keyword and the cube extension/functions
named cube (and rollup). To avoid breaking existing deployments of the
cube extension, it has not been renamed, nor has cube been made a
reserved keyword. Instead, precedence hacking is used to make GROUP BY
cube(..) refer to the CUBE grouping sets feature, and not the function
cube(). To actually group by a function cube(), unlikely as that might
be, the function name has to be quoted.
Needs a catversion bump because stored rules may change.
Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund
Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas
Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule
Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com
DST law changes in Egypt, Mongolia, Palestine.
Historical corrections for Canada and Chile.
Revised zone abbreviation for America/Adak (HST/HDT not HAST/HADT).
This lets BRIN be used with R-Tree-like indexing strategies.
Also provided are operator classes for range types, box and inet/cidr.
The infrastructure provided here should be sufficient to create operator
classes for similar datatypes; for instance, opclasses for PostGIS
geometries should be doable, though we didn't try to implement one.
(A box/point opclass was also submitted, but we ripped it out before
commit because the handling of floating point comparisons in existing
code is inconsistent and would generate corrupt indexes.)
Author: Emre Hasegeli. Cosmetic changes by me
Review: Andreas Karlsson
For upcoming BRIN opclasses, it's convenient to have strategy numbers
defined in a single place. Since there's nothing appropriate, create
it. The StrategyNumber typedef now lives there, as well as existing
strategy numbers for B-trees (from skey.h) and R-tree-and-friends (from
gist.h). skey.h is forced to include stratnum.h because of the
StrategyNumber typedef, but gist.h is not; extensions that currently
rely on gist.h for rtree strategy numbers might need to add a new
include of stratnum.h.
A few .c files can stop including skey.h and/or gist.h, which is a nice
side benefit.
Per discussion:
https://www.postgresql.org/message-id/20150514232132.GZ2523@alvh.no-ip.org
Authored by Emre Hasegeli and Álvaro.
(It's not clear to me why bootscanner.l has any #include lines at all.)
Our previous code for GB18030 <-> UTF8 conversion only covered Unicode code
points up to U+FFFF, but the actual spec defines conversions for all code
points up to U+10FFFF. That would be rather impractical as a lookup table,
but fortunately there is a simple algorithmic conversion between the
additional code points and the equivalent GB18030 byte patterns. Make use
of the just-added callback facility in LocalToUtf/UtfToLocal to perform the
additional conversions.
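The algorithmic part works roughly as follows (a sketch with
illustrative names; the arithmetic matches the published GB18030
mapping, e.g. 0x90308130 -> U+10000 and 0xE3329A35 -> U+10FFFF):

    #include <stdint.h>

    /*
     * Treat a four-byte GB18030 sequence as a linear index: the bytes
     * run 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39 respectively.
     */
    static uint32_t
    gb_linear(uint32_t gb)
    {
        uint32_t    b0 = (gb >> 24) & 0xff;
        uint32_t    b1 = (gb >> 16) & 0xff;
        uint32_t    b2 = (gb >> 8) & 0xff;
        uint32_t    b3 = gb & 0xff;

        return (((b0 - 0x81) * 10 + (b1 - 0x30)) * 126 +
                (b2 - 0x81)) * 10 + (b3 - 0x30);
    }

    /* U+10000..U+10FFFF map one-to-one onto 0x90308130..0xE3329A35. */
    static uint32_t
    unicode_from_gb(uint32_t gb)
    {
        return 0x10000 + gb_linear(gb) - gb_linear(0x90308130);
    }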
Having created the infrastructure to do that, we can use the same code to
map certain linearly-related subranges of the Unicode space below U+FFFF,
allowing removal of the corresponding lookup table entries. This more
than halves the lookup table size, which is a substantial savings;
utf8_and_gb18030.so drops from nearly a megabyte to about half that.
In support of doing that, replace ISO10646-GB18030.TXT with the data file
gb-18030-2000.xml (retrieved from
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/ )
in which these subranges have been deleted from the simple lookup entries.
Per bug #12845 from Arjen Nienhuis. The conversion code added here is
based on his proposed patch, though I whacked it around rather heavily.
Add a TABLESAMPLE clause to SELECT statements that allows the
user to specify random BERNOULLI sampling or block-level
SYSTEM sampling. The implementation allows extensible
sampling functions to be written, using a standard API.
The basic version follows the SQL standard exactly. Usable,
concrete use cases for the sampling API follow in later
commits.
Petr Jelinek
Reviewed by Michael Paquier and Simon Riggs
The expected-output files for these tests were broken by the recent
addition of a warning for hash indexes. Update them.
Also add a test case for GB18030 encoding, similar to the other ones.
This is a pretty weak test, but it's better than nothing.
The expected output contained some floating point values which might get
rounded slightly differently on different platforms. The exact output isn't
very interesting in this test, so just round it.
Per buildfarm member rover_firefly.
We can only support a lossy distance function when the distance function's
datatype is comparable with the original ordering operator's datatype.
The distance function always returns a float8, so we are limited to float8,
and float4 (by a hard-coded cast of the float8 to float4).
In light of this limitation, it seems like a good idea to have a separate
'recheck' flag for the ORDER BY expressions, so that if you have a non-lossy
distance function, it still works with lossy quals. There are no such
cases among the built-in or contrib opclasses, but it's plausible.
There was a hidden assumption that the ORDER BY values returned by GiST
match the original ordering operator's return type, but there are plenty
of examples where that's not true, e.g. in btree_gist and pg_trgm. As long
as the distance function is not lossy, we can tolerate that and just not
return the distance to the executor (or rather, always return NULL). The
executor doesn't need the distances if there are no lossy results.
There was another little bug: the recheck variable was not initialized
before calling the distance function. That revealed the bigger issue,
as the executor tried to reorder tuples that didn't need reordering, and
that failed because of the datatype mismatch.
The previous coding effectively only verified that the second byte of a
multibyte character was in the expected range; moreover, it wasn't careful
to make sure that the second byte even exists in the buffer before touching
it. The latter seems unlikely to cause any real problems in the field
(in particular, it could never be a problem with null-terminated input),
but it's still a bug.
Since GB18030 is not a supported backend encoding, the only thing we'd
really be doing with GB18030 text is converting it to UTF8 in LocalToUtf,
which would fail anyway on any invalid character for lack of a match in
its lookup table. So the only user-visible consequence of this change
should be that you'll get "invalid byte sequence for encoding" rather than
"character has no equivalent" for malformed GB18030 input. However,
impending changes to the GB18030 conversion code will require these tighter
up-front checks to avoid producing bogus results.
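The tighter check amounts to validating every byte of the sequence,
roughly like this (a sketch; the function name is illustrative):

    /*
     * Return the length in bytes of the GB18030 character at *s, or -1
     * if it is malformed or truncated within the len bytes available.
     */
    static int
    gb18030_char_length(const unsigned char *s, int len)
    {
        if (len < 1)
            return -1;
        if (s[0] <= 0x7f)
            return 1;                   /* ASCII */
        if (s[0] < 0x81 || s[0] > 0xfe)
            return -1;                  /* bad lead byte */
        if (len < 2)
            return -1;                  /* truncated */
        if (s[1] >= 0x40 && s[1] <= 0xfe && s[1] != 0x7f)
            return 2;                   /* two-byte character */
        if (s[1] >= 0x30 && s[1] <= 0x39)
        {
            if (len < 4)
                return -1;              /* truncated four-byte character */
            if (s[2] >= 0x81 && s[2] <= 0xfe &&
                s[3] >= 0x30 && s[3] <= 0x39)
                return 4;               /* four-byte character */
        }
        return -1;
    }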
The distance function can now set *recheck = false, like index quals. The
executor will then re-check the ORDER BY expressions, and use a queue to
reorder the results on the fly.
This makes it possible to do kNN-searches on polygons and circles, which
don't store the exact value in the index, but just a bounding box.
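From an opclass author's point of view, the contract looks roughly like
this (a hedged sketch: my_distance and the lower-bound computation are
illustrative, but the recheck flag arrives as the fifth argument of a
GiST distance support function):

    #include "postgres.h"
    #include "access/gist.h"
    #include "fmgr.h"

    Datum
    my_distance(PG_FUNCTION_ARGS)
    {
        GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
        bool       *recheck = (bool *) PG_GETARG_POINTER(4);
        double      dist;

        /* hypothetical: a cheap lower bound computed from the bounding box */
        dist = bbox_distance_lower_bound(entry);

        /* lossy: ask the executor to recompute and reorder exactly */
        *recheck = true;

        PG_RETURN_FLOAT8(dist);
    }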
Alexander Korotkov and me
When this option is specified, a progress report is printed as each index
is reindexed.
Per discussion, we agreed on the following syntax for the extensibility of
the options.
REINDEX (flexible options) { INDEX | ... } name
Sawada Masahiko.
Reviewed by Robert Haas, Fabrízio Mello, Alvaro Herrera, Kyotaro Horiguchi,
Jim Nasby and me.
Discussion: CAD21AoA0pK3YcOZAFzMae+2fcc3oGp5zoRggDyMNg5zoaWDhdQ@mail.gmail.com
Until now, these functions have only supported encoding conversions using
lookup tables, which is fine as long as there aren't too many code points
to convert. However, GB18030 expects all 1.1 million Unicode code points
to be convertible, which would require a ridiculously-sized lookup table.
Fortunately, a large fraction of those conversions can be expressed through
arithmetic, ie the conversions are one-to-one in certain defined ranges.
To support that, provide a callback function that is used after consulting
the lookup tables. (This patch doesn't actually change anything about the
GB18030 conversion behavior, just provides infrastructure for fixing it.)
Since this requires changing the APIs of UtfToLocal/LocalToUtf anyway,
take the opportunity to rearrange their argument lists into what seems
to me a saner order. And beautify the call sites by using lengthof()
instead of error-prone sizeof() arithmetic.
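In outline (a hedged sketch; the real declarations are in
src/include/mb/pg_wchar.h and c.h, and the details here are
approximate):

    #include <stdint.h>

    /*
     * Consulted after the lookup tables find no match; returns the
     * converted code, or 0 if no algorithmic conversion applies either.
     */
    typedef uint32_t (*utf_local_conversion_func) (uint32_t code);

    /* lengthof(): the element count of a fixed-size array, as in c.h. */
    #define lengthof(array) (sizeof (array) / sizeof ((array)[0]))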
In passing, also mark all the lookup tables used by these calls "const".
This moves an impressive amount of stuff into the text segment, at least
on my machine, and is safer anyhow.
This patch introduces the ability for complex datatypes to have an
in-memory representation that is different from their on-disk format.
On-disk formats are typically optimized for minimal size, and in any case
they can't contain pointers, so they are often not well-suited for
computation. Now a datatype can invent an "expanded" in-memory format
that is better suited for its operations, and then pass that around among
the C functions that operate on the datatype. There are also provisions
(rudimentary as yet) to allow an expanded object to be modified in-place
under suitable conditions, so that operations like assignment to an element
of an array need not involve copying the entire array.
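The heart of the mechanism is a per-object methods table, roughly like
this (simplified from expandeddatum.h; treat the details as
approximate):

    #include <stddef.h>

    typedef struct ExpandedObjectHeader ExpandedObjectHeader;

    /*
     * Every expanded object can report how large its flat (on-disk)
     * form would be, and can write that flat form into a caller-supplied
     * buffer; everything else about its in-memory layout is up to the
     * datatype.
     */
    typedef struct ExpandedObjectMethods
    {
        size_t  (*get_flat_size) (ExpandedObjectHeader *eohptr);
        void    (*flatten_into) (ExpandedObjectHeader *eohptr,
                                 void *result, size_t allocated_size);
    } ExpandedObjectMethods;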
The initial application for this feature is arrays, but it is not hard
to foresee using it for other container types like JSON, XML and hstore.
I have hopes that it will be useful to PostGIS as well.
In this initial implementation, a few heuristics have been hard-wired
into plpgsql to improve performance for arrays that are stored in
plpgsql variables. We would like to generalize those hacks so that
other datatypes can obtain similar improvements, but figuring out some
appropriate APIs is left as a task for future work. (The heuristics
themselves are probably not optimal yet, either, as they sometimes
force expansion of arrays that would be better left alone.)
Preliminary performance testing shows impressive speed gains for plpgsql
functions that do element-by-element access or update of large arrays.
There are other cases that get a little slower, as a result of added array
format conversions; but we can hope to improve anything that's annoyingly
bad. In any case most applications should see a net win.
Tom Lane, reviewed by Andres Freund