If we ANALYZE only selected columns of a table, we should not postpone
auto-analyze because of that; other columns may well still need stats
updates. As committed, the counter is left alone if a column list is
given, whether or not it includes all analyzable columns of the table.
Per complaint from Tomasz Ostrowski.
It's been like this a long time, so back-patch to all supported branches.
Report: <ef99c1bd-ff60-5f32-2733-c7b504eb960c@ato.waw.pl>
If a Gather node has read as many tuples as it needs (for example, due
to Limit) it may detach the queue connecting it to the worker before
reading all of the worker's tuples. Rather than let the worker
continue to generate and send all of the results, have it stop after
sending the next tuple.
More could be done here to stop the worker even quicker, but this is
about as well as we can hope to do for 9.6.
This is in response to a problem report from Andreas Seltenreich.
Commit 44339b892a should be actually be
sufficient to fix that example even without this change, but it seems
better to do this, too, since we might otherwise waste quite a large
amount of effort in one or more workers.
Discussion: CAA4eK1KOKGqmz9bGu+Z42qhRwMbm4R5rfnqsLCNqFs9j14jzEA@mail.gmail.com
Amit Kapila
Prior to this patch, it was occasionally possible, after shm_mq_sendv
had previously returned SHM_MQ_DETACHED, for a later shm_mq_sendv
operation to fail an assertion instead of just again returning
SHM_MQ_ATTACHED. From the shm_mq code's point of view, it was
expecting to be called again with the same arguments, since the
previous operation had only partially completed. However, a caller
who isn't using non-blocking mode won't be prepared to repeat the call
with the same arguments, and this code shouldn't expect that they
will. Repair in such a way that we'll be OK whether the next call
uses the same arguments or not.
Found by Andreas Seltenreich. Analysis and sketch of fix by Amit
Kapila. Patch by me, reviewed by Amit Kapila.
Fix still another bug in commit 35fcb1b3d: it failed to fully initialize
the SortSupport states it introduced to allow the executor to re-check
ORDER BY expressions containing distance operators. That led to a null
pointer dereference if the sortsupport code tried to use ssup_cxt. The
problem only manifests in narrow cases, explaining the lack of previous
field reports. It requires a GiST-indexable distance operator that lacks
SortSupport and is on a pass-by-ref data type, which among core+contrib
seems to be only btree_gist's interval opclass; and it requires the scan
to be done as an IndexScan not an IndexOnlyScan, which explains how
btree_gist's regression test didn't catch it. Per bug #14134 from
Jihyun Yu.
Peter Geoghegan
Report: <20160511154904.2603.43889@wrigleys.postgresql.org>
It'd be good for "(x AND y) AND z" to produce a three-child AND node
whether or not operator_precedence_warning is on, but that failed to
happen when it's on because makeAndExpr() didn't look through the added
AEXPR_PAREN node. This has no effect on generated plans because prepqual.c
would flatten the AND nest anyway; but it does affect the number of parens
printed in ruleutils.c, for example. I'd already fixed some similar
hazards in parse_expr.c in commit abb164655, but didn't think to search
gram.y for problems of this ilk. Per gripe from Jean-Pierre Pelletier.
Report: <fa0535ec6d6428cfec40c7e8a6d11156@mail.gmail.com>
This attempts to buy back some of whatever performance we lost from fixing
bug #14174 by inlining the initial checks in MakeExpandedObjectReadOnly()
into the callers. We can do that in a macro without creating multiple-
evaluation hazards, so it's pretty much free notationally; and the amount
of code added to callers should be minimal as well. (Testing a value can't
take many more instructions than passing it to a subroutine.)
Might as well inline DatumIsReadWriteExpandedObject() while we're at it.
This is an ABI break for callers, so it doesn't seem safe to put into 9.5,
but I see no reason not to do it in HEAD.
Further thought about bug #14174 motivated me to try the case of a
R/W datum being returned from a VALUES list, and sure enough it was
broken. Fix that.
Also add a regression test case exercising the same scenario for
FunctionScan. That's not broken right now, because the function's
result will get shoved into a tuplestore between generation and use;
but it could easily become broken whenever we get around to optimizing
FunctionScan better.
There don't seem to be any other places where we put the result of
expression evaluation into a virtual tuple slot that could then be
the source for Vars of further expression evaluation, so I think
this is the end of this bug.
If a plan node output expression returns an "expanded" datum, and that
output column is referenced in more than one place in upper-level plan
nodes, we need to ensure that what is returned is a read-only reference
not a read/write reference. Otherwise one of the referencing sites could
scribble on or even delete the expanded datum before we have evaluated the
others. Commit 1dc5ebc907, which introduced this feature, supposed
that it'd be sufficient to make SubqueryScan nodes force their output
columns to read-only state. The folly of that was revealed by bug #14174
from Andrew Gierth, and really should have been immediately obvious
considering that the planner will happily optimize SubqueryScan nodes
out of the plan without any regard for this issue.
The safest fix seems to be to make ExecProject() force its results into
read-only state; that will cover every case where a plan node returns
expression results. Actually we can delegate this to ExecTargetList()
since we can recursively assume that plain Vars will not reference
read-write datums. That should keep the extra overhead down to something
minimal. We no longer need ExecMakeSlotContentsReadOnly(), which was
introduced only in support of the idea that just a few plan node types
would need to do this.
In the future it would be nice to have the planner account for this problem
and inject force-to-read-only expression evaluation nodes into only the
places where there's a risk of multiple evaluation. That's not a suitable
solution for 9.5 or even 9.6 at this point, though.
Report: <20160603124628.9932.41279@wrigleys.postgresql.org>
The partial paths that get modified may already have been used as
part of a GatherPath which appears in the path list, so modifying
them is not a good idea at this stage - especially because this
code has no check that the PathTarget is in fact parallel-safe.
When partial aggregation is being performed, this is actually
harmless because we'll end up replacing the pathtargets here with
the correct ones within create_grouping_paths(). But if we've got
a query tree containing only scan/join operations then this can
result in incorrectly pushing down parallel-restricted target
list entries. If those are, for example, references to subqueries,
that can crash the server; but it's wrong in any event.
Amit Kapila
The "snapshot too old" condition was not being recognized when
using a copied snapshot, since the original timestamp and lsn were
not being passed along. Noticed when testing the combination of
"snapshot too old" with parallel query execution.
Mostly these are just comments but there are a few in documentation
and a handful in code and tests. Hopefully this doesn't cause too much
unnecessary pain for backpatching. I relented from some of the most
common like "thru" for that reason. The rest don't seem numerous
enough to cause problems.
Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings
Per post-commit review comments from Andres Freund, improve variable
names, comments, and in one place, slightly improve the code structure.
Masahiko Sawada
Use MAXALIGN size/alignment to guarantee that later uses of memory are
aligned correctly. E.g. epoll_event might need 8 byte alignment on some
platforms, but earlier allocations like WaitEventSet and WaitEvent might
not sized to guarantee that when purely using sizeof().
Found by myself while testing on an Sun Ultra 5 (Sparc IIi) with some
editorializing by Andres Freund.
In passing fix a couple typos in the area
Commit 2ed5b87f96 introduced a bug in
mark/restore, in an attempt to optimize repeated restores to the
same page. This caused an assertion failure during a merge join
which fed directly from an index scan, although the impact would
not be limited to that case. Revert the bad chunk of code from
that commit.
While investigating this bug it was discovered that a particular
"paranoia" set of the mark position field would not prevent bad
behavior; it would just make it harder to diagnose. Change that
into an assertion, which will draw attention to any future problem
in that area more directly.
Backpatch to 9.5, where the bug was introduced.
Bug #14169 reported by Shinta Koyanagi.
Preliminary analysis by Tom Lane identified which commit caused
the bug.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
BRIN was relying on the ability to remove a tuple from an index page,
then putting another tuple in the same line pointer. But PageAddItem
refuses to add a tuple beyond the first free item past the last used
item, and in particular, it rejects an attempt to add an item to an
empty page anywhere other than the first line pointer. PageAddItem
issues a WARNING and indicates to the caller that it failed, which in
turn causes the BRIN calling code to issue a PANIC, so the whole
sequence looks like this:
WARNING: specified item offset is too large
PANIC: failed to add BRIN tuple
To fix, create a new function PageAddItemExtended which is like
PageAddItem except that the two boolean arguments become a flags bitmap;
the "overwrite" and "is_heap" boolean flags in PageAddItem become
PAI_OVERWITE and PAI_IS_HEAP flags in the new function, and a new flag
PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN.
PageAddItem() retains its original signature, for compatibility with
third-party modules (other callers in core code are not modified,
either).
Also, in the belt-and-suspenders spirit, I added a new sanity check in
brinGetTupleForHeapBlock to raise an error if an TID found in the revmap
is not marked as live by the page header. This causes it to react with
"ERROR: corrupted BRIN index" to the bug at hand, rather than a hard
crash.
Backpatch to 9.5.
Bug reported by Andreas Seltenreich as detected by his handy sqlsmith
fuzzer.
Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu
The IF EXISTS option was documented, and implemented in the grammar, but
it didn't actually work for lack of support in does_not_exist_skipping().
Per bug #14160.
Report and patch by Kouhei Sutou
Report: <20160527070433.19424.81712@wrigleys.postgresql.org>
If both timeout indicators are set when we arrive at ProcessInterrupts,
we've historically just reported "lock timeout". However, some buildfarm
members have been observed to fail isolationtester's timeouts test by
reporting "lock timeout" when the statement timeout was expected to fire
first. The cause seems to be that the process is allowed to sleep longer
than expected (probably due to heavy machine load) so that the lock
timeout happens before we reach the point of reporting the error, and
then this arbitrary tiebreak rule does the wrong thing. We can improve
matters by comparing the scheduled timeout times to decide which error
to report.
I had originally proposed greatly reducing the 1-second window between
the two timeouts in the test cases. On reflection that is a bad idea,
at least for the case where the lock timeout is expected to fire first,
because that would assume that it takes negligible time to get from
statement start to the beginning of the lock wait. Thus, this patch
doesn't completely remove the risk of test failures on slow machines.
Empirically, however, the case this handles is the one we are seeing
in the buildfarm. The explanation may be that the other case requires
the scheduler to take the CPU away from a busy process, whereas the
case fixed here only requires the scheduler to not give the CPU back
right away to a process that has been woken from a multi-second sleep
(and, perhaps, has been swapped out meanwhile).
Back-patch to 9.3 where the isolationtester timeouts test was added.
Discussion: <8693.1464314819@sss.pgh.pa.us>
As part of upper planner pathification (commit 3fc6e2d7f5) I redid
createplan.c's approach to the physical-tlist optimization, in which scan
nodes are allowed to return exactly the underlying table's columns so as
to save doing a projection step at runtime. The logic was intentionally
more aggressive than before about applying the optimization, which is
generally a good thing, but Andres Freund found a case in which it got
too aggressive. Namely, if any column is referenced more than once in
the parent plan node's sorting or grouping column list, we can't optimize
because then that column would need to have more than one ressortgroupref
label, and we only have space for one.
Add logic to detect this situation in use_physical_tlist(), and also add
some error checking in apply_pathtarget_labeling_to_tlist(), which this
example proves was being overly cavalier about whether what it was doing
made any sense.
The added test case exposes the problem only because we do not eliminate
duplicate grouping keys. That might be something to fix someday, but it
doesn't seem like appropriate post-beta work.
Report: <20160526021235.w4nq7k3gnheg7vit@alap3.anarazel.de>
Dating back to commit f10b63923, our grammar has allowed "USING" to
optionally appear before an opclass name in CREATE INDEX (and, lately,
some related places such as ON CONFLICT specifications). Nikolay Shaplov
noticed that this syntax existed but wasn't documented, and proposed
documenting it. But what seems like a better idea is to remove the
production, thereby making the code match the docs not vice versa.
This isn't our usual modus operandi for such cases, but there are a
couple of good reasons to proceed this way:
* So far as I can find, this syntax has never been documented anywhere.
It isn't relied on by any of our own code or test cases, and there seems
little reason to suppose that it's been used in the wild either.
* Documenting it would mean that there would be two separate uses of
USING in the CREATE INDEX syntax, the other being "USING access_method".
That can lead to nothing but confusion.
So, let's just remove it. On the off chance that somebody somewhere
is using it, this isn't something to back-patch, but we can fix it
in HEAD.
Discussion: <1593237.l7oKHRpxSe@nataraj-amd64>
Ever since we split the statistics collector's reports into per-database
files (commit 187492b6c2), backends have been seeing stale statistics
for shared catalogs. This is because the inquiry message only prompts the
collector to write the per-database file for the requesting backend's own
database. Stats for shared catalogs are in a separate file for "DB 0",
which didn't get updated.
In normal operation this was partially masked by the fact that the
autovacuum launcher would send an inquiry message at least once per
autovacuum_naptime that asked for "DB 0"; so the shared-catalog stats would
never be more than a minute out of date. However the problem becomes very
obvious with autovacuum disabled, as reported by Peter Eisentraut.
To fix, redefine the semantics of inquiry messages so that both the
specified DB and DB 0 will be dumped. (This might seem a bit inefficient,
but we have no good way to know whether a backend's transaction will look
at shared-catalog stats, so we have to read both groups of stats whenever
we request stats. Sending two inquiry messages would definitely not be
better.)
Back-patch to 9.3 where the bug was introduced.
Report: <56AD41AC.1030509@gmx.net>
Because vac_update_datfrozenxid() updates datfrozenxid and datminmxid
in-place, it's unsafe to assume that successive reads of those values will
give consistent results. Fetch each one just once to ensure sane behavior
in the minimum calculation. Noted while reviewing Alexander Korotkov's
patch in the same area.
Discussion: <8564.1464116473@sss.pgh.pa.us>
vac_truncate_clog() uses its own transaction ID as the comparison point in
a sanity check that no database's datfrozenxid has already wrapped around
"into the future". That was probably fine when written, but in a lazy
vacuum we won't have assigned an XID, so calling GetCurrentTransactionId()
causes an XID to be assigned when otherwise one would not be. Most of the
time that's not a big problem ... but if we are hard up against the
wraparound limit, consuming XIDs during antiwraparound vacuums is a very
bad thing.
Instead, use ReadNewTransactionId(), which not only avoids this problem
but is in itself a better comparison point to test whether wraparound
has already occurred.
Report and patch by Alexander Korotkov. Back-patch to all versions.
Report: <CAPpHfdspOkmiQsxh-UZw2chM6dRMwXAJGEmmbmqYR=yvM7-s6A@mail.gmail.com>
Commit 1aba62ec moved the range check of that option form guc.c into
bufmgr.c, but introduced a bug by changing a >= 0.0 to > 0.0, which made
the value 0 no longer accepted. Put it back.
Reported by Jeff Janes, diagnosed by Tom Lane
Commit 65c5fcd353 broke this by removing a
header include directive that is conditionally required. Add that back
to nbtree.c, with annotation to keep pgrminclude from re-breaking it.
Peter Geoghegan
Report: <CAM3SWZTNjHFYW_UG8bu0BnogqQ2HfsTgkzXLueuUhfTcYbu5HA@mail.gmail.com>
Needed for cases in which INSERT ... ON CONFLICT appears inside a
recursive CTE item. Per bug #14153 from Thomas Alton.
Patch by Peter Geoghegan, slightly adjusted by me
Report: <20160521232802.22598.13537@wrigleys.postgresql.org>
If RAW_EXPRESSION_COVERAGE_TEST is defined, do a no-op tree walk over
every basic DML statement submitted to parse analysis. If we'd had this
in place earlier, bug #14153 would have been caught by buildfarm testing.
The difficulty is that raw_expression_tree_walker() is only used in
limited cases involving CTEs (particularly recursive ones), so it's
very easy for an oversight in it to not be noticed during testing of a
seemingly-unrelated feature.
The type of error we can expect to catch with this is complete omission
of a node type from raw_expression_tree_walker(), and perhaps also
recursion into a field that doesn't contain a node tree, though that
would be an unlikely mistake. It won't catch failure to add new fields
that need to be recursed into, unfortunately.
I'll go enable this on one or two of my own buildfarm animals once
bug #14153 is dealt with.
Discussion: <27861.1464040417@sss.pgh.pa.us>
do_text_output_multiline() would fail (typically with a null pointer
dereference crash) if its input string did not end with a newline. Such
cases do not arise in our current sources; but it certainly could happen
in future, or in extension code's usage of the function, so we should fix
it. To fix, replace "eol += len" with "eol = text + len".
While at it, make two cosmetic improvements: mark the input string const,
and rename the argument from "text" to "txt" to dodge pgindent strangeness
(since "text" is a typedef name).
Even though this problem is only latent at present, it seems like a good
idea to back-patch the fix, since it's a very simple/safe patch and it's
not out of the realm of possibility that we might in future back-patch
something that expects sane behavior from do_text_output_multiline().
Per report from Hao Lee.
Report: <CAGoxFiFPAGyPAJLcFxTB5cGhTW2yOVBDYeqDugYwV4dEd1L_Ag@mail.gmail.com>
Page image should be MAXALIGN'ed because existing code could directly align
pointers in page instead of align offset from beginning of page.
Found during play with indexes as extenstion, Alexander Korotkov and me
Some comments mentioned XLogReplayBuffer, but there's no such function:
that was an interim name for a function that got renamed to
XLogReadBufferForRedo, before commit 2c03216d83 was pushed.
While it could be argued that rejecting system column mentions in the
ON CONFLICT list is an unsupported feature, falling over altogether
just because the table has a unique index on OID is indubitably a bug.
As far as I can tell, fixing infer_arbiter_indexes() is sufficient to
make ON CONFLICT (oid) actually work, though making a regression test
for that case is problematic because of the impossibility of setting
the OID counter to a known value.
Minor cosmetic cleanups along with the bug fix.
subquery_planner() failed to apply expression preprocessing to the
arbiterElems and arbiterWhere fields of an OnConflictExpr. No doubt the
theory was that this wasn't necessary because we don't actually try to
execute those expressions; but that's wrong, because it results in failure
to match to index expressions or index predicates that are changed at all
by preprocessing. Per bug #14132 from Reynold Smith.
Also add pullup_replace_vars processing for onConflictWhere. Perhaps
it's impossible to have a subquery reference there, but I'm not exactly
convinced; and even if true today it's a failure waiting to happen.
Also add some comments to other places where one or another field of
OnConflictExpr is intentionally ignored, with explanation as to why it's
okay to do so.
Also, catalog/dependency.c failed to record any dependency on the named
constraint in ON CONFLICT ON CONSTRAINT, allowing such a constraint to
be dropped while rules exist that depend on it, and allowing pg_dump to
dump such a rule before the constraint it refers to. The normal execution
path managed to error out reasonably for a dangling constraint reference,
but ruleutils.c dumped core; so in addition to fixing the omission, add
a protective check in ruleutils.c, since we can't retroactively add a
dependency in existing databases.
Back-patch to 9.5 where this code was introduced.
Report: <20160510190350.2608.48667@wrigleys.postgresql.org>
The table-skipping logic in autovacuum would fail to consider that
multiple workers could be processing the same shared catalog in
different databases. This normally wouldn't be a problem: firstly
because autovacuum workers not for wraparound would simply ignore tables
in which they cannot acquire lock, and secondly because most of the time
these tables are small enough that even if multiple for-wraparound
workers are stuck in the same catalog, they would be over pretty
quickly. But in cases where the catalogs are severely bloated it could
become a problem.
Backpatch all the way back, because the problem has been there since the
beginning.
Reported by Ondřej Světlík
Discussion: https://www.postgresql.org/message-id/572B63B1.3030603%40flexibee.euhttps://www.postgresql.org/message-id/572A1072.5080308%40flexibee.eu
This patch essentially reverts commit 4c6780fd17, in favor of a much
simpler solution for the case where the new cluster would choose to create
a TOAST table but the old cluster doesn't have one: just don't create a
TOAST table.
The existing code failed in at least two different ways if the situation
arose: (1) ALTER TABLE RESET didn't grab an exclusive lock, so that the
lock sanity check in create_toast_table failed; (2) pg_upgrade did not
provide a pg_type OID for the new toast table, so that the crosscheck in
TypeCreate failed. While both these problems were introduced by later
patches, they show that the hack being used to cause TOAST table creation
is overwhelmingly fragile (and untested). I also note that before the
TypeCreate crosscheck was added, the code would have resulted in assigning
an indeterminate pg_type OID to the toast table, possibly causing a later
OID conflict in that catalog; so that it didn't really work even when
committed.
If we simply don't create a TOAST table, there will only be a problem if
the code tries to store a tuple that's wider than a page, and field
compression isn't sufficient to get it under a page. Given that the TOAST
creation threshold is intended to be about a quarter of a page, it's very
hard to believe that cross-version differences in the do-we-need-a-toast-
table heuristic could result in an observable problem. So let's just
follow the old version's conclusion about whether a TOAST table is needed.
(If we ever do change needs_toast_table() so much that this conclusion
doesn't apply, we can devise a solution at that time, and hopefully do
it in a less klugy way than 4c6780fd17 did.)
Back-patch to 9.3, like the previous patch.
Discussion: <8110.1462291671@sss.pgh.pa.us>
Limit maintenance of time to xid mapping to once per minute. At
least in the tested case this brings performance within 5% of when
the feature is off, compared to several times slower without this
patch.
While there, fix comments and whitespace.
Ants Aasma, with cosmetic adjustments suggested by Andres Freund
Reviewed by Kevin Grittner and Andres Freund
Discussion is still underway as to whether to revert the entire patch
that added this function, but that discussion may not conclude before
beta1. So, in the meantime, let's do at least this much.
David Rowley
This new limit affects both the max_parallel_degree GUC and the
parallel_degree reloption. There may some day be a use case for using
more than 1024 CPUs for a single query, but that's surely not the case
right now. Not only do not very many people have that many CPUs, but
the code hasn't been tested at that kind of scale and is very unlikely
to perform well, or even work at all, without a lot more work. The
issue addressed by commit 06bd458cb8 is
probably just one problem of many.
The idea of a more reasonable limit here was suggested by Tom Lane;
the value of 1024 was suggested by Amit Kapila.
That way, if the result overflows size_t, you'll get an error instead
of undefined behavior, which seems like a plus. This also has the
effect of casting the number of workers from int to Size, which is
better because it's harder to overflow int than size_t.
Dilip Kumar reported this issue and provided a patch upon which this
patch is based, but his version did use mul_size.