Commit Graph

2862 Commits

Author SHA1 Message Date
Tom Lane 959ea7fa76 Remove useless code.
Apparent copy-and-pasteo in standby_desc_invalidations() had two
entries for msg->id == SHAREDINVALRELMAP_ID.

Aleksander Alekseev

Discussion: <20160923090814.GB1238@e733>
2016-09-23 10:44:50 -04:00
Peter Eisentraut ebdf5bf7d1 Delay updating control file to "in production"
Move the updating of the control file to "in production" status until
the point where WAL writes are allowed.  Before, there could be a
significant gap between the control file update and write transactions
actually being allowed.  This makes it more reliable to use the control
status to verify the end of a promotion.

From: Michael Paquier <michael.paquier@gmail.com>
2016-09-21 12:00:00 -04:00
Heikki Linnakangas 45310221a9 Fix outdated comments, GIST search queue is not an RBTree anymore.
The GiST search queue is implemented as a pairing heap rather than as
Red-Black Tree, since 9.5 (commit e7032610). I neglected these comments
in that commit.
2016-09-20 11:38:25 +03:00
Robert Haas 445a38aba2 Have heapam.h include lockdefs.h rather than lock.h.
lockdefs.h was only split from lock.h relatively recently, and
represents a minimal subset of the old lock.h.  heapam.h only needs
that smaller subset, so adjust it to include only that.  This requires
some corresponding adjustments elsewhere.

Peter Geoghegan
2016-09-13 09:21:35 -04:00
Tom Lane 24992c6db9 Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.
The full generality of deleting an arbitrary number of tuples is no longer
needed, so let's save some code and cycles by replacing the original coding
with an implementation based on PageIndexTupleDelete.

We can always get back the old code from git if we need it again for new
callers (though I don't care for its willingness to mess with line pointers
it wasn't told to mess with).

Discussion: <552.1473445163@sss.pgh.pa.us>
2016-09-09 19:00:59 -04:00
Tom Lane b1328d78f8 Invent PageIndexTupleOverwrite, and teach BRIN and GiST to use it.
PageIndexTupleOverwrite performs approximately the same function as
PageIndexTupleDelete (or PageIndexDeleteNoCompact) followed by PageAddItem
targeting the same item pointer offset.  But in the case where the new
tuple is the same size as the old, it avoids shuffling other data around on
the page, because the new tuple is placed where the old one was rather than
being appended to the end of the page.  This has been shown to provide a
substantial speedup for some GiST use-cases.

Also, this change allows some API simplifications: we can get rid of
the rather klugy and error-prone PAI_ALLOW_FAR_OFFSET flag for
PageAddItemExtended, since that was used only to cover a corner case
for BRIN that's better expressed by using PageIndexTupleOverwrite.

Note that this patch causes a rather subtle WAL incompatibility: the
physical page content change represented by certain WAL records is now
different than it was before, because while the tuples have the same
itempointer line numbers, the tuples themselves are in different places.
I have not bumped the WAL version number because I think it doesn't matter
unless you are trying to do bitwise comparisons of original and replayed
pages, and in any case we're early in a devel cycle and there will probably
be more WAL changes before v10 gets out the door.

There is probably room to make use of PageIndexTupleOverwrite in SP-GiST
and GIN too, but that is left for a future patch.

Andrey Borodin, reviewed by Anastasia Lubennikova, whacked around a bit
by me

Discussion: <CAJEAwVGQjGGOj6mMSgMwGvtFd5Kwe6VFAxY=uEPZWMDjzbn4VQ@mail.gmail.com>
2016-09-09 18:02:36 -04:00
Alvaro Herrera 5c609a742f Fix locking a tuple updated by an aborted (sub)transaction
When heap_lock_tuple decides to follow the update chain, it tried to
also lock any version of the tuple that was created by an update that
was subsequently rolled back.  This is pointless, since for all intents
and purposes that tuple exists no more; and moreover it causes
misbehavior, as reported independently by Marko Tiikkaja and Marti
Raudsepp: some SELECT FOR UPDATE/SHARE queries may fail to return
the tuples, and assertion-enabled builds crash.

Fix by having heap_lock_updated_tuple test the xmin and return success
immediately if the tuple was created by an aborted transaction.

The condition where tuples become invisible occurs when an updated tuple
chain is followed by heap_lock_updated_tuple, which reports the problem
as HeapTupleSelfUpdated to its caller heap_lock_tuple, which in turn
propagates that code outwards possibly leading the calling code
(ExecLockRows) to believe that the tuple exists no longer.

Backpatch to 9.3.  Only on 9.5 and newer this leads to a visible
failure, because of commit 27846f02c176; before that, heap_lock_tuple
skips the whole dance when the tuple is already locked by the same
transaction, because of the ancient HeapTupleSatisfiesUpdate behavior.
Still, the buggy condition may also exist in more convoluted scenarios
involving concurrent transactions, so it seems safer to fix the bug in
the old branches too.

Discussion:
	https://www.postgresql.org/message-id/CABRT9RC81YUf1=jsmWopcKJEro=VoeG2ou6sPwyOUTx_qteRsg@mail.gmail.com
	https://www.postgresql.org/message-id/48d3eade-98d3-8b9a-477e-1a8dc32a724d@joh.to
2016-09-09 15:54:29 -03:00
Simon Riggs ec253de1fd Fix corruption of 2PC recovery with subxacts
Reading 2PC state files during recovery was borked, causing corruptions during
recovery. Effect limited to servers with 2PC, subtransactions and
recovery/replication.

Stas Kelvich, reviewed by Michael Paquier and Pavan Deolasee
2016-09-09 11:55:12 +01:00
Simon Riggs 67c6bd1ca3 Fix minor memory leak in Standby startup
StandbyRecoverPreparedTransactions() leaked the buffer
used for two phase state file. This was leaked once
at startup and at every shutdown checkpoint seen.

Backpatch to 9.6

Stas Kelvich
2016-09-08 10:32:58 +01:00
Peter Eisentraut 49eb0fd097 Add location field to DefElem
Add a location field to the DefElem struct, used to parse many utility
commands.  Update various error messages to supply error position
information.

To propogate the error position information in a more systematic way,
create a ParseState in standard_ProcessUtility() and pass that to
interested functions implementing the utility commands.  This seems
better than passing the query string and then reassembling a parse state
ad hoc, which violates the encapsulation of the ParseState type.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2016-09-06 12:00:00 -04:00
Tom Lane 60893786d5 Fix corrupt GIN_SEGMENT_ADDITEMS WAL records on big-endian hardware.
computeLeafRecompressWALData() tried to produce a uint16 WAL log field by
memcpy'ing the first two bytes of an int-sized variable.  That accidentally
works on little-endian hardware, but not at all on big-endian.  Replay then
thinks it's looking at an ADDITEMS action with zero entries, and reads the
first two bytes of the first TID therein as the next segno/action,
typically leading to "unexpected GIN leaf action" errors during replay.
Even if replay failed to crash, the resulting GIN index page would surely
be incorrect.  To fix, just declare the variable as uint16 instead.

Per bug #14295 from Spencer Thomason (much thanks to Spencer for turning
his problem into a self-contained test case).  This likely also explains
a previous report of the same symptom from Bernd Helmle.

Back-patch to 9.4 where the problem was introduced (by commit 14d02f0bb).

Discussion: <20160826072658.15676.7628@wrigleys.postgresql.org>
Possible-Report: <2DA7350F7296B2A142272901@eje.land.credativ.lan>
2016-09-03 13:28:53 -04:00
Simon Riggs 35250b6ad7 New recovery target recovery_target_lsn
Michael Paquier
2016-09-03 17:48:01 +01:00
Heikki Linnakangas 9f85784cae Support multiple iterators in the Red-Black Tree implementation.
While we don't need multiple iterators at the moment, the interface is
nicer and less dangerous this way.

Aleksander Alekseev, with some changes by me.
2016-09-02 08:39:39 +03:00
Tom Lane 0e0f43d6fd Prevent starting a standalone backend with standby_mode on.
This can't really work because standby_mode expects there to be more
WAL arriving, which there will not ever be because there's no WAL
receiver process to fetch it.  Moreover, if standby_mode is on then
hot standby might also be turned on, causing even more strangeness
because that expects read-only sessions to be executing in parallel.
Bernd Helmle reported a case where btree_xlog_delete_get_latestRemovedXid
got confused, but rather than band-aiding individual problems it seems
best to prevent getting anywhere near this state in the first place.
Back-patch to all supported branches.

In passing, also fix some omissions of errcodes in other ereport's in
readRecoveryCommandFile().

Michael Paquier (errcode hacking by me)

Discussion: <00F0B2CEF6D0CEF8A90119D4@eje.credativ.lan>
2016-08-31 08:52:13 -04:00
Alvaro Herrera 8e1e3f958f Split hash.h → hash_xlog.h
Since the hash AM is going to be revamped to have WAL, this is a good
opportunity to clean up the include file a little bit to avoid including
a lot of extra stuff in the future.

Author: Amit Kapila
2016-08-29 18:55:49 -03:00
Fujii Masao bd082231ed Fix typos in comments. 2016-08-29 16:06:40 +09:00
Fujii Masao bab7823a49 Fix pg_xlogdump so that it handles cross-page XLP_FIRST_IS_CONTRECORD record.
Previously pg_xlogdump failed to dump the contents of the WAL file
if the file starts with the continuation WAL record which spans
more than one pages. Since pg_xlogdump assumed that the continuation
record always fits on a page, it could not find the valid WAL record to
start reading from in that case.

This patch changes pg_xlogdump so that it can handle a continuation
WAL record which crosses a page boundary and find the valid record
to start reading from.

Back-patch to 9.3 where pg_xlogdump was introduced.

Author: Pavan Deolasee
Reviewed-By: Michael Paquier and Craig Ringer
Discussion: CABOikdPsPByMiG6J01DKq6om2+BNkxHTPkOyqHM2a4oYwGKsqQ@mail.gmail.com
2016-08-29 14:34:58 +09:00
Tom Lane ea268cdc9a Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters.  While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea.  Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.

While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.

In passing, change TopMemoryContext to use the default allocation
parameters.  Formerly it could only be extended 8K at a time.  That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there.  There seems no good reason
not to let it use growing blocks like most other contexts.

Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain.  The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.

Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 17:50:38 -04:00
Tom Lane 78dcd027e8 Fix potential memory leakage from HandleParallelMessages().
HandleParallelMessages leaked memory into the caller's context.  Since it's
called from ProcessInterrupts, there is basically zero certainty as to what
CurrentMemoryContext is, which means we could be leaking into long-lived
contexts.  Over the processing of many worker messages that would grow to
be a problem.  Things could be even worse than just a leak, if we happened
to service the interrupt while ErrorContext is current: elog.c thinks it
can reset that on its own whim, possibly yanking storage out from under
HandleParallelMessages.

Give HandleParallelMessages its own dedicated context instead, which we can
reset during each call to ensure there's no accumulation of wasted memory.

Discussion: <16610.1472222135@sss.pgh.pa.us>
2016-08-26 15:04:05 -04:00
Tom Lane fbf28b6b52 Fix logic for adding "parallel worker" context line to worker errors.
The previous coding here was capable of adding a "parallel worker" context
line to errors that were not, in fact, returned from a parallel worker.
Instead of using an errcontext callback to add that annotation, just paste
it onto the message by hand; this looks uglier but is more reliable.

Discussion: <19757.1472151987@sss.pgh.pa.us>
2016-08-26 10:07:28 -04:00
Tom Lane ae4760d667 Fix small query-lifespan memory leak in bulk updates.
When there is an identifiable REPLICA IDENTITY index on the target table,
heap_update leaks the id_attrs bitmapset.  That's not many bytes, but it
adds up over enough rows, since the code typically runs in a query-lifespan
context.  Bug introduced in commit e55704d8b, which did a rather poor job
of cloning the existing use-pattern for RelationGetIndexAttrBitmap().

Per bug #14293 from Zhou Digoal.  Back-patch to 9.4 where the bug was
introduced.

Report: <20160824114320.15676.45171@wrigleys.postgresql.org>
2016-08-24 22:20:25 -04:00
Tom Lane 71e006f031 Suppress compiler warnings in non-cassert builds.
With Asserts off, these variables are set but never used, resulting
in warnings from pickier compilers.  Fix that with our standard solution.
Per report from Jeff Janes.
2016-08-23 23:21:10 -04:00
Tom Lane d2ddee63b4 Improve SP-GiST opclass API to better support unlabeled nodes.
Previously, the spgSplitTuple action could only create a new upper tuple
containing a single labeled node.  This made it useless for opclasses
that prefer to work with fixed sets of nodes (labeled or otherwise),
which meant that restrictive prefixes could not be used with such
node definitions.  Change the output field set for the choose() method
to allow it to specify any valid node set for the new upper tuple,
and to specify which of these nodes to place the modified lower tuple in.

In addition to its primary use for fixed node sets, this feature could
allow existing opclasses that use variable node sets to skip a separate
spgAddNode action when splitting a tuple, by setting up the node needed
for the incoming value as part of the spgSplitTuple action.  However, care
would have to be taken to add the extra node only when it would not make
the tuple bigger than before.  (spgAddNode can enlarge the tuple,
spgSplitTuple can't.)

This is a prerequisite for an upcoming SP-GiST inet opclass, but is
being committed separately to increase the visibility of the API change.

In passing, improve the documentation about the traverse-values feature
that was added by commit ccd6eb49a.

Emre Hasegeli, with cosmetic adjustments and documentation rework by me

Discussion: <CAE2gYzxtth9qatW_OAqdOjykS0bxq7AYHLuyAQLPgT7H9ZU0Cw@mail.gmail.com>
2016-08-23 12:10:34 -04:00
Andres Freund 07ef035129 Fix deletion of speculatively inserted TOAST on conflict
INSERT ..  ON CONFLICT runs a pre-check of the possible conflicting
constraints before performing the actual speculative insertion.  In case
the inserted tuple included TOASTed columns the ON CONFLICT condition
would be handled correctly in case the conflict was caught by the
pre-check, but if two transactions entered the speculative insertion
phase at the same time, one would have to re-try, and the code for
aborting a speculative insertion did not handle deleting the
speculatively inserted TOAST datums correctly.

TOAST deletion would fail with "ERROR: attempted to delete invisible
tuple" as we attempted to remove the TOAST tuples using
simple_heap_delete which reasoned that the given tuples should not be
visible to the command that wrote them.

This commit updates the heap_abort_speculative() function which aborts
the conflicting tuple to use itself, via toast_delete, for deleting
associated TOAST datums.  Like before, the inserted toast rows are not
marked as being speculative.

This commit also adds a isolationtester spec test, exercising the
relevant code path. Unfortunately 9.5 cannot handle two waiting
sessions, and thus cannot execute this test.

Reported-By: Viren Negi, Oskari Saarenmaa
Author: Oskari Saarenmaa, edited a bit by me
Bug: #14150
Discussion: <20160519123338.12513.20271@wrigleys.postgresql.org>
Backpatch: 9.5, where ON CONFLICT was introduced
2016-08-17 17:03:36 -07:00
Peter Eisentraut f0fe1c8f70 Fix typos
From: Alexander Law <exclusion@gmail.com>
2016-08-16 14:52:29 -04:00
Tom Lane b5bce6c1ec Final pgindent + perltidy run for 9.6. 2016-08-15 13:42:51 -04:00
Tom Lane ed0097e4f9 Add SQL-accessible functions for inspecting index AM properties.
Per discussion, we should provide such functions to replace the lost
ability to discover AM properties by inspecting pg_am (cf commit
65c5fcd35).  The added functionality is also meant to displace any code
that was looking directly at pg_index.indoption, since we'd rather not
believe that the bit meanings in that field are part of any client API
contract.

As future-proofing, define the SQL API to not assume that properties that
are currently AM-wide or index-wide will remain so unless they logically
must be; instead, expose them only when inquiring about a specific index
or even specific index column.  Also provide the ability for an index
AM to override the behavior.

In passing, document pg_am.amtype, overlooked in commit 473b93287.

Andrew Gierth, with kibitzing by me and others

Discussion: <87mvl5on7n.fsf@news-spur.riddles.org.uk>
2016-08-13 18:31:14 -04:00
Tom Lane e89526d4f3 In B-tree page deletion, clean up properly after page deletion failure.
In _bt_unlink_halfdead_page(), we might fail to find an immediate left
sibling of the target page, perhaps because of corruption of the page
sibling links.  The code intends to cope with this by just abandoning
the deletion attempt; but what actually happens is that it fails outright
due to releasing the same buffer lock twice.  (And error recovery masks
a second problem, which is possible leakage of a pin on another page.)
Seems to have been introduced by careless refactoring in commit efada2b8e.
Since there are multiple cases to consider, let's make releasing the buffer
lock in the failure case the responsibility of _bt_unlink_halfdead_page()
not its caller.

Also, avoid fetching the leaf page's left-link again after we've dropped
lock on the page.  This is probably harmless, but it's not exactly good
coding practice.

Per report from Kyotaro Horiguchi.  Back-patch to 9.4 where the faulty code
was introduced.

Discussion: <20160803.173116.111915228.horiguchi.kyotaro@lab.ntt.co.jp>
2016-08-06 14:28:37 -04:00
Robert Haas 81c766b3fd Change InitToastSnapshot to a macro.
tqual.h is included in some front-end compiles, and a static inline
breaks on buildfarm member castoroides.  Since the macro is never
referenced, it should dodge that problem, although this doesn't
seem like the cleanest way of hiding things from front-end compiles.

Report and review by Tom Lane; patch by me.
2016-08-05 11:58:03 -04:00
Andres Freund e7caacf733 Fix hard to hit race condition in heapam's tuple locking code.
As mentioned in its commit message, eca0f1db left open a race condition,
where a page could be marked all-visible, after the code checked
PageIsAllVisible() to pin the VM, but before the page is locked.  Plug
that hole.

Reviewed-By: Robert Haas, Andres Freund
Author: Amit Kapila
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
2016-08-04 20:07:16 -07:00
Robert Haas 3e2f3c2e42 Prevent "snapshot too old" from trying to return pruned TOAST tuples.
Previously, we tested for MVCC snapshots to see whether they were too
old, but not TOAST snapshots, which can lead to complaints about missing
TOAST chunks if those chunks are subject to early pruning.  Ideally,
the threshold lsn and timestamp for a TOAST snapshot would be that of
the corresponding MVCC snapshot, but since we have no way of deciding
which MVCC snapshot was used to fetch the TOAST pointer, use the oldest
active or registered snapshot instead.

Reported by Andres Freund, who also sketched out what the fix should
look like.  Patch by me, reviewed by Amit Kapila.
2016-08-03 16:50:01 -04:00
Tom Lane b6a97b91ff Block interrupts during HandleParallelMessages().
As noted by Alvaro, there are CHECK_FOR_INTERRUPTS() calls in the shm_mq.c
functions called by HandleParallelMessages().  I believe they're all
unreachable since we always pass nowait = true, but it doesn't seem like
a great idea to assume that no such call will ever be reachable from
HandleParallelMessages().  If that did happen, there would be a risk of a
recursive call to HandleParallelMessages(), which it does not appear to be
designed for --- for example, there's nothing that would prevent
out-of-order processing of received messages.  And certainly such cases
cannot easily be tested.  So let's prevent it by holding off interrupts for
the duration of the function.  Back-patch to 9.5 which contains identical
code.

Discussion: <14869.1470083848@sss.pgh.pa.us>
2016-08-02 16:39:16 -04:00
Tom Lane a5fe473ad7 Minor cleanup for access/transam/parallel.c.
ParallelMessagePending *must* be marked volatile, because it's set
by a signal handler.  On the other hand, it's pointless for
HandleParallelMessageInterrupt to save/restore errno; that must be,
and is, done at the outer level of the SIGUSR1 signal handler.

Calling CHECK_FOR_INTERRUPTS() inside HandleParallelMessages, which itself
is called from CHECK_FOR_INTERRUPTS(), seems both useless and hazardous.
The comment claiming that this is needed to handle the error queue going
away is certainly misguided, in any case.

Improve a couple of error message texts, and use
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE to report loss of parallel worker
connection, since that's what's used in e.g. tqueue.c.  (Maybe it would be
worth inventing a dedicated ERRCODE for this type of failure?  But I do not
think ERRCODE_INTERNAL_ERROR is appropriate.)

Minor stylistic cleanups.
2016-08-01 16:12:01 -04:00
Peter Eisentraut ef5d4a3cfa Message style improvements 2016-07-28 16:34:44 -04:00
Robert Haas 1091402b5a Remove unused structure member.
Michael Paquier
2016-07-21 11:53:44 -04:00
Andres Freund eca0f1db14 Clear all-frozen visibilitymap status when locking tuples.
Since a892234 & fd31cd265 the visibilitymap's freeze bit is used to
avoid vacuuming the whole relation in anti-wraparound vacuums. Doing so
correctly relies on not adding xids to the heap without also unsetting
the visibilitymap flag.  Tuple locking related code has not done so.

To allow selectively resetting all-frozen - to avoid pessimizing
heap_lock_tuple - allow to selectively reset the all-frozen with
visibilitymap_clear(). To avoid having to use
visibilitymap_get_status (e.g. via VM_ALL_FROZEN) inside a critical
section, have visibilitymap_clear() return whether any bits have been
reset.

There's a remaining issue (denoted by XXX): After the PageIsAllVisible()
check in heap_lock_tuple() and heap_lock_updated_tuple_rec() the page
status could theoretically change. Practically that currently seems
impossible, because updaters will hold a page level pin already.  Due to
the next beta coming up, it seems better to get the required WAL magic
bump done before resolving this issue.

The added flags field fields to xl_heap_lock and xl_heap_lock_updated
require bumping the WAL magic. Since there's already been a catversion
bump since the last beta, that's not an issue.

Reviewed-By: Robert Haas, Amit Kapila and Andres Freund
Author: Masahiko Sawada, heavily revised by Andres Freund
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
2016-07-18 02:01:13 -07:00
Tom Lane 9563d5b5e4 Add regression test case exercising the sorting path for hash index build.
We've broken this code path at least twice in the past, so it's prudent
to have a test case that covers it.  To allow exercising the code path
without creating a very large (and slow to run) test case, redefine the
sort threshold to be bounded by maintenance_work_mem as well as the number
of available buffers.  While at it, fix an ancient oversight that when
building a temp index, the number of available buffers is not NBuffers but
NLocBuffer.  Also, if assertions are enabled, apply a direct test that the
sort actually does return the tuples in the expected order.

Peter Geoghegan

Patch: <CAM3SWZTBAo4hjbBd780+MrOKiKp_TMo1N3A0Rw9_im8gbD7fQA@mail.gmail.com>
2016-07-16 15:30:15 -04:00
Andres Freund bfa2ab56bb Fix torn-page, unlogged xid and further risks from heap_update().
When heap_update needs to look for a page for the new tuple version,
because the current one doesn't have sufficient free space, or when
columns have to be processed by the tuple toaster, it has to release the
lock on the old page during that. Otherwise there'd be lock ordering and
lock nesting issues.

To avoid concurrent sessions from trying to update / delete / lock the
tuple while the page's content lock is released, the tuple's xmax is set
to the current session's xid.

That unfortunately was done without any WAL logging, thereby violating
the rule that no XIDs may appear on disk, without an according WAL
record.  If the database were to crash / fail over when the page level
lock is released, and some activity lead to the page being written out
to disk, the xid could end up being reused; potentially leading to the
row becoming invisible.

There might be additional risks by not having t_ctid point at the tuple
itself, without having set the appropriate lock infomask fields.

To fix, compute the appropriate xmax/infomask combination for locking
the tuple, and perform WAL logging using the existing XLOG_HEAP_LOCK
record. That allows the fix to be backpatched.

This issue has existed for a long time. There appears to have been
partial attempts at preventing dangers, but these never have fully been
implemented, and were removed a long time ago, in
11919160 (cf. HEAP_XMAX_UNLOGGED).

In master / 9.6, there's an additional issue, namely that the
visibilitymap's freeze bit isn't reset at that point yet. Since that's a
new issue, introduced only in a892234f83, that'll be fixed in a
separate commit.

Author: Masahiko Sawada and Andres Freund
Reported-By: Different aspects by Thomas Munro, Noah Misch, and others
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: 9.1/all supported versions
2016-07-15 17:49:48 -07:00
Andres Freund a4d357bfbd Make HEAP_LOCK/HEAP2_LOCK_UPDATED replay reset HEAP_XMAX_INVALID.
0ac5ad5 started to compress infomask bits in WAL records. Unfortunately
the replay routines for XLOG_HEAP_LOCK/XLOG_HEAP2_LOCK_UPDATED forgot to
reset the HEAP_XMAX_INVALID (and some other) hint bits.

Luckily that's not problematic in the majority of cases, because after a
crash/on a standby row locks aren't meaningful. Unfortunately that does
not hold true in the presence of prepared transactions. This means that
after a crash, or after promotion, row level locks held by a prepared,
but not yet committed, prepared transaction might not be enforced.

Discussion: 20160715192319.ubfuzim4zv3rqnxv@alap3.anarazel.de
Backpatch: 9.3, the oldest branch on which 0ac5ad5 is present.
2016-07-15 14:45:37 -07:00
Alvaro Herrera 533e9c6b06 Avoid serializability errors when locking a tuple with a committed update
When key-share locking a tuple that has been not-key-updated, and the
update is a committed transaction, in some cases we raised
serializability errors:
    ERROR:  could not serialize access due to concurrent update

Because the key-share doesn't conflict with the update, the error is
unnecessary and inconsistent with the case that the update hasn't
committed yet.  This causes problems for some usage patterns, even if it
can be claimed that it's sufficient to retry the aborted transaction:
given a steady stream of updating transactions and a long locking
transaction, the long transaction can be starved indefinitely despite
multiple retries.

To fix, we recognize that HeapTupleSatisfiesUpdate can return
HeapTupleUpdated when an updating transaction has committed, and that we
need to deal with that case exactly as if it were a non-committed
update: verify whether the two operations conflict, and if not, carry on
normally.  If they do conflict, however, there is a difference: in the
HeapTupleBeingUpdated case we can just sleep until the concurrent
transaction is gone, while in the HeapTupleUpdated case this is not
possible and we must raise an error instead.

Per trouble report from Olivier Dony.

In addition to a couple of test cases that verify the changed behavior,
I added a test case to verify the behavior that remains unchanged,
namely that errors are raised when a update that modifies the key is
used.  That must still generate serializability errors.  One
pre-existing test case changes behavior; per discussion, the new
behavior is actually the desired one.

Discussion: https://www.postgresql.org/message-id/560AA479.4080807@odoo.com
  https://www.postgresql.org/message-id/20151014164844.3019.25750@wrigleys.postgresql.org

Backpatch to 9.3, where the problem appeared.
2016-07-15 14:17:20 -04:00
Tom Lane 1acf757255 Fix GiST index build for NaN values in geometric types.
GiST index build could go into an infinite loop when presented with boxes
(or points, circles or polygons) containing NaN component values.  This
happened essentially because the code assumed that x == x is true for any
"double" value x; but it's not true for NaNs.  The looping behavior was not
the only problem though: we also attempted to sort the items using simple
double comparisons.  Since NaNs violate the trichotomy law, qsort could
(in principle at least) get arbitrarily confused and mess up the sorting of
ordinary values as well as NaNs.  And we based splitting choices on box size
calculations that could produce NaNs, again resulting in undesirable
behavior.

To fix, replace all comparisons of doubles in this logic with
float8_cmp_internal, which is NaN-aware and is careful to sort NaNs
consistently, higher than any non-NaN.  Also rearrange the box size
calculation to not produce NaNs; instead it should produce an infinity
for a box with NaN on one side and not-NaN on the other.

I don't by any means claim that this solves all problems with NaNs in
geometric values, but it should at least make GiST index insertion work
reliably with such data.  It's likely that the index search side of things
still needs some work, and probably regular geometric operations too.
But with this patch we're laying down a convention for how such cases
ought to behave.

Per bug #14238 from Guang-Dih Lei.  Back-patch to 9.2; the code used before
commit 7f3bd86843 is quite different and doesn't lock up on my simple
test case, nor on the submitter's dataset.

Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
Discussion: <28685.1468246504@sss.pgh.pa.us>
2016-07-14 18:45:59 -04:00
Magnus Hagander 87d84d67bb Fix start WAL filename for concurrent backups from standby
On a standby, ThisTimelineID is always 0, so we would generate a
filename in timeline 0 even for other timelines. Instead, use starttli
which we have retreived from the controlfile.

Report by: Francesco Canovai in bug #14230
Author: Marco Nenciarini
Reviewed by: Michael Paquier and Amit Kapila
2016-07-11 12:02:31 +02:00
Robert Haas 10c0558ffe Fix several mistakes around parallel workers and client_encoding.
Previously, workers sent data to the leader using the client encoding.
That mostly worked, but the leader the converted the data back to the
server encoding.  Since not all encoding conversions are reversible,
that could provoke failures.  Fix by using the database encoding for
all communication between worker and leader.

Also, while temporary changes to GUC settings, as from the SET clause
of a function, are in general OK for parallel query, changing
client_encoding this way inside of a parallel worker is not OK.
Previously, that would have confused the leader; with these changes,
it would not confuse the leader, but it wouldn't do anything either.
So refuse such changes in parallel workers.

Also, the previous code naively assumed that when it received a
NotifyResonse from the worker, it could pass that directly back to the
user.  But now that worker-to-leader communication always uses the
database encoding, that's clearly no longer correct - though,
actually, the old way was always broken for V2 clients.  So
disassemble and reconstitute the message instead.

Issues reported by Peter Eisentraut.  Patch by me, reviewed by
Peter Eisentraut.
2016-06-30 18:35:32 -04:00
Alvaro Herrera b78364df18 Remove unused arguments in two GiST subroutines
These arguments became unused in commit 2c03216d83.  Noticed while
skimming code for unrelated development.

This is cosmetic, so no backpatch.
2016-06-28 16:01:13 -04:00
Alvaro Herrera e3ad3ffa68 Fix handling of multixacts predating pg_upgrade
After pg_upgrade, it is possible that some tuples' Xmax have multixacts
corresponding to the old installation; such multixacts cannot have
running members anymore.  In many code sites we already know not to read
them and clobber them silently, but at least when VACUUM tries to freeze
a multixact or determine whether one needs freezing, there's an attempt
to resolve it to its member transactions by calling GetMultiXactIdMembers,
and if the multixact value is "in the future" with regards to the
current valid multixact range, an error like this is raised:
    ERROR:  MultiXactId 123 has not been created yet -- apparent wraparound
and vacuuming fails.  Per discussion with Andrew Gierth, it is completely
bogus to try to resolve multixacts coming from before a pg_upgrade,
regardless of where they stand with regards to the current valid
multixact range.

It's possible to get from under this problem by doing SELECT FOR UPDATE
of the problem tuples, but if tables are large, this is slow and
tedious, so a more thorough solution is desirable.

To fix, we realize that multixacts in xmax created in 9.2 and previous
have a specific bit pattern that is never used in 9.3 and later (we
already knew this, per comments and infomask tests sprinkled in various
places, but we weren't leveraging this knowledge appropriately).
Whenever the infomask of the tuple matches that bit pattern, we just
ignore the multixact completely as if Xmax wasn't set; or, in the case
of tuple freezing, we act as if an unwanted value is set and clobber it
without decoding.  This guarantees that no errors will be raised, and
that the values will be progressively removed until all tables are
clean.  Most callers of GetMultiXactIdMembers are patched to recognize
directly that the value is a removable "empty" multixact and avoid
calling GetMultiXactIdMembers altogether.

To avoid changing the signature of GetMultiXactIdMembers() in back
branches, we keep the "allow_old" boolean flag but rename it to
"from_pgupgrade"; if the flag is true, we always return an empty set
instead of looking up the multixact.  (I suppose we could remove the
argument in the master branch, but I chose not to do so in this commit).

This was broken all along, but the error-facing message appeared first
because of commit 8e9a16ab8f and was partially fixed in a25c2b7c4d.
This fix, backpatched all the way back to 9.3, goes approximately in the
same direction as a25c2b7c4d but should cover all cases.

Bug analysis by Andrew Gierth and Álvaro Herrera.

A number of public reports match this bug:
  https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.net
  https://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.com
  https://www.postgresql.org/message-id/556439CF.7070109@pscs.co.uk
  https://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.com
  https://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org
2016-06-24 18:29:28 -04:00
Tom Lane 8cf739de85 Fix building of large (bigger than shared_buffers) hash indexes.
When the index is predicted to need more than NBuffers buckets,
CREATE INDEX attempts to sort the index entries by hash key before
insertion, so as to reduce thrashing.  This code path got broken by
commit 9f03ca9151, which overlooked that _hash_form_tuple() is not
just an alias for index_form_tuple().  The index got built anyway, but
with garbage data, so that searches for pre-existing tuples always
failed.  Fix by refactoring to separate construction of the indexable
data from calling index_form_tuple().

Per bug #14210 from Daniel Newman.  Back-patch to 9.5 where the
bug was introduced.

Report: <20160623162507.17237.39471@wrigleys.postgresql.org>
2016-06-24 16:57:36 -04:00
Alvaro Herrera 1ca4a1b5d2 Finish up XLOG_HINT renaming
Commit b8fd1a09f3 renamed XLOG_HINT to XLOG_FPI, but neglected two
places.

Backpatch to 9.3, like that commit.
2016-06-17 18:05:55 -04:00
Robert Haas 71d05a2c7b pg_visibility: Add pg_truncate_visibility_map function.
This requires some core changes as well so that we can properly
WAL-log the truncation.  Specifically, it changes the format of the
XLOG_SMGR_TRUNCATE WAL record, so bump XLOG_PAGE_MAGIC.

Patch by me, reviewed but not fully endorsed by Andres Freund.
2016-06-17 17:37:30 -04:00
Robert Haas 9c188a8454 Fix typo.
Thomas Munro
2016-06-17 12:55:30 -04:00
Robert Haas 292794f82b Remove PID from 'parallel worker' context message.
Discussion: <bfd204ab-ab1a-792a-b345-0274a09a4b5f@2ndquadrant.com>
2016-06-17 09:26:17 -04:00