Commit Graph

2332 Commits

Author SHA1 Message Date
Alvaro Herrera f88d4cfc9d Setup error context callback for transaction lock waits
With this in place, a session blocking behind another one because of
tuple locks will get a context line mentioning the relation name, tuple
TID, and operation being done on tuple.  For example:

LOG:  process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms
DETAIL:  Process holding the lock: 11366. Wait queue: 11367.
CONTEXT:  while updating tuple (0,2) in relation "foo"
STATEMENT:  UPDATE foo SET value = 3;

Most usefully, the new line is displayed by log entries due to
log_lock_waits, although of course it will be printed by any other log
message as well.

Author: Christian Kruse, some tweaks by Álvaro Herrera
Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
2014-03-19 15:10:36 -03:00
Heikki Linnakangas 59a5ab3f42 Remove rm_safe_restartpoint machinery.
It is no longer used, none of the resource managers have multi-record
actions that would make it unsafe to perform a restartpoint.

Also don't allow rm_cleanup to write WAL records, it's also no longer
required. Move the call to rm_cleanup routines to make it more symmetric
with rm_startup.
2014-03-18 22:10:35 +02:00
Heikki Linnakangas 40dae7ec53 Make the handling of interrupted B-tree page splits more robust.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.

Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.

I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.

Reviewed by Peter Geoghegan.
2014-03-18 20:50:44 +02:00
Heikki Linnakangas d663d4399e Fix thinko: have trueTriConsistentFn return GIN_TRUE.
While we're at it, also improve comments in ginlogic.c.
2014-03-17 17:29:04 +02:00
Fujii Masao 2bccced110 Fix typos in comments.
Thom Brown
2014-03-17 20:47:28 +09:00
Heikki Linnakangas efada2b8e9 Fix race condition in B-tree page deletion.
In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.
2014-03-14 16:07:19 +02:00
Bruce Momjian 886c0be3f6 C comments: remove odd blank lines after #ifdef WIN32 lines 2014-03-13 01:34:42 -04:00
Heikki Linnakangas a3115f0d9e Only WAL-log the modified portion in an UPDATE, if possible.
When a row is updated, and the new tuple version is put on the same page as
the old one, only WAL-log the part of the new tuple that's not identical to
the old. This saves significantly on the amount of WAL that needs to be
written, in the common case that most fields are not modified.

Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.
2014-03-12 23:28:36 +02:00
Heikki Linnakangas c5608ea26a Allow opclasses to provide tri-valued GIN consistent functions.
With the GIN "fast scan" feature, GIN can skip items without fetching all
the keys for them, if it can prove that they don't match regardless of
those keys. So far, it has done the proving by calling the boolean
consistent function with all combinations of TRUE/FALSE for the unfetched
keys, but since that's O(n^2), it becomes unfeasible with more than a few
keys. We can avoid calling consistent with all the combinations, if we can
tell the operator class implementation directly which keys are unknown.

This commit includes a triConsistent function for the built-in array and
tsvector opclasses.

Alexander Korotkov, with some changes by me.
2014-03-12 17:51:30 +02:00
Heikki Linnakangas fecfc2b913 In WAL replay, restore GIN metapage unconditionally to avoid torn page.
We don't take a full-page image of the GIN metapage; instead, the WAL record
contains all the information required to reconstruct it from scratch. But
to avoid torn page hazards, we must re-initialize it from the WAL record
every time, even if it already has a greater LSN, similar to how normal full
page images are restored.

This was highly unlikely to cause any problems in practice, because the GIN
metapage is small. We rely on an update smaller than a 512 byte disk sector
to be atomic elsewhere, at least in pg_control. But better safe than sorry,
and this would be easy to overlook if more fields are added to the metapage
so that it's no longer small.

Reported by Noah Misch. Backpatch to all supported versions.
2014-03-12 10:04:57 +02:00
Heikki Linnakangas 55566c9a74 Fix dangling smgr_owner pointer when a fake relcache entry is freed.
A fake relcache entry can "own" a SmgrRelation object, like a regular
relcache entry. But when it was free'd, the owner field in SmgrRelation
was not cleared, so it was left pointing to free'd memory.

Amazingly this apparently hasn't caused crashes in practice, or we would've
heard about it earlier. Andres found this with Valgrind.

Report and fix by Andres Freund, with minor modifications by me. Backpatch
to all supported versions.
2014-03-07 13:28:52 +02:00
Heikki Linnakangas 956685f82b Do wal_level and hot standby checks when doing crash-then-archive recovery.
CheckRequiredParameterValues() should perform the checks if archive recovery
was requested, even if we are going to perform crash recovery first.

Reported by Kyotaro HORIGUCHI. Backpatch to 9.2, like the crash-then-archive
recovery mode.
2014-03-05 14:48:14 +02:00
Heikki Linnakangas af246c37c0 Fix lastReplayedEndRecPtr calculation when starting from shutdown checkpoint.
When entering crash recovery followed by archive recovery, and the latest
checkpoint is a shutdown checkpoint, and there are no more WAL records to
replay before transitioning from crash to archive recovery, we would not
immediately allow read-only connections in hot standby mode even if we
could. That's because when starting from a shutdown checkpoint, we set
lastReplayedEndRecPtr incorrectly to the record before the checkpoint
record, instead of the checkpoint record itself. We don't run the redo
routine of the shutdown checkpoint record, but starting recovery from it
goes through the same motions, so it should be considered as replayed.

Reported by Kyotaro HORIGUCHI. All versions with hot standby are affected,
so backpatch to 9.0.
2014-03-05 13:51:19 +02:00
Robert Haas b89e151054 Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables.  The output format is controlled by a
so-called "output plugin"; an example is included.  To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.

Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.

Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 16:32:18 -05:00
Heikki Linnakangas d8a42b150f Remove bogus while-loop.
Commit abf5c5c9a4 added a bogus while-
statement after the for(;;)-loop. It went unnoticed in testing, because
it was dead code.

Report by KONDO Mitsumasa. Backpatch to 9.3. The commit that introduced
this was also applied to 9.2, but not the bogus while-loop part, because
the code in 9.2 looks quite different.
2014-02-28 13:33:41 +02:00
Alvaro Herrera 6bfa88acd3 Fix WAL replay of locking an updated tuple
We were resetting the tuple's HEAP_HOT_UPDATED flag as well as t_ctid on
WAL replay of a tuple-lock operation, which is incorrect when the tuple
is already updated.

Back-patch to 9.3.  The clearing of both header elements was there
previously, but since no update could be present on a tuple that was
being locked, it was harmless.

Bug reported by Peter Geoghegan and Greg Stark in
CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com and
CAM-w4HPTOeMT4KP0OJK+mGgzgcTOtLRTvFZyvD0O4aH-7dxo3Q@mail.gmail.com
respectively; diagnosis by Andres Freund.
2014-02-27 11:13:39 -03:00
Heikki Linnakangas 00976f202c btbuild no longer calls _bt_doinsert(), update comment.
Peter Geoghegan
2014-02-26 18:49:04 +02:00
Heikki Linnakangas 8f09ca436d Improve comment on setting data_checksum GUC.
There was an extra space there, and "fixed" wasn't very descriptive.
2014-02-20 10:58:30 +02:00
Robert Haas 6f289c2b7d Switch various builtin functions to use pg_lsn instead of text.
The functions in slotfuncs.c don't exist in any released version,
but the changes to xlogfuncs.c represent backward-incompatibilities.
Per discussion, we're hoping that the queries using these functions
are few enough and simple enough that this won't cause too much
breakage for users.

Michael Paquier, reviewed by Andres Freund and further modified
by me.
2014-02-19 11:37:43 -05:00
Heikki Linnakangas 057152b37c Fix comment; checkpointer, not bgwriter, performs checkpoints since 9.2.
Amit Langote
2014-02-18 09:48:18 +02:00
Tom Lane 01824385ae Prevent potential overruns of fixed-size buffers.
Coverity identified a number of places in which it couldn't prove that a
string being copied into a fixed-size buffer would fit.  We believe that
most, perhaps all of these are in fact safe, or are copying data that is
coming from a trusted source so that any overrun is not really a security
issue.  Nonetheless it seems prudent to forestall any risk by using
strlcpy() and similar functions.

Fixes by Peter Eisentraut and Jozef Mlich based on Coverity reports.

In addition, fix a potential null-pointer-dereference crash in
contrib/chkpass.  The crypt(3) function is defined to return NULL on
failure, but chkpass.c didn't check for that before using the result.
The main practical case in which this could be an issue is if libc is
configured to refuse to execute unapproved hashing algorithms (e.g.,
"FIPS mode").  This ideally should've been a separate commit, but
since it touches code adjacent to one of the buffer overrun changes,
I included it in this commit to avoid last-minute merge issues.
This issue was reported by Honza Horak.

Security: CVE-2014-0065 for buffer overruns, CVE-2014-0066 for crypt()
2014-02-17 11:20:21 -05:00
Heikki Linnakangas 4d894b41cd Change the order that pg_xlog and WAL archive are polled for WAL segments.
If there is a WAL segment with same ID but different TLI present in both
the WAL archive and pg_xlog, prefer the one with higher TLI. Before this
patch, the archive was polled first, for all expected TLIs, and only if no
file was found was pg_xlog scanned. This was a change in behavior from 9.3,
which first scanned archive and pg_xlog for the highest TLI, then archive
and pg_xlog for the next highest TLI and so forth. This patch reverts the
behavior back to what it was in 9.2.

The reason for this is that if for example you try to do archive recovery
to timeline 2, which branched off timeline 1, but the WAL for timeline 2 is
not archived yet, we would replay past the timeline switch point on
timeline 1 using the archived files, before even looking timeline 2's files
in pg_xlog

Report and patch by Kyotaro Horiguchi. Backpatch to 9.3 where the behavior
was changed.
2014-02-14 15:15:09 +02:00
Alvaro Herrera 801c2dc72c Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.

Therefore, we now have multixact-specific freezing parameters:

vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)

vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids).  Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.

autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids).  This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.

Backpatch to 9.3 where multixacts were made to persist enough to require
freezing.  To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.

Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 19:36:31 -03:00
Tom Lane 6f2aead1ff In XLogReadBufferExtended, don't assume P_NEW yields consecutive pages.
In a database that's not yet reached consistency, it's possible that some
segments of a relation are not full-size but are not the last ones either.
Because of the way smgrnblocks() works, asking for a new page with P_NEW
will fill in the last not-full-size segment --- and if that makes it full
size, the apparent EOF of the relation will increase by more than one page,
so that the next P_NEW request will yield a page past the next consecutive
one.  This breaks the relation-extension logic in XLogReadBufferExtended,
possibly allowing a page update to be applied to some page far past where
it was intended to go.  This appears to be the explanation for reports of
table bloat on replication slaves compared to their masters, and probably
explains some corrupted-slave reports as well.

Fix the loop to check the page number it actually got, rather than merely
Assert()'ing that dead reckoning got it to the desired place.  AFAICT,
there are no other places that make assumptions about exactly which page
they'll get from P_NEW.

Problem identified by Greg Stark, though this is not the same as his
proposed patch.

It's been like this for a long time, so back-patch to all supported
branches.
2014-02-12 14:52:16 -05:00
Heikki Linnakangas d699ba4134 Fix WakeupWaiters() to not wake up an exclusive locker unnecessarily.
WakeupWaiters() is supposed to wake up all LW_WAIT_UNTIL_FREE waiters of
the slot, but the loop incorrectly also woke up the first LW_EXCLUSIVE
waiter, if there was no LW_WAIT_UNTIL_FREE waiters in the queue.

Noted by Andres Freund. This code is new in 9.4, so no backpatching.
2014-02-10 15:18:18 +02:00
Heikki Linnakangas 6aa2bdf6a0 Initialize the entryRes array between each call to triConsistent.
The shimTriConstistentFn, which calls the opclass's consistent function with
all combinations of TRUE/FALSE for any MAYBE argument, modifies the entryRes
array passed by the caller. Change startScanKey to re-initialize it between
each call to accommodate that.

It's actually a bad habit by shimTriConsistentFn to modify its argument. But
the only caller that doesn't already re-initialize the entryRes array was
startScanKey, and it's easy for startScanKey to do so. Add a comment to
shimTriConsistentFn about that.

Note: this does not give a free pass to opclass-provided consistent
functions to modify the entryRes argument; shimTriConsistent assumes that
they don't, even though it does it itself.

While at it, refactor startScanKey to allocate the requiredEntries and
additionalEntries after it knows exactly how large they need to be. Saves a
little bit of memory, and looks nicer anyway.

Per complaint by Tom Lane, buildfarm and the pg_trgm regression test.
2014-02-07 18:53:31 +02:00
Heikki Linnakangas dbc649fd77 Speed up "rare & frequent" type GIN queries.
If you have a GIN query like "rare & frequent", we currently fetch all the
items that match either rare or frequent, call the consistent function for
each item, and let the consistent function filter out items that only match
one of the terms. However, if we can deduce that "rare" must be present for
the overall qual to be true, we can scan all the rare items, and for each
rare item, skip over to the next frequent item with the same or greater TID.
That greatly speeds up "rare & frequent" type queries.

To implement that, introduce the concept of a tri-state consistent function,
where the 3rd value is MAYBE, indicating that we don't know if that term is
present. Operator classes only provide a boolean consistent function, so we
simulate the tri-state consistent function by calling the boolean function
several times, with the MAYBE arguments set to all combinations of TRUE and
FALSE. Testing all combinations is only feasible for a small number of MAYBE
arguments, but it is envisioned that we'll provide a way for operator
classes to provide a native tri-state consistent function, which can be much
more efficient. But that is not included in this patch.

We were already using that trick to for lossy pages, calling the consistent
function with the lossy entry set to TRUE and FALSE. Now that we have the
tri-state consistent function, use it for lossy pages too.

Alexander Korotkov, with fair amount of refactoring by me.
2014-02-07 15:22:48 +02:00
Tom Lane ac8bc3b6e4 Remove unnecessary relcache flushes after changing btree metapages.
These flushes were added in my commit d2896a9ed, which added the btree
logic that keeps a cached copy of the index metapage data in index relcache
entries.  The idea was to ensure that other backends would promptly update
their cached copies after a change.  However, this is not really necessary,
since _bt_getroot() has adequate defenses against believing a stale root
page link, and _bt_getrootheight() doesn't have to be 100% right.
Moreover, if it were necessary, a relcache flush would be an unreliable way
to do it, since the sinval mechanism believes that relcache flush requests
represent transactional updates, and therefore discards them on transaction
rollback.  Therefore, we might as well drop these flush requests and save
the time to rebuild the whole relcache entry after a metapage change.

If we ever try to support in-place truncation of btree indexes, it might
be necessary to revisit this issue so that _bt_getroot() can't get caught
by trying to follow a metapage link to a page that no longer exists.
A possible solution to that is to make use of an smgr, rather than
relcache, inval request to force other backends to discard their cached
metapages.  But for the moment this is not worth pursuing.
2014-02-05 13:43:46 -05:00
Fujii Masao 0753bdb352 Add primary_slotname to recovery.conf.sample. 2014-02-03 00:41:50 +09:00
Robert Haas 858ec11858 Introduce replication slots.
Replication slots are a crash-safe data structure which can be created
on either a master or a standby to prevent premature removal of
write-ahead log segments needed by a standby, as well as (with
hot_standby_feedback=on) pruning of tuples whose removal would cause
replication conflicts.  Slots have some advantages over existing
techniques, as explained in the documentation.

In a few places, we refer to the type of replication slots introduced
by this patch as "physical" slots, because forthcoming patches for
logical decoding will also have slots, but with somewhat different
properties.

Andres Freund and Robert Haas
2014-01-31 22:45:36 -05:00
Heikki Linnakangas 626a120656 Further optimize GIN multi-key searches.
When skipping over some items in a posting tree, re-find the new location
by descending the tree from root, rather than walking the right links.
This can save a lot of I/O.

Heavily modified from Alexander Korotkov's fast scan patch.
2014-01-29 21:24:38 +02:00
Heikki Linnakangas 25b1dafab6 Further optimize multi-key GIN searches.
If we're skipping past a certain TID, avoid decoding posting list segments
that only contain smaller TIDs.

Extracted from Alexander Korotkov's fast scan patch, heavily modified.
2014-01-29 18:26:40 +02:00
Heikki Linnakangas e20c70cb0f Allow skipping some items in a multi-key GIN search.
In a multi-key search, ie. something like "col @> 'foo' AND col @> 'bar'",
as soon as we find the next item that matches the first criteria, we don't
need to check the second criteria for TIDs smaller the first match. That
saves a lot of effort, especially if one of the terms is rare, while the
second occurs very frequently.

Based on ideas from Alexander Korotkov's fast scan patch.
2014-01-29 17:53:39 +02:00
Bruce Momjian c871e8f53b Revert C comment change in slot_attisnull()
Revert 89774b58b0
2014-01-28 12:28:14 -05:00
Robert Haas ea9df812d8 Relax the requirement that all lwlocks be stored in a single array.
This makes it possible to store lwlocks as part of some other data
structure in the main shared memory segment, or in a dynamic shared
memory segment.  There is still a main LWLock array and this patch does
not move anything out of it, but it provides necessary infrastructure
for doing that in the future.

This change is likely to increase the size of LWLockPadded on some
platforms, especially 32-bit platforms where it was previously only
16 bytes.

Patch by me.  Review by Andres Freund and KaiGai Kohei.
2014-01-27 11:07:44 -05:00
Bruce Momjian 89774b58b0 Adjust C comment in slot_attisnull() regarding nulls. 2014-01-25 16:43:36 -05:00
Heikki Linnakangas 71c6a8e375 Add recovery_target='immediate' option.
This allows ending recovery as a consistent state has been reached. Without
this, there was no easy way to e.g restore an online backup, without
replaying any extra WAL after the backup ended.

MauMau and me.
2014-01-25 17:34:04 +02:00
Heikki Linnakangas d150ff5781 Reset unused fields in GIN data leaf page footer.
The maxoff field is not used in the new, compressed page format. Let's
reset it when converting an old-format page to the new format. The code
won't care either way, but this makes it possible to use the field for
something else in the future.
2014-01-24 19:10:10 +02:00
Heikki Linnakangas a8f374849f Fix off-by-one in newly-introdcued GIN assertion.
Spotted by Alexander Korotkov
2014-01-24 11:10:09 +02:00
Heikki Linnakangas 398cf255ad In GIN recompression code, use mmemove rather than memcpy, for vacuum.
When vacuuming a data leaf page, any compressed posting lists that are not
modified, are copied back to the buffer from a later location in the same
buffer rather than from  a palloc'd copy. IOW, they are just moved
downwards in the same buffer. Because the source and destination addresses
can overlap, we must use memmove rather than memcpy.

Report and fix by Alexander Korotkov.
2014-01-24 10:48:45 +02:00
Tom Lane ac4ef637ad Allow use of "z" flag in our printf calls, and use it where appropriate.
Since C99, it's been standard for printf and friends to accept a "z" size
modifier, meaning "whatever size size_t has".  Up to now we've generally
dealt with printing size_t values by explicitly casting them to unsigned
long and using the "l" modifier; but this is really the wrong thing on
platforms where pointers are wider than longs (such as Win64).  So let's
start using "z" instead.  To ensure we can do that on all platforms, teach
src/port/snprintf.c to understand "z", and add a configure test to force
use of that implementation when the platform's version doesn't handle "z".

Having done that, modify a bunch of places that were using the
unsigned-long hack to use "z" instead.  This patch doesn't pretend to have
gotten everyplace that could benefit, but it catches many of them.  I made
an effort in particular to ensure that all uses of the same error message
text were updated together, so as not to increase the number of
translatable strings.

It's possible that this change will result in format-string warnings from
pre-C99 compilers.  We might have to reconsider if there are any popular
compilers that will warn about this; but let's start by seeing what the
buildfarm thinks.

Andres Freund, with a little additional work by me
2014-01-23 17:18:33 -05:00
Heikki Linnakangas ec8f692c3c Fix alignment of GIN in-line posting lists stored in entry tuples.
The Sparc machines in the buildfarm are crashing because of misaligned
access to posting lists stored in entry tuples.

I accidentally removed a critical SHORTALIGN() from ginFormTuple, as part
of the packed posting lists patch. Perhaps I thought it was unnecessary,
because the index_form_tuple() call above the SHORTALIGN already aligned
the size, missing the fact that the null-category byte makes it misaligned
again (I think the SHORTALIGN is indeed unnecessary if there's no null-
category byte, but let's just play it safe...)
2014-01-23 22:58:12 +02:00
Heikki Linnakangas 0fdb2f7d7c Silence compiler warning.
Not all compilers understand that elog(ERROR, ...) never returns.
2014-01-23 22:15:31 +02:00
Heikki Linnakangas 6668ad1d70 Fix declaration of GinVacuumState.
gcc 4.8 was happy with having a duplicate typedef, but most compilers seem not
to be, per buildfarm.
2014-01-22 19:55:36 +02:00
Heikki Linnakangas 36a35c550a Compress GIN posting lists, for smaller index size.
GIN posting lists are now encoded using varbyte-encoding, which allows them
to fit in much smaller space than the straight ItemPointer array format used
before. The new encoding is used for both the lists stored in-line in entry
tree items, and in posting tree leaf pages.

To maintain backwards-compatibility and keep pg_upgrade working, the code
can still read old-style pages and tuples. Posting tree leaf pages in the
new format are flagged with GIN_COMPRESSED flag, to distinguish old and new
format pages. Likewise, entry tree tuples in the new format have a
GIN_ITUP_COMPRESSED flag set in a bit that was previously unused.

This patch bumps GIN_CURRENT_VERSION from 1 to 2. New indexes created with
version 9.4 will therefore have version number 2 in the metapage, while old
pg_upgraded indexes will have version 1. The code treats them the same, but
it might be come handy in the future, if we want to drop support for the
uncompressed format.

Alexander Korotkov and me. Reviewed by Tomas Vondra and Amit Langote.
2014-01-22 19:20:58 +02:00
Robert Haas d02c0ddb15 Fix missing parentheses resulting in wrong order of dereference.
This could result in referencing uninitialized memory.

Michael Paquier, in response to a complaint from Andres Freund
2014-01-15 11:00:50 -05:00
Tom Lane 061b079f89 Fix multiple bugs in index page locking during hot-standby WAL replay.
In ordinary operation, VACUUM must be careful to take a cleanup lock on
each leaf page of a btree index; this ensures that no indexscans could
still be "in flight" to heap tuples due to be deleted.  (Because of
possible index-tuple motion due to concurrent page splits, it's not enough
to lock only the pages we're deleting index tuples from.)  In Hot Standby,
the WAL replay process must likewise lock every leaf page.  There were
several bugs in the code for that:

* The replay scan might come across unused, all-zero pages in the index.
While btree_xlog_vacuum itself did the right thing (ie, nothing) with
such pages, xlogutils.c supposed that such pages must be corrupt and
would throw an error.  This accounts for various reports of replication
failures with "PANIC: WAL contains references to invalid pages".  To
fix, add a ReadBufferMode value that instructs XLogReadBufferExtended
not to complain when we're doing this.

* btree_xlog_vacuum performed the extra locking if standbyState ==
STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up
for hot standby queries until the database has reached consistency, and
we don't want to do the extra locking till then either, for fear of reading
corrupted pages (which bufmgr.c would complain about).  Fix by exporting a
new function from xlog.c that will report whether we're actually in hot
standby replay mode.

* To ensure full coverage of the index in the replay scan, btvacuumscan
would emit a dummy WAL record for the last page of the index, if no
vacuuming work had been done on that page.  However, if the last page
of the index is all-zero, that would result in corruption of said page,
since the functions called on it weren't prepared to handle that case.
There's no need to lock any such pages, so change the logic to target
the last normal leaf page instead.

The first two of these bugs were diagnosed by Andres Freund, the other one
by me.  Fixes based on ideas from Heikki Linnakangas and myself.

This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
2014-01-14 17:35:21 -05:00
Alvaro Herrera 423e1211a8 Accept pg_upgraded tuples during multixact freezing
The new MultiXact freezing routines introduced by commit 8e9a16ab8f
neglected to consider tuples that came from a pg_upgrade'd database; a
vacuum run that tried to freeze such tuples would die with an error such
as
ERROR: MultiXactId 11415437 does no longer exist -- apparent wraparound

To fix, ensure that GetMultiXactIdMembers is allowed to return empty
multis when the infomask bits are right, as is done in other callsites.

Per trouble report from F-Secure.

In passing, fix a copy&paste bug reported by Andrey Karpov from VIVA64
from their PVS-Studio static checked, that instead of setting relminmxid
to Invalid, we were setting relfrozenxid twice.  Not an important
mistake because that code branch is about relations for which we don't
use the frozenxid/minmxid values at all in the first place, but seems to
warrants a fix nonetheless.
2014-01-10 18:03:18 -03:00
Heikki Linnakangas c945af80cf Refactor checking whether we've reached the recovery target.
Makes the replay loop slightly more readable, by separating the concerns of
whether to stop and whether to delay, and how to extract the timestamp from
a record.

This has the user-visible change that the timestamp of the last applied
record is now updated after actually applying it. Before, it was updated
just before applying it. That meant that pg_last_xact_replay_timestamp()
could return the timestamp of a commit record that is in process of being
replayed, but not yet applied. Normally the difference is small, but if
min_recovery_apply_delay is set, there could be a significant delay between
reading a record and applying it.

Another behavioral change is that if you recover to a restore point, we stop
after the restore point record, not before it. It makes no difference as far
as running queries on the server is concerned, as applying a restore point
record changes nothing, but if examine the timeline history you will see
that the new timeline branched off just after the restore point record, not
before it. One practical consequence is that if you do PITR to the new
timeline, and set recovery target to the same named restore point again, it
will find and stop recovery at the same restore point. Conceptually, I think
it makes more sense to consider the restore point as part of the new
timeline's history than not.

In principle, setting the last-replayed timestamp before actually applying
the record was a bug all along, but it doesn't seem worth the risk to
backpatch, since min_recovery_apply_delay was only added in 9.4.
2014-01-09 14:00:39 +02:00
Heikki Linnakangas 3739e5ab93 Fix pause_at_recovery_target + recovery_target_inclusive combination.
If pause_at_recovery_target is set, recovery pauses *before* applying the
target record, even if recovery_target_inclusive is set. If you then
continue with pg_xlog_replay_resume(), it will apply the target record
before ending recovery. In other words, if you log in while it's paused
and verify that the database looks OK, ending recovery changes its state
again, possibly destroying data that you were tring to salvage with PITR.

Backpatch to 9.1, this has been broken since pause_at_recovery_target was
added.
2014-01-08 23:28:52 +02:00