Commit Graph

36355 Commits

Author SHA1 Message Date
Heikki Linnakangas 40dae7ec53 Make the handling of interrupted B-tree page splits more robust.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.

Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.

I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.

Reviewed by Peter Geoghegan.
2014-03-18 20:50:44 +02:00
Tom Lane b6ec7c92ac Fix some remaining int64 vestiges in contrib/test_shm_mq.
Andres Freund and Tom Lane
2014-03-18 14:26:44 -04:00
Robert Haas c676ac0f3f test_shm_mq: Use Size rather than uint64.
Commit 3bd261ca18 updated the API but
neglected to make the corresponding edits here.

Per Tom Lane and the buildfarm.
2014-03-18 13:30:19 -04:00
Robert Haas 49c0864d7e Documentation for logical decoding.
Craig Ringer, Andres Freund, Christian Kruse, with edits by me.
2014-03-18 13:20:01 -04:00
Robert Haas 8bdd12bbf0 Add pg_recvlogical, a tool to receive data logical decoding data.
This is fairly basic at the moment, but it's at least useful for
testing and debugging, and possibly more.

Andres Freund
2014-03-18 12:25:14 -04:00
Robert Haas 250f8a7bbe Rewrite comment for shm_mq_receive_bytes.
The comment and the code diverged at some point before the initial
commit of this feature, and I failed to notice.

Noted by Tom Lane.
2014-03-18 11:53:28 -04:00
Tom Lane f7271c4427 Fix relcache reference leak in refresh_by_match_merge().
One path through the loop over indexes forgot to do index_close().  Rather
than adding a fourth call, restructure slightly so that there's only one.

In passing, get rid of an unnecessary syscache lookup: the pg_index struct
for the index is already available from its relcache entry.

Per report from YAMAMOTO Takashi, though this is a bit different from his
suggested patch.  This is new code in HEAD, so no need for back-patch.
2014-03-18 11:36:53 -04:00
Robert Haas 3bd261ca18 Improve shm_mq portability around MAXIMUM_ALIGNOF and sizeof(Size).
Revise the original decision to expose a uint64-based interface and
use Size everywhere possible.  Avoid assuming that MAXIMUM_ALIGNOF is
8, or making any assumption about the relationship between that value
and sizeof(Size).  If MAXIMUM_ALIGNOF is bigger, we'll now insert
padding after the length word; if it's smaller, we are now prepared
to read and write the length word in chunks.

Per discussion with Tom Lane.
2014-03-18 11:23:13 -04:00
Tom Lane 19f2d6cdae Fix pg_dumpall option parsing: -i doesn't take an argument.
This used to work properly, but got fat-fingered in commit
3dee636e04.  Per bug #9620 from
Nicolas Payart.
2014-03-18 10:38:25 -04:00
Fujii Masao e726e59dc4 Fix help message and document in pg_receivexlog.
Add SLOTNAME placeholder to --slot option in help message and
document.
2014-03-18 21:15:45 +09:00
Robert Haas 79a4d24f31 Make it easy to detach completely from shared memory.
The new function dsm_detach_all() can be used either by postmaster
children that don't wish to take any risk of accidentally corrupting
shared memory; or by forked children of regular backends with
the same need.  This patch also updates the postmaster children that
already do PGSharedMemoryDetach() to do dsm_detach_all() as well.

Per discussion with Tom Lane.
2014-03-18 07:58:53 -04:00
Tom Lane 551fb5ac74 Release notes for 9.3.4, 9.2.8, 9.1.13, 9.0.17, 8.4.21. 2014-03-17 15:28:22 -04:00
Tom Lane d70cf811f7 During index build, check and elog (not just Assert) for broken HOT chain.
The recently-fixed bug in WAL replay could result in not finding a parent
tuple for a heap-only tuple.  The existing code would either Assert or
generate an invalid index entry, neither of which is desirable.  Throw a
regular error instead.
2014-03-17 12:36:11 -04:00
Heikki Linnakangas d663d4399e Fix thinko: have trueTriConsistentFn return GIN_TRUE.
While we're at it, also improve comments in ginlogic.c.
2014-03-17 17:29:04 +02:00
Fujii Masao 2bccced110 Fix typos in comments.
Thom Brown
2014-03-17 20:47:28 +09:00
Fujii Masao 5c6d9fc4b2 Fix bug in clean shutdown of walsender that pg_receiving is connecting to.
On clean shutdown, walsender waits for all WAL to be replicated to a standby,
and exits. It determined whether that replication had been completed by
checking whether its sent location had been equal to a standby's flush
location. Unfortunately this condition never becomes true when the standby
such as pg_receivexlog which always returns an invalid flush location is
connecting to walsender, and then walsender waits forever.

This commit changes walsender so that it just checks a standby's write
location if a flush location is invalid.

Back-patch to 9.1 where enough infrastructure for this exists.
2014-03-17 20:37:50 +09:00
Magnus Hagander 02703ff227 Fix small typo in comment
Michael Paquier
2014-03-17 09:09:21 +01:00
Alvaro Herrera bd1154edec plperl: Fix memory leak in hek2cstr
Backpatch all the way back to 9.1, where it was introduced by commit
50d89d42.

Reported by Sergey Burladyan in #9223
Author: Alex Hunsaker
2014-03-16 23:22:21 -03:00
Tom Lane 0268d21e5d Fix unportable shell-script syntax in pg_upgrade's test.sh.
I discovered the hard way that on some old shells, the locution
    FOO=""   unset FOO
does not behave the same as
    FOO="";  unset FOO
and in fact leaves FOO set to an empty string.  test.sh was inconsistently
spelling it different ways on adjacent lines.

This got broken relatively recently, in commit c737a2e56, so the lack of
field reports to date doesn't represent a lot of evidence that the problem
is rare.
2014-03-16 21:55:27 -04:00
Peter Eisentraut 2861e8e9cb Make punctuation consistent 2014-03-16 21:47:35 -04:00
Peter Eisentraut e2b959478c Fix whitespace 2014-03-16 21:47:35 -04:00
Tom Lane f4051e363c Fix advertised dispsize for libpq's sslmode connection parameter.
"8" was correct back when "disable" was the longest allowed value, but
since "verify-full" was added, it should be "12".  Given the lack of
complaints, I wouldn't be surprised if nobody is actually using these
values ... but still, if they're in the API, they should be right.

Noticed while pursuing a different problem.  It's been wrong for quite
a long time, so back-patch to all supported branches.
2014-03-16 21:43:40 -04:00
Magnus Hagander 0294023a6b Cleanups from the remove-native-krb5 patch
krb_srvname is actually not available anymore as a parameter server-side, since
with gssapi we accept all principals in our keytab. It's still used in libpq for
client side specification.

In passing remove declaration of krb_server_hostname, where all the functionality
was already removed.

Noted by Stephen Frost, though a different solution than his suggestion
2014-03-16 15:22:45 +01:00
Tom Lane e3c9f23250 First-draft release notes for 9.3.4.
As usual, the release notes for older branches will be made by cutting
these down, but put them up for community review first.
2014-03-15 15:58:59 -04:00
Tom Lane aba7f56779 Update time zone data files to tzdata release 2014a.
DST law changes in Fiji, Turkey; historical changes in Israel, Ukraine.
2014-03-15 13:36:07 -04:00
Heikki Linnakangas efada2b8e9 Fix race condition in B-tree page deletion.
In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.
2014-03-14 16:07:19 +02:00
Tom Lane 6c461cb92f Prevent interrupts while reporting non-ERROR elog messages.
This should eliminate the risk of recursive entry to syslog(3), which
appears to be the cause of the hang reported in bug #9551 from James
Morton.

Arguably, the real problem here is auth.c's willingness to turn on
ImmediateInterruptOK while executing fairly wide swaths of backend code.
We may well need to work at narrowing the code ranges in which the
authentication_timeout interrupt is enabled.  For the moment, though,
this is a cheap and reasonably noninvasive fix for a field-reported
failure; the other approach would be complex and not necessarily
bug-free itself.

Back-patch to all supported branches.
2014-03-13 20:59:42 -04:00
Tom Lane f70a78bc1f Allow psql to print COPY command status in more cases.
Previously, psql would print the "COPY nnn" command status only for COPY
commands executed server-side.  Now it will print that for frontend copies
too (including \copy).  However, we continue to suppress the command status
for COPY TO STDOUT, since in that case the copy data has been routed to the
same place that the command status would go, and there is a risk of the
status line being mistaken for another line of COPY data.  Doing that would
break existing scripts, and it doesn't seem worth the benefit --- this case
seems fairly analogous to SELECT, for which we also suppress the command
status.

Kumar Rajeev Rastogi, with substantial review by Amit Khandekar
2014-03-13 13:49:03 -04:00
Tom Lane 7bae0284ee Avoid transaction-commit race condition while receiving a NOTIFY message.
Use TransactionIdIsInProgress, then TransactionIdDidCommit, to distinguish
whether a NOTIFY message's originating transaction is in progress,
committed, or aborted.  The previous coding could accept a message from a
transaction that was still in-progress according to the PGPROC array;
if the client were fast enough at starting a new transaction, it might fail
to see table rows added/updated by the message-sending transaction.  Which
of course would usually be the point of receiving the message.  We noted
this type of race condition long ago in tqual.c, but async.c overlooked it.

The race condition probably cannot occur unless there are multiple NOTIFY
senders in action, since an individual backend doesn't send NOTIFY signals
until well after it's done committing.  But if two senders commit in close
succession, it's certainly possible that we could see the second sender's
message within the race condition window while responding to the signal
from the first one.

Per bug #9557 from Marko Tiikkaja.  This patch is slightly more invasive
than what he proposed, since it removes the now-redundant
TransactionIdDidAbort call.

Back-patch to 9.0, where the current NOTIFY implementation was introduced.
2014-03-13 12:02:54 -04:00
Heikki Linnakangas 16ff08b794 Fix a couple of typos in docs.
Thom Brown
2014-03-13 15:01:45 +02:00
Bruce Momjian 242c2737fb C comments: remove odd blank lines after #ifdef WIN32 lines
A few more
2014-03-13 01:42:24 -04:00
Bruce Momjian 886c0be3f6 C comments: remove odd blank lines after #ifdef WIN32 lines 2014-03-13 01:34:42 -04:00
Heikki Linnakangas a3115f0d9e Only WAL-log the modified portion in an UPDATE, if possible.
When a row is updated, and the new tuple version is put on the same page as
the old one, only WAL-log the part of the new tuple that's not identical to
the old. This saves significantly on the amount of WAL that needs to be
written, in the common case that most fields are not modified.

Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.
2014-03-12 23:28:36 +02:00
Heikki Linnakangas 17d787a3b1 Items on GIN data pages are no longer always 6 bytes; update gincostestimate.
Also improve the comments a bit.
2014-03-12 20:52:22 +02:00
Fujii Masao 588fb50715 Show PIDs of lock holders and waiters in log_lock_waits log message.
Christian Kruse, reviewed by Kumar Rajeev Rastogi.
2014-03-13 03:26:47 +09:00
Robert Haas a0b4c355c2 test_decoding: Documentation fix.
Andres Freund
2014-03-12 14:11:06 -04:00
Robert Haas 336a578b8c Fix incorrect assertion about historical snapshots.
Also fix some nearby comments.

Andres Freund
2014-03-12 14:07:41 -04:00
Robert Haas 890194f14d Comment fixes related to logical decoding.
Andres Freund, per complaints by Peter Eisentraut.
2014-03-12 14:03:09 -04:00
Heikki Linnakangas c5608ea26a Allow opclasses to provide tri-valued GIN consistent functions.
With the GIN "fast scan" feature, GIN can skip items without fetching all
the keys for them, if it can prove that they don't match regardless of
those keys. So far, it has done the proving by calling the boolean
consistent function with all combinations of TRUE/FALSE for the unfetched
keys, but since that's O(n^2), it becomes unfeasible with more than a few
keys. We can avoid calling consistent with all the combinations, if we can
tell the operator class implementation directly which keys are unknown.

This commit includes a triConsistent function for the built-in array and
tsvector opclasses.

Alexander Korotkov, with some changes by me.
2014-03-12 17:51:30 +02:00
Heikki Linnakangas fecfc2b913 In WAL replay, restore GIN metapage unconditionally to avoid torn page.
We don't take a full-page image of the GIN metapage; instead, the WAL record
contains all the information required to reconstruct it from scratch. But
to avoid torn page hazards, we must re-initialize it from the WAL record
every time, even if it already has a greater LSN, similar to how normal full
page images are restored.

This was highly unlikely to cause any problems in practice, because the GIN
metapage is small. We rely on an update smaller than a 512 byte disk sector
to be atomic elsewhere, at least in pg_control. But better safe than sorry,
and this would be easy to overlook if more fields are added to the metapage
so that it's no longer small.

Reported by Noah Misch. Backpatch to all supported versions.
2014-03-12 10:04:57 +02:00
Tom Lane e85a5ffba8 Fix tracking of psql script line numbers during \copy from another place.
Commit 08146775ac changed do_copy() to
temporarily scribble on pset.cur_cmd_source.  That was a mighty ugly bit of
code in any case, but in particular it broke handleCopyIn's ability to tell
whether it was reading from the current script source file (in which case
pset.lineno should be incremented for each line of COPY data), or from
someplace else (in which case it shouldn't).  The former case still worked,
the latter not so much.  The visible effect was that line numbers reported
for errors in a script file would be wrong if there were an earlier \copy
that was reading anything other than inline-in-the-script-file data.

To fix, introduce another pset field that holds the file do_copy wants the
COPY code to use.  This is a little bit ugly, but less so than passing the
file down explicitly through several layers that aren't COPY-specific.

Extracted from a larger patch by Kumar Rajeev Rastogi; that patch also
changes printing of COPY command tags, which is not a bug fix and shouldn't
get back-patched.  This particular idea was from a suggestion by Amit
Khandekar, if I'm reading the thread correctly.

Back-patch to 9.2 where the faulty code was introduced.
2014-03-10 15:47:40 -04:00
Robert Haas 8722017bbc Allow dynamic shared memory segments to be kept until shutdown.
Amit Kapila, reviewed by Kyotaro Horiguchi, with some further
changes by me.
2014-03-10 14:04:47 -04:00
Robert Haas 5a991ef869 Allow logical decoding via the walsender interface.
In order for this to work, walsenders need the optional ability to
connect to a database, so the "replication" keyword now allows true
or false, for backward-compatibility, and the new value "database"
(which causes the "dbname" parameter to be respected).

walsender needs to loop not only when idle but also when sending
decoded data to the user and when waiting for more xlog data to decode.
This means that there are now three separate loops inside walsender.c;
although some refactoring has been done here, this is still a bit ugly.

Andres Freund, with contributions from Álvaro Herrera, and further
review by me.
2014-03-10 13:50:28 -04:00
Robert Haas cb9a0c7987 Teach on_exit_reset() to discard pending cleanups for dsm.
If a postmaster child invokes fork() and then calls on_exit_reset, that
should be sufficient to let it exit() without breaking anything, but
dynamic shared memory broke that by not updating on_exit_reset() to
discard callbacks registered with dynamic shared memory segments.

Per investigation of a complaint from Tom Lane.
2014-03-10 10:17:19 -04:00
Simon Riggs 77049443a1 Correct copy/pasto in comment for REPLICA IDENTITY 2014-03-09 09:05:16 +00:00
Bruce Momjian 19026aadd8 doc: remove extra whitespace in SGML markup 2014-03-08 17:08:01 -05:00
Bruce Momjian 5024044a20 C comments: improve description of relfilenode uniqueness
Report by Antonin Houska
2014-03-08 12:20:30 -05:00
Bruce Momjian 11d205e2bd pg_ctl: improve handling of invalid data directory
Return '4' and report a meaningful error message when a non-existent or
invalid data directory is passed.  Previously, pg_ctl would just report
the server was not running.

Patch by me and Amit Kapila
Report from Peter Eisentraut
2014-03-08 12:15:25 -05:00
Bruce Momjian 3624acd342 docs: small adjustements to recent SELECT and pg_upgrade improvements 2014-03-08 11:26:47 -05:00
Bruce Momjian 8879fa09ee pg_upgrade: document delete problems with tablespaces inside the cluster directory
Per report by Marc Mamin
2014-03-07 22:46:38 -05:00