Commit Graph

2884 Commits

Author SHA1 Message Date
Tom Lane 5713f03973 Improve API of GenericXLogRegister().
Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future.  Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.

Alexander Korotkov, adjusted slightly by me
2016-04-12 11:42:06 -04:00
Tom Lane bdf7db8192 In generic WAL application and replay, ensure page "hole" is always zero.
The previous coding could allow the contents of the "hole" between pd_lower
and pd_upper to diverge during replay from what it had been when the update
was originally applied.  This would pose a problem if checksums were in
use, and in any case would complicate forensic comparisons between master
and slave servers.  So force the "hole" to contain zeroes, both at initial
application of a generically-logged action, and at replay.

Alexander Korotkov, adjusted slightly by me
2016-04-12 11:14:00 -04:00
Stephen Frost cd13471f2e Correct copyright for newly added genericdesc.c
It's 2016 these days (no, not entirely sure how we got here either).

Pointed out by Amit Langote
2016-04-12 08:45:09 -04:00
Tom Lane 660d5fb856 Further minor improvement in generic_xlog.c: always say REGBUF_STANDARD.
Since we're requiring pages handled by generic_xlog.c to be standard
format, specify REGBUF_STANDARD when doing a full-page image, so that
xloginsert.c can compress out the "hole" between pd_lower and pd_upper.
Given the current API in which this path will be taken only for a newly
initialized page, the hole is likely to be particularly large in such
cases, so that this oversight could easily be performance-significant.
I don't notice any particular change in the runtime of contrib/bloom's
regression test, though.
2016-04-10 00:24:28 -04:00
Tom Lane 68689c66ef Micro-optimize GenericXLogFinish().
Make the inner comparison loops of computeDelta() as tight as possible by
pulling considerations of valid and invalid ranges out of the inner loops,
and extending a match or non-match detection as far as possible before
deciding what to do next.  To keep this tractable, give up the possibility
of merging fragments across the pd_lower to pd_upper gap.  The fraction of
pages where that could happen (ie, there are 4 or fewer bytes in the gap,
*and* data changes immediately adjacent to it on both sides) is too small
to be worth spending cycles on.

Also, avoid two BLCKSZ-length memcpy()s by computing the delta before
moving data into the target buffer, instead of after.  This doesn't save
nearly as many cycles as being tenser about computeDelta(), but it still
seems worth doing.

On my machine, this patch cuts a full 40% off the runtime of
contrib/bloom's regression test.
2016-04-09 19:30:56 -04:00
Tom Lane 08e785436f Get rid of GenericXLogUnregister().
This routine is unsafe as implemented, because it invalidates the page
image pointers returned by previous GenericXLogRegister() calls.

Rather than complicate the API or the implementation to avoid that,
let's just get rid of it; the use-case for having it seems much
too thin to justify a lot of work here.

While at it, do some wordsmithing on the SGML docs for generic WAL.
2016-04-09 16:39:30 -04:00
Tom Lane db03cf375d Code review/prettification for generic_xlog.c.
Improve commentary, use more specific names for the delta fields,
const-ify pointer arguments where possible, avoid assuming that
initializing only the first element of a local array will guarantee
that the remaining elements end up as we need them.  (I think that
code in generic_redo actually worked, but only because InvalidBuffer
is zero; this is a particularly ugly way of depending on that ...)
2016-04-09 15:02:19 -04:00
Tom Lane 2dd318d277 Run pgindent on generic_xlog.c.
This code desperately needs some micro-optimization, and I'd like it
to be formatted a bit more nicely while I work on it.
2016-04-09 13:33:33 -04:00
Kevin Grittner 848ef42bb8 Add the "snapshot too old" feature
This feature is controlled by a new old_snapshot_threshold GUC.  A
value of -1 disables the feature, and that is the default.  The
value of 0 is just intended for testing.  Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect.  The xmin associated with a transaction ID does
still protect dead tuples.  A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.

This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
2016-04-08 14:36:30 -05:00
Kevin Grittner 8b65cf4c5e Modify BufferGetPage() to prepare for "snapshot too old" feature
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in.  It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point.  This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third.  The follow-on patch will change the places where the test
needs to be made.
2016-04-08 14:30:10 -05:00
Teodor Sigaev 8b99edefca Revert CREATE INDEX ... INCLUDING ...
It's not ready yet, revert two commits
690c543550 - unstable test output
386e3d7609 - patch itself
2016-04-08 21:52:13 +03:00
Teodor Sigaev 386e3d7609 CREATE INDEX ... INCLUDING (column[, ...])
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.

Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
2016-04-08 19:45:59 +03:00
Andres Freund 5364b357fb Increase maximum number of clog buffers.
Benchmarking has shown that the current number of clog buffers limits
scalability. We've previously increased the number in 33aaa139, but
that's not sufficient with a large number of clients.

We've benchmarked the cost of increasing the limit by benchmarking worst
case scenarios; testing showed that 128 buffers don't cause a
regression, even in contrived scenarios, whereas 256 does

There are a number of more complex patches flying around to address
various clog scalability problems, but this is simple enough that we can
get it into 9.6; and is beneficial even after those patches have been
applied.

It is a bit unsatisfactory to increase this in small steps every few
releases, but a better solution seems to require a rewrite of slru.c;
not something done quickly.

Author: Amit Kapila and Andres Freund
Discussion: CAA4eK1+-=18HOrdqtLXqOMwZDbC_15WTyHiFruz7BvVArZPaAw@mail.gmail.com
2016-04-08 08:25:59 -07:00
Robert Haas 25fe8b5f1a Add a 'parallel_degree' reloption.
The code that estimates what parallel degree should be uesd for the
scan of a relation is currently rather stupid, so add a parallel_degree
reloption that can be used to override the planner's rather limited
judgement.

Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila,
and me.  Some further hacking by me.
2016-04-08 11:14:56 -04:00
Robert Haas 719c84c1be Extend relations multiple blocks at a time to improve scalability.
Contention on the relation extension lock can become quite fierce when
multiple processes are inserting data into the same relation at the same
time at a high rate.  Experimentation shows the extending the relation
multiple blocks at a time improves scalability.

Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
2016-04-08 02:04:46 -04:00
Kevin Grittner fcff8a5751 Detect SSI conflicts before reporting constraint violations
While prior to this patch the user-visible effect on the database
of any set of successfully committed serializable transactions was
always consistent with some one-at-a-time order of execution of
those transactions, the presence of declarative constraints could
allow errors to occur which were not possible in any such ordering,
and developers had no good workarounds to prevent user-facing
errors where they were not necessary or desired.  This patch adds
a check for serialization failure ahead of duplicate key checking
so that if a developer explicitly (redundantly) checks for the
pre-existing value they will get the desired serialization failure
where the problem is caused by a concurrent serializable
transaction; otherwise they will get a duplicate key error.

While it would be better if the reads performed by the constraints
could count as part of the work of the transaction for
serialization failure checking, and we will hopefully get there
some day, this patch allows a clean and reliable way for developers
to work around the issue.  In many cases existing code will already
be doing the right thing for this to "just work".

Author: Thomas Munro, with minor editing of docs by me
Reviewed-by: Marko Tiikkaja, Kevin Grittner
2016-04-07 11:12:35 -05:00
Stephen Frost 1574783b4c Use GRANT system to manage access to sensitive functions
Now that pg_dump will properly dump out any ACL changes made to
functions which exist in pg_catalog, switch to using the GRANT system
to manage access to those functions.

This means removing 'if (!superuser()) ereport()' checks from the
functions themselves and then REVOKEing EXECUTE right from 'public' for
these functions in system_views.sql.

Reviews by Alexander Korotkov, Jose Luis Tallon
2016-04-06 21:45:32 -04:00
Simon Riggs 3fe3511d05 Generic Messages for Logical Decoding
API and mechanism to allow generic messages to be inserted into WAL that are
intended to be read by logical decoding plugins. This commit adds an optional
new callback to the logical decoding API.

Messages are either text or bytea. Messages can be transactional, or not, and
are identified by a prefix to allow multiple concurrent decoding plugins.

(Not to be confused with Generic WAL records, which are intended to allow crash
recovery of extensible objects.)

Author: Petr Jelinek and Andres Freund
Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs
Discussion: 5685F999.6010202@2ndquadrant.com
2016-04-06 10:05:41 +01:00
Magnus Hagander 7117685461 Implement backup API functions for non-exclusive backups
Previously non-exclusive backups had to be done using the replication protocol
and pg_basebackup. With this commit it's now possible to make them using
pg_start_backup/pg_stop_backup as well, as long as the backup program can
maintain a persistent connection to the database.

Doing this, backup_label and tablespace_map are returned as results from
pg_stop_backup() instead of being written to the data directory. This makes
the server safe from a crash during an ongoing backup, which can be a problem
with exclusive backups.

The old syntax of the functions remain and work exactly as before, but since the
new syntax is safer this should eventually be deprecated and removed.

Only reference documentation is included. The main section on backup still needs
to be rewritten to cover this, but since that is already scheduled for a separate
large rewrite, it's not included in this patch.

Reviewed by David Steele and Amit Kapila
2016-04-05 20:03:49 +02:00
Alvaro Herrera 890614d2b3 Display WAL pointer in rm_redo error callback
This makes it easier to identify the source of a recovery problem
in case of a bug or data corruption.
2016-04-04 18:12:12 -03:00
Alvaro Herrera c9ff752a85 Silence compiler warning
Reported by Peter Eisentraut to occur on 32bit systems
2016-04-04 17:07:23 -03:00
Simon Riggs 3e4b7d8798 Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the pin scan that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin
scan is skipped only in replay; the VACUUM code path on master is not
touched here and WAL is identical.

The current commit applies in all cases, effectively replacing commit
687f2cd7a0.
2016-04-03 17:46:09 +01:00
Teodor Sigaev 65578341af Add Generic WAL interface
This interface is designed to give an access to WAL for extensions which
could implement new access method, for example. Previously it was
impossible because restoring from custom WAL would need to access system
catalog to find a redo custom function. This patch suggests generic way
to describe changes on page with standart layout.

Bump XLOG_PAGE_MAGIC because of new record type.

Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and
	minor editorization by my
Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby,
	Michael Paquier
2016-04-01 12:21:48 +03:00
Alvaro Herrera 24c5f1a103 Enable logical slots to follow timeline switches
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline.  This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.

This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.

Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
2016-03-30 20:07:05 -03:00
Alvaro Herrera 3b02ea4f07 XLogReader general code cleanup
Some minor tweaks and comment additions, for cleanliness sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.

Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
2016-03-30 18:56:13 -03:00
Teodor Sigaev ccd6eb49a4 Introduce traversalValue for SP-GiST scan
During scan sometimes it would be very helpful to know some information about
parent node or all 	ancestor nodes. Right now reconstructedValue could be used
but it's not a right usage of it (range opclass uses that).

traversalValue is arbitrary piece of memory in separate MemoryContext while
reconstructedVale should have the same type as indexed column.

Subsequent patches for range opclass and quad4d tree will use it.

Author: Alexander Lebedev, Teodor Sigaev
2016-03-30 18:29:28 +03:00
Robert Haas 314cbfc5da Add new replication mode synchronous_commit = 'remote_apply'.
In this mode, the master waits for the transaction to be applied on
the remote side, not just written to disk.  That means that you can
count on a transaction started on the standby to see all commits
previously acknowledged by the master.

To make this work, the standby sends a reply after replaying each
commit record generated with synchronous_commit >= 'remote_apply'.
This introduces a small inefficiency: the extra replies will be sent
even by standbys that aren't the current synchronous standby.  But
previously-existing synchronous_commit levels make no attempt at all
to optimize which replies are sent based on what the primary cares
about, so this is no worse, and at least avoids any extra replies for
people not using the feature at all.

Thomas Munro, reviewed by Michael Paquier and by me.  Some additional
tweaks by me.
2016-03-29 21:29:49 -04:00
Andres Freund 1a7a43672b Don't use !! but != 0/NULL to force boolean evaluation.
I introduced several uses of !! to force bit arithmetic to be boolean,
but per discussion the project prefers != 0/NULL.

Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
2016-03-27 18:10:19 +02:00
Alvaro Herrera 473b932870 Support CREATE ACCESS METHOD
This enables external code to create access methods.  This is useful so
that extensions can add their own access methods which can be formally
tracked for dependencies, so that DROP operates correctly.  Also, having
explicit support makes pg_dump work correctly.

Currently only index AMs are supported, but we expect different types to
be added in the future.

Authors: Alexander Korotkov, Petr Jelínek
Reviewed-By: Teodor Sigaev, Petr Jelínek, Jim Nasby
Commitfest-URL: https://commitfest.postgresql.org/9/353/
Discussion: https://www.postgresql.org/message-id/CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com
2016-03-23 23:01:35 -03:00
Peter Eisentraut b555ed8102 Merge wal_level "archive" and "hot_standby" into new name "replica"
The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability.  This is now a long time ago.  We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.

Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses.  The old values
are still accepted but are converted internally.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: David Steele <david@pgmasters.net>
2016-03-18 23:56:03 +01:00
Robert Haas 3aff33aa68 Fix typos.
Oskari Saarenmaa
2016-03-15 18:06:11 -04:00
Alvaro Herrera 5bcc413f80 Fix typos in comments 2016-03-15 17:57:17 -03:00
Tom Lane ab4ff2889d Fix memory leak in repeated GIN index searches.
Commit d88976cfa1 removed this code from ginFreeScanKeys():
-		if (entry->list)
-			pfree(entry->list);
evidently in the belief that that ItemPointer array is allocated in the
keyCtx and so would be reclaimed by the following MemoryContextReset.
Unfortunately, it isn't and it won't.  It'd likely be a good idea for
that to become so, but as a simple and back-patchable fix in the
meantime, restore this code to ginFreeScanKeys().

Also, add a similar pfree to where startScanEntry() is about to zero out
entry->list.  I am not sure if there are any code paths where this
change prevents a leak today, but it seems like cheap future-proofing.

In passing, make the initial allocation of so->entries[] use palloc
not palloc0.  The code doesn't depend on unused entries being zero;
if it did, the array-enlargement code in ginFillScanEntry() would be
wrong.  So using palloc0 initially can only serve to confuse readers
about what the invariant is.

Per report from Felipe de Jesús Molina Bravo, via Jaime Casanova in
<CAJGNTeMR1ndMU2Thpr8GPDUfiHTV7idELJRFusA5UXUGY1y-eA@mail.gmail.com>
2016-03-13 16:44:31 -04:00
Robert Haas 481c76abf4 Fix a typo, and remove unnecessary pgstat_report_wait_end().
Per Amit Kapila.
2016-03-11 07:34:00 -05:00
Robert Haas 53be0b1add Provide much better wait information in pg_stat_activity.
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting.  Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.

Amit Kapila, Ildus Kurbangaliev, reviewed by me.  Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
2016-03-10 12:44:09 -05:00
Simon Riggs e0694cf9c7 Reduce size of two phase file header
Previously 2PC header was fixed at 200 bytes, which in most cases wasted
WAL space for a workload using 2PC heavily.

Pavan Deolasee, reviewed by Petr Jelinek
2016-03-10 12:51:46 +00:00
Simon Riggs fcb4bfddb6 Reduce lock level for altering fillfactor
Fabrízio de Royes Mello and Simon Riggs
2016-03-10 12:07:33 +00:00
Andres Freund 1d4a0ab19a Avoid unlikely data-loss scenarios due to rename() without fsync.
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.

Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability.  The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash in an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.

Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
2016-03-09 18:53:53 -08:00
Tom Lane a298a1e06f Fix incorrect handling of NULL index entries in indexed ROW() comparisons.
An index search using a row comparison such as ROW(a, b) > ROW('x', 'y')
would stop upon reaching a NULL entry in the "b" column, ignoring the
fact that there might be non-NULL "b" values associated with later values
of "a".  This happens because _bt_mark_scankey_required() marks the
subsidiary scankey for "b" as required, which is just wrong: it's for
a column after the one with the first inequality key (namely "a"), and
thus can't be considered a required match.

This bit of brain fade dates back to the very beginnings of our support
for indexed ROW() comparisons, in 2006.  Kind of astonishing that no one
came across it before Glen Takahashi, in bug #14010.

Back-patch to all supported versions.

Note: the given test case doesn't actually fail in unpatched 9.1, evidently
because the fix for bug #6278 (i.e., stopping at nulls in either scan
direction) is required to make it fail.  I'm sure I could devise a case
that fails in 9.1 as well, perhaps with something involving making a cursor
back up; but it doesn't seem worth the trouble.
2016-03-09 14:51:22 -05:00
Robert Haas b6fb6471f6 Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.

As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else.  A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting.  But this gets the basic framework in place.

Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 12:08:58 -05:00
Robert Haas 734f86d50d Add new flags argument for xl_heap_visible to heap2_desc.
Masahiko Sawada
2016-03-08 13:28:22 -05:00
Robert Haas 77a1d1e798 Department of second thoughts: remove PD_ALL_FROZEN.
Commit a892234f83 added a second bit per
page to the visibility map, which still seems like a good idea, but it
also added a second page-level bit alongside PD_ALL_VISIBLE to track
whether the visibility map bit was set.  That no longer seems like a
clever plan, because we don't really need that bit for anything.  We
always clear both bits when the page is modified anyway.

Patch by me, reviewed by Kyotaro Horiguchi and Masahiko Sawada.
2016-03-08 08:46:48 -05:00
Fujii Masao d34794f7d5 Ignore recovery_min_apply_delay until recovery has reached consistent state
Previously recovery_min_apply_delay was applied even before recovery
had reached consistency. This could cause us to wait a long time
unexpectedly for read-only connections to be allowed. It's problematic
because the standby was useless during that wait time.

This patch changes recovery_min_apply_delay so that it's applied once
the database has reached the consistent state. That is, even if the delay
is set, the standby tries to replay WAL records as fast as possible until
it has reached consistency.

Author: Michael Paquier
Reviewed-By: Julien Rouhaud
Reported-By: Greg Clough
Backpatch: 9.4, where recovery_min_apply_delay was added
Bug: #13770
Discussion: http://www.postgresql.org/message-id/20151111155006.2644.84564@wrigleys.postgresql.org
2016-03-06 02:29:04 +09:00
Robert Haas 6fcde8a5c8 Minor improvements to transaction manager README.
A simple SELECT is handled by PortalRunSelect, not ProcessQuery.  Also,
the previous indentation was unclear: change it so that a deeper level
of indentation indicates that the outer function calls the inner one.

Stas Kelvich
2016-03-04 14:12:28 -05:00
Robert Haas df4685fb0c Minor optimizations based on ParallelContext having nworkers_launched.
Originally, we didn't have nworkers_launched, so code that used parallel
contexts had to be preprared for the possibility that not all of the
workers requested actually got launched.  But now we can count on knowing
the number of workers that were successfully launched, which can shave
off a few cycles and simplify some code slightly.

Amit Kapila, reviewed by Haribabu Kommi, per a suggestion from Peter
Geoghegan.
2016-03-04 12:59:10 -05:00
Simon Riggs c7111d11b1 Revert buggy optimization of index scans
606c0123d6 attempted to reduce cost of index scans using > and <
strategies, though got that completely wrong in a few complex cases.

Revert whole patch until we find a safe optimization.
2016-03-03 09:53:43 +00:00
Robert Haas a892234f83 Change the format of the VM fork to add a second bit per page.
The new bit indicates whether every tuple on the page is already frozen.
It is cleared only when the all-visible bit is cleared, and it can be
set only when we vacuum a page and find that every tuple on that page is
both visible to every transaction and in no need of any future
vacuuming.

A future commit will use this new bit to optimize away full-table scans
that would otherwise be triggered by XID wraparound considerations.  A
page which is merely all-visible must still be scanned in that case, but
a page which is all-frozen need not be.  This commit does not attempt
that optimization, although that optimization is the goal here.  It
seems better to get the basic infrastructure in place first.

Per discussion, it's very desirable for pg_upgrade to automatically
migrate existing VM forks from the old format to the new format.  That,
too, will be handled in a follow-on patch.

Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit
Kapila, Simon Riggs, Andres Freund, and others, and substantially
revised by me.
2016-03-01 21:49:41 -05:00
Simon Riggs 481725c0ba Correct StartupSUBTRANS for page wraparound
StartupSUBTRANS() incorrectly handled cases near the max pageid in the subtrans
data structure, which in some cases could lead to errors in startup for Hot
Standby.
This patch wraps the pageids correctly, avoiding any such errors.
Identified by exhaustive crash testing by Jeff Janes.

Jeff Janes
2016-02-19 08:31:12 +00:00
Andres Freund 7975c5e0a9 Allow the WAL writer to flush WAL at a reduced rate.
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al.  But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4.  The reason for this slowdown is
that if there are independent writes to the underlying devices, for
example because shared buffers is a lot smaller than the hot data set,
or because a checkpoint is ongoing, the fdatasync() calls force cache
flushes to be emitted to the storage.

This is achieved by flushing WAL only if the last flush was longer than
wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC)
unflushed blocks are pending. Based on some tests the default for
wal_writer_delay is 1MB, which seems to work well both on SSD and
rotational media.

To avoid negative performance impact due to 4de82f7d7 an earlier
commit (db76b1e) made SetHintBits() more likely to succeed; preventing
performance regressions in the pgbench tests I performed.

Discussion: 20160118163908.GW10941@awork2.anarazel.de
2016-02-16 00:56:34 +01:00
Joe Conway 59a884e985 Change delimiter used for display of NextXID
NextXID has been rendered in the form of a pg_lsn even though it
really is not. This can cause confusion, so change the format from
%u/%u to %u:%u, per discussion on hackers.

Complaint by me, patch by me and Bruce, reviewed by Michael Paquier
and Alvaro. Applied to HEAD only.

Author: Joe Conway, Bruce Momjian
Reviewed-by: Michael Paquier, Alvaro Herrera
Backpatch-through: master
2016-02-12 14:23:59 -08:00
Robert Haas 63461a63f9 Make builtin lwlock tranche names consistent.
Previously, we had a mix of styles.

Amit Kapila
2016-02-12 08:07:11 -05:00
Tom Lane d18643c4a6 Shift the responsibility for emitting "database system is shut down".
Historically this message has been emitted at the end of ShutdownXLOG().
That's not an insane place for it in a standalone backend, but in the
postmaster environment we've grown a fair amount of stuff that happens
later, including archiver/walsender shutdown, stats collector shutdown,
etc.  Recent buildfarm experimentation showed that on slower machines
there could be many seconds' delay between finishing ShutdownXLOG() and
actual postmaster exit.  That's fairly confusing, both for testing
purposes and for DBAs.  Hence, move the code that prints this message
into UnlinkLockFiles(), so that it comes out just after we remove the
postmaster's pidfile.  That is a more appropriate definition of "is shut
down" from the point of view of "pg_ctl stop", for example.  In general,
removing the pidfile should be the last externally-visible action of
either a postmaster or a standalone backend; compare commit
d73d14c271 for instance.  So this seems
like a reasonably future-proof approach.
2016-02-11 14:14:22 -05:00
Tom Lane c5e9b77127 Revert "Temporarily make pg_ctl and server shutdown a whole lot chattier."
This reverts commit 3971f64843 and a
couple of followon debugging commits; I think we've learned what we can
from them.
2016-02-10 16:01:04 -05:00
Tom Lane 7351e18286 Add more chattiness in server shutdown.
Early returns from the buildfarm show that there's a bit of a gap in the
logging I added in 3971f64843b02e4a: the portion of CreateCheckPoint()
after CheckPointGuts() can take a fair amount of time.  Add a few more
log messages in that section of code.  This too shall be reverted later.
2016-02-09 11:21:46 -05:00
Tom Lane 3971f64843 Temporarily make pg_ctl and server shutdown a whole lot chattier.
This is a quick hack, due to be reverted when its purpose has been served,
to try to gather information about why some of the buildfarm critters
regularly fail with "postmaster does not shut down" complaints.  Maybe they
are just really overloaded, but maybe something else is going on.  Hence,
instrument pg_ctl to print the current time when it starts waiting for
postmaster shutdown and when it gives up, and add a lot of logging of the
current time in the server's checkpoint and shutdown code paths.

No attempt has been made to make this pretty.  I'm not even totally sure
if it will build on Windows, but we'll soon find out.
2016-02-08 18:43:11 -05:00
Robert Haas 7c944bd903 Introduce a new GUC force_parallel_mode for testing purposes.
When force_parallel_mode = true, we enable the parallel mode restrictions
for all queries for which this is believed to be safe.  For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process.  When force_parallel_mode = regress, make additional
changes to allow the regression tests to run cleanly even though parallel
workers have been injected under the hood.

Taken together, this facilitates both better user testing and better
regression testing of the parallelism code.

Robert Haas, with help from Amit Kapila and Rushabh Lathia.
2016-02-07 11:41:33 -05:00
Robert Haas a1c1af2a1f Introduce group locking to prevent parallel processes from deadlocking.
For locking purposes, we now regard heavyweight locks as mutually
non-conflicting between cooperating parallel processes.  There are some
possible pitfalls to this approach that are not to be taken lightly,
but it works OK for now and can be changed later if we find a better
approach.  Without this, it's very easy for parallel queries to
silently self-deadlock if the user backend holds strong relation locks.

Robert Haas, with help from Amit Kapila.  Thanks to Noah Misch and
Andres Freund for extensive discussion of possible issues with this
approach.
2016-02-07 10:16:13 -05:00
Teodor Sigaev f25d07d99f Fix lossy KNN GiST when ordering operator returns non-float8 value.
KNN GiST with recheck flag should return to executor the same type as ordering
operator, GiST detects this type by looking to return type of function which
implements ordering operator. But occasionally detecting code works after
replacing ordering operator function to distance support function.
Distance support function always returns float8, so, detecting code get float8
instead of actual return type of ordering operator.

Built-in opclasses don't have ordering operator which doesn't return
non-float8 value, so, tests are impossible here, at least now.

Backpatch to 9.5 where lozzy KNN was introduced.

Author: Alexander Korotkov
Report by: Artur Zakirov
2016-02-02 15:20:33 +03:00
Robert Haas 7191ce8bea Make all built-in lwlock tranche IDs fixed.
This makes the values more stable, which seems like a good thing for
anybody who needs to look at at them.

Alexander Korotkov and Amit Kapila
2016-02-02 06:45:55 -05:00
Heikki Linnakangas 61ce1e8f15 Fix misspelled function name in comment. 2016-02-01 10:10:24 +02:00
Peter Eisentraut 9217bf3961 Fix whitespace 2016-01-30 15:58:20 -05:00
Fujii Masao 7f46eaf035 Add gin_clean_pending_list function to clean up GIN pending list
This function cleans up the pending list of the GIN index by
moving entries in it to the main GIN data structure in bulk.
It returns the number of pages cleaned up from the pending list.

This function is useful, for example, when the pending list
needs to be cleaned up *quickly* to improve the performance of
the search using GIN index. VACUUM can do the same thing, too,
but it may take days to run on a large table.

Jeff Janes,
reviewed by Julien Rouhaud, Jaime Casanova, Alvaro Herrera and me.

Discussion: CAMkU=1x8zFkpfnozXyt40zmR3Ub_kHu58LtRmwHUKRgQss7=iQ@mail.gmail.com
2016-01-28 12:57:52 +09:00
Tom Lane cc988fbb0b Improve ResourceOwners' behavior for large numbers of owned objects.
The original coding was quite fast so long as objects were always
released in reverse order of addition; otherwise, it degenerated into
O(N^2) behavior due to searching for the array element to delete.
Improve matters by switching to hashed storage when the number of
objects of a given type exceeds 64.  (The cutover point is open to
discussion, of course, but some simple performance testing suggests
that hashing has enough overhead to be a loser below there.)

Also, refactor resowner.c so that we don't need N copies of the array
management code.  Since all the resource IDs the code currently needs
to deal with are either pointers or integers, it seems sufficient to
create a one-size-fits-all infrastructure in which everything is
converted to a Datum for storage.

Aleksander Alekseev, reviewed by Stas Kelvich, further fixes by me
2016-01-26 15:20:30 -05:00
Tom Lane d9b9289c83 Suppress compiler warning.
Given the limited range of i, these shifts should not cause any
problem, but that apparently doesn't stop some compilers from
whining about them.

David Rowley
2016-01-21 21:14:07 -05:00
Tom Lane be44ed27b8 Improve index AMs' opclass validation procedures.
The amvalidate functions added in commit 65c5fcd353 were on the
crude side.  Improve them in a few ways:

* Perform signature checking for operators and support functions.

* Apply more thorough checks for missing operators and functions,
where possible.

* Instead of reporting problems as ERRORs, report most problems as INFO
messages and make the amvalidate function return FALSE.  This allows
more than one problem to be discovered per run.

* Report object names rather than OIDs, and work a bit harder on making
the messages understandable.

Also, remove a few more opr_sanity regression test queries that are
now superseded by the amvalidate checks.
2016-01-21 19:47:15 -05:00
Fujii Masao 38710a374e Remove unused argument from ginInsertCleanup()
It's an oversight in commit dc943ad.
2016-01-22 01:22:56 +09:00
Simon Riggs c80b31d557 Refactor headers to split out standby defs
Jeff Janes
2016-01-20 18:51:34 -08:00
Simon Riggs 978b2f65aa Speedup 2PC by skipping two phase state files in normal path
2PC state info is written only to WAL at PREPARE, then read back from WAL at
COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr
checkpoint cycle will be written to disk in the same form as previously. Crash
recovery path is not altered. Measured performance gains of 50-100% for short
2PC transactions by completely avoiding writing files and fsyncing. Other
optimizations still available, further patches in related areas expected.

Stas Kelvich and heavily edited by Simon Riggs

Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas,
a concrete example of how Postgres-XC has fed back ideas into PostgreSQL.

Reviewed by Michael Paquier, Jeff Janes and Andres Freund
Performance testing by Jesper Pedersen
2016-01-20 18:40:44 -08:00
Simon Riggs 422a55a687 Refactor to create generic WAL page read callback
Previously we didn’t have a generic WAL page read callback function,
surprisingly. Logical decoding has logical_read_local_xlog_page(), which was
actually generic, so move that to xlogfunc.c and rename to
read_local_xlog_page().
Maintain logical_read_local_xlog_page() so existing callers still work.

As requested by Michael Paquier, Alvaro Herrera and Andres Freund
2016-01-20 17:18:58 -08:00
Tom Lane 9ff60273e3 Fix assorted inconsistencies in GiST opclass support function declarations.
The conventions specified by the GiST SGML documentation were widely
ignored.  For example, the strategy-number argument for "consistent" and
"distance" functions is specified to be a smallint, but most of the
built-in support functions declared it as an integer, and for that matter
the core code passed it using Int32GetDatum not Int16GetDatum.  None of
that makes any real difference at runtime, but it's quite confusing for
newcomers to the code, and it makes it very hard to write an amvalidate()
function that checks support function signatures.  So let's try to instill
some consistency here.

Another similar issue is that the "query" argument is not of a single
well-defined type, but could have different types depending on the strategy
(corresponding to search operators with different righthand-side argument
types).  Some of the functions threw up their hands and declared the query
argument as being of "internal" type, which surely isn't right ("any" would
have been more appropriate); but the majority position seemed to be to
declare it as being of the indexed data type, corresponding to a search
operator with both input types the same.  So I've specified a convention
that that's what to do always.

Also, the result of the "union" support function actually must be of the
index's storage type, but the documentation suggested declaring it to
return "internal", and some of the functions followed that.  Standardize
on telling the truth, instead.

Similarly, standardize on declaring the "same" function's inputs as
being of the storage type, not "internal".

Also, somebody had forgotten to add the "recheck" argument to both
the documentation of the "distance" support function and all of their
SQL declarations, even though the C code was happily using that argument.
Clean that up too.

Fix up some other omissions in the docs too, such as documenting that
union's second input argument is vestigial.

So far as the errors in core function declarations go, we can just fix
pg_proc.h and bump catversion.  Adjusting the erroneous declarations in
contrib modules is more debatable: in principle any change in those
scripts should involve an extension version bump, which is a pain.
However, since these changes are purely cosmetic and make no functional
difference, I think we can get away without doing that.
2016-01-19 12:04:36 -05:00
Tom Lane 65c5fcd353 Restructure index access method API to hide most of it at the C level.
This patch reduces pg_am to just two columns, a name and a handler
function.  All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function.  This is similar to
the designs we've adopted for FDWs and tablesample methods.  There
are multiple advantages.  For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.

A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL.  We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.

Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
2016-01-17 19:36:59 -05:00
Simon Riggs e63bb4549a Add new user fn pg_current_xlog_flush_location()
Tomas Vondra, reviewed by Michael Paquier and Amit Kapila
Minor edits by me
2016-01-12 07:54:52 +00:00
Simon Riggs 1e29e6324c Maintain local LogwrtResult consistently
Teach GetFlushRecPtr() to update LogwrtResult cache as performed by all other
functions in xlog.c
2016-01-12 07:33:20 +00:00
Simon Riggs b602842613 Revoke change to rmgr desc of btree vacuum
Per discussion with Andres Freund
2016-01-09 18:31:08 +00:00
Simon Riggs 687f2cd7a0 Avoid pin scan for replay of XLOG_BTREE_VACUUM
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin scan
is skipped only in replay; the VACUUM code path on master is not touched here.

The current commit still performs the pin scan for toast indexes, though this
can also be avoided if we recheck scans on toast indexes. Later patch will
address this.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
2016-01-09 10:10:08 +00:00
Tom Lane 7157fe80f4 Fix overly-strict assertions in spgtextproc.c.
spg_text_inner_consistent is capable of reconstructing an empty string
to pass down to the next index level; this happens if we have an empty
string coming in, no prefix, and a dummy node label.  (In practice, what
is needed to trigger that is insertion of a whole bunch of empty-string
values.)  Then, we will arrive at the next level with in->level == 0
and a non-NULL (but zero length) in->reconstructedValue, which is valid
but the Assert tests weren't expecting it.

Per report from Andreas Seltenreich.  This has no impact in non-Assert
builds, so should not be a problem in production, but back-patch to
all affected branches anyway.

In passing, remove a couple of useless variable initializations and
shorten the code by not duplicating DatumGetPointer() calls.
2016-01-02 16:24:50 -05:00
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Noah Misch 3cd1ba147e Fix comments about WAL rule "write xlog before data" versus pg_multixact.
Recovery does not achieve its goal of zeroing all pg_multixact entries
whose accompanying WAL records never reached disk.  Remove that claim
and justify its expendability.  Detail the need for TrimMultiXact(),
which has little in common with the TrimCLOG() rationale.  Merge two
tightly-related comments.  Stop presenting pg_multixact as specific to
heap_lock_tuple(); PostgreSQL 9.3 extended its use to heap_update().

Noticed while investigating a report from Andres Freund.
2016-01-01 01:46:46 -05:00
Joe Conway 241448b23a Rename (new|old)estCommitTs to (new|old)estCommitTsXid
The variables newestCommitTs and oldestCommitTs sound as if they are
timestamps, but in fact they are the transaction Ids that correspond
to the newest and oldest timestamps rather than the actual timestamps.
Rename these variables to reflect that they are actually xids: to wit
newestCommitTsXid and oldestCommitTsXid respectively. Also modify
related code in a similar fashion, particularly the user facing output
emitted by pg_controldata and pg_resetxlog.

Complaint and patch by me, review by Tom Lane and Alvaro Herrera.
Backpatch to 9.5 where these variables were first introduced.
2015-12-28 12:34:11 -08:00
Tom Lane 3d2b31e30e Fix brin_summarize_new_values() to check index type and ownership.
brin_summarize_new_values() did not check that the passed OID was for
an index at all, much less that it was a BRIN index, and would fail in
obscure ways if it wasn't (possibly damaging data first?).  It also
lacked any permissions test; by analogy to VACUUM, we should only allow
the table's owner to summarize.

Noted by Jeff Janes, fix by Michael Paquier and me
2015-12-26 12:56:09 -05:00
Robert Haas 9a51698bae Fix typo in comment.
Amit Langote
2015-12-18 12:03:15 -05:00
Robert Haas 3fed417452 Provide a way to predefine LWLock tranche IDs.
It's a bit cumbersome to use LWLockNewTrancheId(), because the returned
value needs to be shared between backends so that each backend can call
LWLockRegisterTranche() with the correct ID.  So, for built-in tranches,
use a hard-coded value instead.

This is motivated by an upcoming patch adding further built-in tranches.

Andres Freund and Robert Haas
2015-12-15 11:48:19 -05:00
Andres Freund cca705a5d9 Fix bug in SetOffsetVacuumLimit() triggered by find_multixact_start() failure.
Previously, if find_multixact_start() failed, SetOffsetVacuumLimit() would
install 0 into MultiXactState->offsetStopLimit if it previously succeeded.
Luckily, there are no known cases where find_multixact_start() will return
an error in 9.5 and above. But if it were to happen, for example due to
filesystem permission issues, it'd be somewhat bad: GetNewMultiXactId()
could continue allocating mxids even if close to a wraparound, or it could
erroneously stop allocating mxids, even if no wraparound is looming.  The
wrong value would be corrected the next time SetOffsetVacuumLimit() is
called, or by a restart.

Reported-By: Noah Misch, although this is not his preferred fix
Discussion: 20151210140450.GA22278@alap3.anarazel.de
Backpatch: 9.5, where the bug was introduced as part of 4f627f
2015-12-14 11:34:16 +01:00
Alvaro Herrera 69e7235c93 Fix commit timestamp initialization
This module needs explicit initialization in order to replay WAL records
in recovery, but we had broken this recently following changes to make
other (stranger) scenarios work correctly.  To fix, rework the
initialization sequence so that it always takes place before WAL replay
commences for both master and standby.

I could have gone for a more localized fix that just added a "startup"
call for the master server, but it seemed better to restructure the
existing callers as well so that the whole thing made more sense.  As a
drawback, there is more control logic in xlog.c now than previously, but
doing otherwise meant passing down the ControlFile flag, which seemed
uglier as a whole.

This also meant adding a check to not re-execute ActivateCommitTs if it
had already been called.

Reported by Fujii Masao.

Backpatch to 9.5.
2015-12-11 14:30:43 -03:00
Peter Eisentraut a351705d8a Improve some messages 2015-12-10 22:05:27 -05:00
Andres Freund e3f4cfc7aa Fix bug leading to restoring unlogged relations from empty files.
At the end of crash recovery, unlogged relations are reset to the empty
state, using their init fork as the template. The init fork is copied to
the main fork without going through shared buffers. Unfortunately WAL
replay so far has not necessarily flushed writes from shared buffers to
disk at that point. In normal crash recovery, and before the
introduction of 'fast promotions' in fd4ced523 / 9.3, the
END_OF_RECOVERY checkpoint flushes the buffers out in time. But with
fast promotions that's not the case anymore.

To fix, force WAL writes targeting the init fork to be flushed
immediately (using the new FlushOneBuffer() function). In 9.5+ that
flush can centrally be triggered from the code dealing with restoring
full page writes (XLogReadBufferForRedoExtended), in earlier releases
that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay
function.

Backpatch to 9.1, even if this currently is only known to trigger in
9.3+. Flushing earlier is more robust, and it is advantageous to keep
the branches similar.

Typical symptoms of this bug are errors like
'ERROR:  index "..." contains unexpected zero page at block 0'
shortly after promoting a node.

Reported-By: Thom Brown
Author: Andres Freund and Michael Paquier
Discussion: 20150326175024.GJ451@alap3.anarazel.de
Backpatch: 9.1-
2015-12-10 16:29:26 +01:00
Alvaro Herrera 820ddb2c2f Further tweak commit_timestamp behavior
As pointed out by Fujii Masao, we weren't quite there on a standby
behaving sanely: first because we were failing to acquire the correct
state in the case where no XLOG_PARAMETER_CHANGE message was sent
(because a checkpoint had already happened after the setting was changed
in the master, and then the standby was restarted); and second because
promoting the standby with the feature enabled failed to activate it if
the master had the feature disabled.

This patch fixes both those misbehaviors hopefully without
re-introducing any old problems.

Also change the hint emitted in a standby together with the error
message about the feature being disabled, to make it point out that the
place to chance the setting is the master.  Otherwise, if the setting is
already enabled in the standby, it is very confusing to have it say that
the setting must be enabled ...

Authors: Álvaro Herrera, Petr Jelínek.
Backpatch to 9.5.
2015-12-03 19:22:31 -03:00
Robert Haas 74d0d5f3eb Fix typo in comment.
Amit Langote
2015-11-19 16:45:39 -05:00
Andres Freund d3c8ac114f Remove function names from some elog() calls in heapam.c.
At least one of the names was, due to a function renaming late in the
development of ON CONFLICT, wrong. Since including function names in
error messages is against the message style guide anyway, remove them
from the messages.

Discussion: CAM3SWZT8paz=usgMVHm0XOETkQvzjRtAUthATnmaHQQY0obnGw@mail.gmail.com
Backpatch: 9.5, where ON CONFLICT was introduced
2015-11-19 01:37:58 +01:00
Peter Eisentraut 5db837d3f2 Message improvements 2015-11-16 21:39:23 -05:00
Robert Haas fe702a7b3f Move each SLRU's lwlocks to a separate tranche.
This makes it significantly easier to identify these lwlocks in
LWLOCK_STATS or Trace_lwlocks output.  It's also arguably better
from a modularity standpoint, since lwlock.c no longer needs to
know anything about the LWLock needs of the higher-level SLRU
facility.

Ildus Kurbangaliev, reviewd by Álvaro Herrera and by me.
2015-11-12 14:59:09 -05:00
Robert Haas 64b2e7ad91 Pass extra data to bgworkers, and use this to fix parallel contexts.
Up until now, the total amount of data that could be passed to a
background worker at startup was one datum, which can be a small as
4 bytes on some systems.  That's enough to pass a dsm_handle or an
array index, but not much else.  Add a bgw_extra flag to the
BackgroundWorker struct, allowing up to 128 bytes to be passed to
a new worker on any platform.

Use this to fix a problem I recently discovered with the parallel
context machinery added in 9.5: the master assigns each worker an
array index, and each worker subsequently assigns itself an array
index, and there's nothing to guarantee that the two sets of indexes
match, leading to chaos.

Normally, I would not back-patch the change to add bgw_extra, since it
is basically a feature addition.  However, since 9.5 is still in beta
and there seems to be no other sensible way to repair the broken
parallel context machinery, back-patch to 9.5.  Existing background
worker code can ignore the bgw_extra field without a problem, but
might need to be recompiled since the structure size has changed.

Report and patch by me.  Review by Amit Kapila.
2015-11-05 12:13:56 -05:00
Kevin Grittner 585e2a3b1a Fix serialization anomalies due to race conditions on INSERT.
On insert the CheckForSerializableConflictIn() test was performed
before the page(s) which were going to be modified had been locked
(with an exclusive buffer content lock).  If another process
acquired a relation SIReadLock on the heap and scanned to a page on
which an insert was going to occur before the page was so locked,
a rw-conflict would be missed, which could allow a serialization
anomaly to be missed.  The window between the check and the page
lock was small, so the bug was generally not noticed unless there
was high concurrency with multiple processes inserting into the
same table.

This was reported by Peter Bailis as bug #11732, by Sean Chittenden
as bug #13667, and by others.

The race condition was eliminated in heap_insert() by moving the
check down below the acquisition of the buffer lock, which had been
the very next statement.  Because of the loop locking and unlocking
multiple buffers in heap_multi_insert() a check was added after all
inserts were completed.  The check before the start of the inserts
was left because it might avoid a large amount of work to detect a
serialization anomaly before performing the all of the inserts and
the related WAL logging.

While investigating this bug, other SSI bugs which were even harder
to hit in practice were noticed and fixed, an unnecessary check
(covered by another check, so redundant) was removed from
heap_update(), and comments were improved.

Back-patch to all supported branches.

Kevin Grittner and Thomas Munro
2015-10-31 14:43:34 -05:00
Robert Haas 3a1f8611f2 Update parallel executor support to reuse the same DSM.
Commit b0b0d84b3d purported to make it
possible to relaunch workers using the same parallel context, but it had
an unpleasant race condition: we might reinitialize after the workers
have sent their last control message but before they have dettached the
DSM, leaving to crashes.  Repair by introducing a new ParallelContext
operation, ReinitializeParallelDSM.

Adjust execParallel.c to use this new support, so that we can rescan a
Gather node by relaunching workers but without needing to recreate the
DSM.

Amit Kapila, with some adjustments by me.  Extracted from latest parallel
sequential scan patch.
2015-10-30 10:44:54 +01:00
Peter Eisentraut a8d585c091 Message style improvements
Message style, plurals, quoting, spelling, consistency with similar
messages
2015-10-28 20:38:36 -04:00
Alvaro Herrera 21a4e4a4c9 Fix BRIN free space computations
A bug in the original free space computation made it possible to
return a page which wasn't actually able to fit the item.  Since the
insertion code isn't prepared to deal with PageAddItem failing, a PANIC
resulted ("failed to add BRIN tuple [to new page]").  Add a macro to
encapsulate the correct computation, and use it in
brin_getinsertbuffer's callers before calling that routine, to raise an
early error.

I became aware of the possiblity of a problem in this area while working
on ccc4c07499.  There's no archived discussion about it, but it's
easy to reproduce a problem in the unpatched code with something like

CREATE TABLE t (a text);
CREATE INDEX ti ON t USING brin (a) WITH (pages_per_range=1);

for length in `seq 8000 8196`
do
	psql -f - <<EOF
TRUNCATE TABLE t;
INSERT INTO t VALUES ('z'), (repeat('a', $length));
EOF
done

Backpatch to 9.5, where BRIN was introduced.
2015-10-27 18:17:55 -03:00
Alvaro Herrera 531d21b75f Cleanup commit timestamp module activaction, again
Further tweak commit_ts.c so that on a standby the state is completely
consistent with what that in the master, rather than behaving
differently in the cases that the settings differ.  Now in standby and
master the module should always be active or inactive in lockstep.

Author: Petr Jelínek, with some further tweaks by Álvaro Herrera.

Backpatch to 9.5, where commit timestamps were introduced.

Discussion: http://www.postgresql.org/message-id/5622BF9D.2010409@2ndquadrant.com
2015-10-27 15:06:50 -03:00
Robert Haas 31ba62ce32 Fix typos in comments.
CharSyam
2015-10-22 14:52:23 -04:00
Robert Haas ee7ca559fc Add a C API for parallel heap scans.
Using this API, one backend can set up a ParallelHeapScanDesc to
which multiple backends can then attach.  Each tuple in the relation
will be returned to exactly one of the scanning backends.  Only
forward scans are supported, and rescans must be carefully
coordinated.

This is not exposed to the planner or executor yet.

The original version of this code was written by me.  Amit Kapila
reviewed it, tested it, and improved it, including adding support for
synchronized scans, per review comments from Jeff Davis.  Extensive
testing of this and related patches was performed by Haribabu Kommi.
Final cleanup of this patch by me.
2015-10-16 17:33:18 -04:00
Robert Haas b0b0d84b3d Allow a parallel context to relaunch workers.
This may allow some callers to avoid the overhead involved in tearing
down a parallel context and then setting up a new one, which means
releasing the DSM and then allocating and populating a new one.  I
suspect we'll want to revise the Gather node to make use of this new
capability, but even if not it may be useful elsewhere and requires
very little additional code.
2015-10-16 17:18:05 -04:00
Robert Haas a53c06a13e Prohibit parallel query when the isolation level is serializable.
In order for this to be safe, the code which hands true serializability
will need to taught that the SIRead locks taken by a parallel worker
pertain to the same transaction as those taken by the parallel leader.
Some further changes may be needed as well.  Until the necessary
adaptations are made, don't generate parallel plans in serializable
mode, and if a previously-generated parallel plan is used after
serializable mode has been activated, run it serially.

This fixes a bug in commit 7aea8e4f2d.
2015-10-16 11:58:27 -04:00
Robert Haas 82b37765c7 Fix a problem with parallel workers being unable to restore role.
check_role() tries to verify that the user has permission to become the
requested role, but this is inappropriate in a parallel worker, which
needs to exactly recreate the master's authorization settings.  So skip
the check in that case.

This fixes a bug in commit 924bcf4f16.
2015-10-16 11:37:19 -04:00
Robert Haas 6de6d96d97 Invalidate caches after cranking up a parallel worker transaction.
Starting a parallel worker transaction changes our notion of which XIDs
are in-progress or committed, and our notion of the current command
counter ID.  Therefore, our view of these caches prior to starting
this transaction may no longer valid.  Defend against that by clearing
them.

This fixes a bug in commit 924bcf4f16.
2015-10-16 11:31:23 -04:00
Robert Haas 94b4f7e2a6 Tighten up application of parallel mode checks.
Commit 924bcf4f16 failed to enforce
parallel mode checks during the commit of a parallel worker, because
we exited parallel mode prior to ending the transaction so that we
could pop the active snapshot.  Re-establish parallel mode during
parallel worker commit.  Without this, it's far too easy for unsafe
actions during the pre-commit sequence to crash the server instead of
hitting the error checks as intended.

Just to be extra paranoid, adjust a couple of the sanity checks in
xact.c to check not only IsInParallelMode() but also
IsParallelWorker().
2015-10-16 09:59:57 -04:00
Robert Haas 423ec0877f Transfer current command counter ID to parallel workers.
Commit 924bcf4f16 correctly forbade
parallel workers to modify the command counter while in parallel mode,
but it inexplicably neglected to actually transfer the current command
counter from leader to workers.  This can result in the workers seeing
a different set of tuples from the leader, which is bad.  Repair.
2015-10-16 09:58:42 -04:00
Robert Haas 2ad5c27bb5 Don't send protocol messages to a shm_mq that no longer exists.
Commit 2bd9e412f9 introduced a mechanism
for relaying protocol messages from a background worker to another
backend via a shm_mq.  However, there was no provision for shutting
down the communication channel.  Therefore, a protocol message sent
late in the shutdown sequence, such as a DEBUG message resulting from
cranking up log_min_messages, could crash the server.  To fix, install
an on_dsm_detach callback that disables sending messages to the shm_mq
when the associated DSM is detached.
2015-10-16 09:42:33 -04:00
Andres Freund 2596d705bd Re-Align *_freeze_max_age reloption limits with corresponding GUC limits.
In 020235a575 I lowered the autovacuum_*freeze_max_age minimums to
allow for easier testing of wraparounds. I did not touch the
corresponding per-table limits. While those don't matter for the purpose
of wraparound, it seems more consistent to lower them as well.

It's noteworthy that the previous reloption lower limit for
autovacuum_multixact_freeze_max_age was too high by one magnitude, even
before 020235a575.

Discussion: 26377.1443105453@sss.pgh.pa.us
Backpatch: back to 9.0 (in parts), like the prior patch
2015-10-05 11:53:43 +02:00
Alvaro Herrera e06b2e1d2e Don't disable commit_ts in standby if enabled locally
Bug noticed by Fujii Masao
2015-10-02 12:49:01 -03:00
Peter Eisentraut 87c2b517ac Fix message punctuation according to style guide 2015-10-01 21:39:56 -04:00
Alvaro Herrera f12e814b88 Fix commit_ts for standby
Module initialization was still not completely correct after commit
6b61955135, per crash report from Takashi Ohnishi.  To fix, instead of
trying to monkey around with the value of the GUC setting directly, add
a separate boolean flag that enables the feature on a standby, but only
for the startup (recovery) process, when it sees that its master server
has the feature enabled.
Discussion: http://www.postgresql.org/message-id/ca44c6c7f9314868bdc521aea4f77cbf@MP-MSGSS-MBX004.msg.nttdata.co.jp

Also change the deactivation routine to delete all segment files rather
than leaving the last one around.  (This doesn't need separate
WAL-logging, because on recovery we execute the same deactivation
routine anyway.)

In passing, clean up the code structure somewhat, particularly so that
xlog.c doesn't know so much about when to activate/deactivate the
feature.

Thanks to Fujii Masao for testing and Petr Jelínek for off-list discussion.

Back-patch to 9.5, where commit_ts was introduced.
2015-10-01 15:06:55 -03:00
Robert Haas 227d57f358 Don't dump core when destroying an unused ParallelContext.
If a transaction or subtransaction creates a ParallelContext but ends
without calling InitializeParallelDSM, the previous code would
seg fault.  Fix that.
2015-09-30 18:36:31 -04:00
Alvaro Herrera 6b61955135 Code review for transaction commit timestamps
There are three main changes here:

1. No longer cause a start failure in a standby if the feature is
disabled in postgresql.conf but enabled in the master.  This reverts one
part of commit 4f3924d9cd43; what we keep is the ability of the standby
to activate/deactivate the module (which includes creating and removing
segments as appropriate) during replay of such actions in the master.

2. Replay WAL records affecting commitTS even if the feature is
disabled.  This means the standby will always have the same state as the
master after replay.

3. Have COMMIT PREPARE record the transaction commit time as well.  We
were previously only applying it in the normal transaction commit path.

Author: Petr Jelínek
Discussion: http://www.postgresql.org/message-id/CAHGQGwHereDzzzmfxEBYcVQu3oZv6vZcgu1TPeERWbDc+gQ06g@mail.gmail.com
Discussion: http://www.postgresql.org/message-id/CAHGQGwFuzfO4JscM9LCAmCDCxp_MfLvN4QdB+xWsS-FijbjTYQ@mail.gmail.com

Additionally, I cleaned up nearby code related to replication origins,
which I found a bit hard to follow, and fixed a couple of typos.

Backpatch to 9.5, where this code was introduced.

Per bug reports from Fujii Masao and subsequent discussion.
2015-09-29 14:40:56 -03:00
Alvaro Herrera 17f5831c81 Fix "sesssion" typo
It was introduced alongside replication origins, by commit
5aa2350426, so backpatch to 9.5.

Pointed out by Fujii Masao
2015-09-28 19:13:42 -03:00
Andres Freund aa29c1ccd9 Remove legacy multixact truncation support.
In 9.5 and master there is no need to support legacy truncation. This is
just committed separately to make it easier to backpatch the WAL logged
multixact truncation to 9.3 and 9.4 if we later decide to do so.

I bumped master's magic from 0xD086 to 0xD088 and 9.5's from 0xD085 to
0xD087 to avoid 9.5 reusing a value that has been in use on master while
keeping the numbers increasing between major versions.

Discussion: 20150621192409.GA4797@alap3.anarazel.de
Backpatch: 9.5
2015-09-26 19:04:25 +02:00
Andres Freund 4f627f8973 Rework the way multixact truncations work.
The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Amongst others it requires to do computations during
recovery while the database is not in a consistent state, delaying
truncations till checkpoints, and handling members being truncated, but
offset not.

We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for multixact truncations.

This allows:
1) to perform the truncation directly during VACUUM, instead of delaying it
   to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery,
   we can just use the master's values.
3) simplify a fair amount of logic to keep in memory limits straight,
   this has gotten much easier

During the course of fixing this a bunch of additional bugs had to be
fixed:
1) Data was not purged from memory the member's SLRU before deleting
   segments. This happened to be hard or impossible to hit due to the
   interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but
   that doesn't work for offsets that haven't yet been flushed to
   disk. Add code to flush the SLRUs to fix. Not pretty, but it feels
   slightly safer to only make decisions based on actual on-disk state.
3) find_multixact_start() could be called concurrently with a truncation
   and thus fail. Via SetOffsetVacuumLimit() that could lead to a round
   of emergency vacuuming. The problem remains in
   pg_get_multixact_members(), but that's quite harmless.

For now this is going to only get applied to 9.5+, leaving the issues in
the older branches in place. It is quite possible that we need to
backpatch at a later point though.

For the case this gets backpatched we need to handle that an updated
standby may be replaying WAL from a not-yet upgraded primary. We have to
recognize that situation and use "old style" truncation (i.e. looking at
the SLRUs) during WAL replay. In contrast to before, this now happens in
the startup process, when replaying a checkpoint record, instead of the
checkpointer. Doing truncation in the restartpoint is incorrect, they
can happen much later than the original checkpoint, thereby leading to
wraparound.  To avoid "multixact_redo: unknown op code 48" errors
standbys would have to be upgraded before primaries.

A later patch will bump the WAL page magic, and remove the legacy
truncation codepaths. Legacy truncation support is just included to make
a possible future backpatch easier.

Discussion: 20150621192409.GA4797@alap3.anarazel.de
Reviewed-By: Robert Haas, Alvaro Herrera, Thomas Munro
Backpatch: 9.5 for now
2015-09-26 19:04:25 +02:00
Teodor Sigaev dc943ad952 Allow autoanalyze to add pages deleted from pending list to FSM
Commit e956808328 introduces adding pages
to FSM for ordinary insert, but autoanalyze was able just cleanup
pending list without adding to FSM.

Also fix double call of IndexFreeSpaceMapVacuum() during ginvacuumcleanup()

Report from Fujii Masao
Patch by me
Review by Jeff Janes
2015-09-23 15:33:51 +03:00
Peter Eisentraut 4a1e15e4a9 Add missing serial comma 2015-09-18 22:41:42 -04:00
Teodor Sigaev 22f519c92a Fix bug introduced by microvacuum for GiST
Commit 013ebc0a7b introduces microvacuum for
GiST, deletetion of tuple marked LP_DEAD uses IndexPageMultiDelete while
recovery code uses IndexPageTupleDelete in loop. This causes a difference
in offset numbers of tuples to delete. Patch introduces usage of
IndexPageMultiDelete in GiST except gistplacetopage() where only one tuple is
deleted at once. That also slightly improve performance, because
IndexPageMultiDelete is more effective.

Patch changes WAL format, so bump wal page magic.

Bug report from Jeff Janes
Diagnostic and patch by Anastasia Lubennikova and me
2015-09-17 14:22:37 +03:00
Fujii Masao 10fbb79f1a Improve log messages related to tablespace_map file
This patch changes the log message which is logged when the server
successfully renames backup_label file to *.old but fails to rename
tablespace_map file during the shutdown. Previously the WARNING
message "online backup mode was not canceled" was logged in that case.
However this message is confusing because the backup mode is treated
as canceled whenever backup_label is successfully renamed. So this
commit makes the server log the message "online backup mode canceled"
in that case.

Also this commit changes errdetail messages so that they follow the
error message style guide.

Back-patch to 9.5 where tablespace_map file is introduced.

Original patch by Amit Kapila, heavily modified by me.
2015-09-15 23:21:51 +09:00
Alvaro Herrera 5cd6538345 Add missing ReleaseBuffer call in BRIN revmap code
I think this particular branch is actually dead, but the analysis to
prove that is not trivial, so instead take the weasel way.

Reported by Jinyu Zhang
Backpatch to 9.5, where BRIN was introduced.
2015-09-11 15:29:46 -03:00
Teodor Sigaev 223936e226 Fix oversight in 013ebc0a7b commit
Declaration of varibale inside ÓÝ×Õ
2015-09-09 19:21:16 +03:00
Teodor Sigaev 013ebc0a7b Microvacuum for GIST
Mark index tuple as dead if it's pointed by kill_prior_tuple during
ordinary (search) scan and remove it during insert process if there is no
enough space for new tuple to insert. This improves select performance
because index will not return tuple marked as dead and improves insert
performance because it reduces number of page split.

Anastasia Lubennikova <a.lubennikova@postgrespro.ru> with
 minor editorialization by me
2015-09-09 18:43:37 +03:00
Fujii Masao 96f6a0cb41 Remove files signaling a standby promotion request at postmaster startup
This commit makes postmaster forcibly remove the files signaling
a standby promotion request. Otherwise, the existence of those files
can trigger a promotion too early, whether a user wants that or not.

This removal of files is usually unnecessary because they can exist
only during a few moments during a standby promotion. However
there is a race condition: if pg_ctl promote is executed and creates
the files during a promotion, the files can stay around even after
the server is brought up to new master. Then, if new standby starts
by using the backup taken from that master, the files can exist
at the server startup and should be removed in order to avoid
an unexpected promotion.

Back-patch to 9.1 where promote signal file was introduced.

Problem reported by Feike Steenbergen.
Original patch by Michael Paquier, modified by me.

Discussion: 20150528100705.4686.91426@wrigleys.postgresql.org
2015-09-09 22:51:44 +09:00
Alvaro Herrera 1aba62ec63 Allow per-tablespace effective_io_concurrency
Per discussion, nowadays it is possible to have tablespaces that have
wildly different I/O characteristics from others.  Setting different
effective_io_concurrency parameters for those has been measured to
improve performance.

Author: Julien Rouhaud
Reviewed by: Andres Freund
2015-09-08 12:51:42 -03:00
Teodor Sigaev e26692248a Make GIN's cleanup pending list process interruptable
Cleanup process could be called by ordinary insert/update and could take a lot
of time. Add vacuum_delay_point() to make this process interruptable. Under
vacuum this call will also throttle a vacuum process to decrease system load,
called from insert/update it will not throttle, and that reduces a latency.

Backpatch for all supported branches.

Jeff Janes <jeff.janes@gmail.com>
2015-09-07 17:16:29 +03:00
Teodor Sigaev e956808328 Add pages deleted from pending list to FSM
Add pages deleted from GIN's pending list during cleanup to free space map
immediately. Clean up process could be initiated by ordinary insert but adding
page to FSM might occur only at vacuum. On some workload like never-vacuumed
insert-only tables it could cause a huge bloat.

Jeff Janes <jeff.janes@gmail.com>
2015-09-07 16:24:01 +03:00
Heikki Linnakangas c80b5f66c6 Fix misc typos.
Oskari Saarenmaa. Backpatch to stable branches where applicable.
2015-09-05 11:35:49 +03:00
Tatsuo Ishii c39f5674df Fix brin index summarizing while vacuuming.
If the number of heap blocks is not multiples of pages per range, the
summarizing produces wrong summary information for the last brin index
tuple while vacuuming.

Problem reported by Tatsuo Ishii and fixed by Amit Langote.

Discussion at "[HACKERS] BRIN INDEX value (message id :20150903.174935.1946402199422994347.t-ishii@sraoss.co.jp)
Backpatched to 9.5 in which brin index was added.
2015-09-05 09:19:25 +09:00
Tom Lane c5454f99c4 Fix subtransaction cleanup after an outer-subtransaction portal fails.
Formerly, we treated only portals created in the current subtransaction as
having failed during subtransaction abort.  However, if the error occurred
while running a portal created in an outer subtransaction (ie, a cursor
declared before the last savepoint), that has to be considered broken too.

To allow reliable detection of which ones those are, add a bookkeeping
field to struct Portal that tracks the innermost subtransaction in which
each portal has actually been executed.  (Without this, we'd end up
failing portals containing functions that had called the subtransaction,
thereby breaking plpgsql exception blocks completely.)

In addition, when we fail an outer-subtransaction Portal, transfer its
resources into the subtransaction's resource owner, so that they're
released early in cleanup of the subxact.  This fixes a problem reported by
Jim Nasby in which a function executed in an outer-subtransaction cursor
could cause an Assert failure or crash by referencing a relation created
within the inner subtransaction.

The proximate cause of the Assert failure is that AtEOSubXact_RelationCache
assumed it could blow away a relcache entry without first checking that the
entry had zero refcount.  That was a bad idea on its own terms, so add such
a check there, and to the similar coding in AtEOXact_RelationCache.  This
provides an independent safety measure in case there are still ways to
provoke the situation despite the Portal-level changes.

This has been broken since subtransactions were invented, so back-patch
to all supported branches.

Tom Lane and Michael Paquier
2015-09-04 13:37:14 -04:00
Fujii Masao 1ea5ce5c5f Document that max_worker_processes must be high enough in standby.
The setting values of some parameters including max_worker_processes
must be equal to or higher than the values on the master. However,
previously max_worker_processes was not listed as such parameter
in the document. So this commit adds it to that list.

Back-patch to 9.4 where max_worker_processes was added.
2015-09-03 22:30:16 +09:00
Teodor Sigaev 30bb26b5e0 Allow usage of huge maintenance_work_mem for GIN build.
Currently, in-memory posting list during GIN build process is limited 1GB
because of using repalloc. The patch replaces call of repalloc to repalloc_huge.
It increases limit of posting list from 180 millions
(1GB / sizeof(ItemPointerData)) to 4 billions limited by maxcount/count fields
in GinEntryAccumulator and subsequent calls. Check added.

Also, fix accounting of allocatedMemory during build to prevent integer
overflow with maintenance_work_mem > 4GB.

Robert Abraham <robert.abraham86@googlemail.com> with additions by me
2015-09-02 20:08:58 +03:00
Alvaro Herrera e68be16b0d Do not allow *timestamp to be passed as NULL
The code had bugs that would cause crashes if NULL was passed as that
argument (originally intended to mean not to bother returning its
value), and after inspection it turns out that nothing seems interested
in the case that *ts is NULL anyway.  Therefore, remove the partial
checks intended to support that case.

Author: Michael Paquier
though I didn't include a proposed Assert.

Backpatch to 9.5.
2015-08-21 14:36:54 -03:00
Andres Freund e95126cf04 Don't use function definitions looking like old-style ones.
This fixes a bunch of somewhat pedantic warnings with new
compilers. Since by far the majority of other functions definitions use
the (void) style it just seems to be consistent to do so as well in the
remaining few places.
2015-08-15 17:25:00 +02:00
Simon Riggs 47167b7907 Reduce lock levels for ALTER TABLE SET autovacuum storage options
Reduce lock levels down to ShareUpdateExclusiveLock for all autovacuum-related
relation options when setting them using ALTER TABLE.

Add infrastructure to allow varying lock levels for relation options in later
patches. Setting multiple options together uses the highest lock level required
for any option. Works for both main and toast tables.

Fabrízio Mello, reviewed by Michael Paquier, mild edit and additional regression
tests from myself
2015-08-14 14:19:28 +01:00
Alvaro Herrera fcbf455842 Fix unitialized variables
As complained by clang, reported by Andres Freund.  Brown paper bag bug
in ccc4c07499.

Add some comments, too.

Backpatch to 9.5, like that one.
2015-08-13 00:12:07 -03:00
Alvaro Herrera ccc4c07499 Close some holes in BRIN page assignment
In some corner cases, it is possible for the BRIN index relation to be
extended by brin_getinsertbuffer but the new page not be used
immediately for anything by its callers; when this happens, the page is
initialized and the FSM is updated (by brin_getinsertbuffer) with the
info about that page, but these actions are not WAL-logged.  A later
index insert/update can use the page, but since the page is already
initialized, the initialization itself is not WAL-logged then either.
Replay of this sequence of events causes recovery to fail altogether.

There is a related corner case within brin_getinsertbuffer itself, in
which we extend the relation to put a new index tuple there, but later
find out that we cannot do so, and do not return the buffer; the page
obtained from extension is not even initialized.  The resulting page is
lost forever.

To fix, shuffle the code so that initialization is not the
responsibility of brin_getinsertbuffer anymore, in normal cases;
instead, the initialization is done by its callers (brin_doinsert and
brin_doupdate) once they're certain that the page is going to be used.
When either those functions determine that the new page cannot be used,
before bailing out they initialize the page as an empty regular page,
enter it in FSM and WAL-log all this.  This way, the page is usable for
future index insertions, and WAL replay doesn't find trying to insert
tuples in pages whose initialization didn't make it to the WAL.  The
same strategy is used in brin_getinsertbuffer when it cannot return the
new page.

Additionally, add a new step to vacuuming so that all pages of the index
are scanned; whenever an uninitialized page is found, it is initialized
as empty and WAL-logged.  This closes the hole that the relation is
extended but the system crashes before anything is WAL-logged about it.
We also take this opportunity to update the FSM, in case it has gotten
out of date.

Thanks to Heikki Linnakangas for finding the problem that kicked some
additional analysis of BRIN page assignment code.

Backpatch to 9.5, where BRIN was introduced.

Discussion: https://www.postgresql.org/message-id/20150723204810.GY5596@postgresql.org
2015-08-12 14:20:38 -03:00
Andres Freund 18e8613564 Address points made in post-commit review of replication origins.
Amit reviewed the replication origins patch and made some good
points. Address them. This fixes typos in error messages, docs and
comments and adds a missing error check (although in a
should-never-happen scenario).

Discussion: CAA4eK1JqUBVeWWKwUmBPryFaje4190ug0y-OAUHWQ6tD83V4xg@mail.gmail.com
Backpatch: 9.5, where replication origins were introduced.
2015-08-07 15:09:05 +02:00
Robert Haas 0e141c0fbb Reduce ProcArrayLock contention by removing backends in batches.
When a write transaction commits, it must clear its XID advertised via
the ProcArray, which requires that we hold ProcArrayLock in exclusive
mode in order to prevent concurrent processes running GetSnapshotData
from seeing inconsistent results.  When many processes try to commit
at once, ProcArrayLock must change hands repeatedly, with each
concurrent process trying to commit waking up to acquire the lock in
turn.  To make things more efficient, when more than one backend is
trying to commit a write transaction at the same time, have just one
of them acquire ProcArrayLock in exclusive mode and clear the XIDs of
all processes in the group.  Benchmarking reveals that this is much
more efficient at very high client counts.

Amit Kapila, heavily revised by me, with some review also from Pavan
Deolasee.
2015-08-06 12:02:12 -04:00
Alvaro Herrera 2834855cb9 Fix BRIN to use SnapshotAny during summarization
For correctness of summarization results, it is critical that the
snapshot used during the summarization scan is able to see all tuples
that are live to all transactions -- including tuples inserted or
deleted by in-progress transactions.  Otherwise, it would be possible
for a transaction to insert a tuple, then idle for a long time while a
concurrent transaction executes summarization of the range: this would
result in the inserted value not being considered in the summary.
Previously we were trying to use a MVCC snapshot in conjunction with
adding a "placeholder" tuple in the index: the snapshot would see all
committed tuples, and the placeholder tuple would catch insertions by
any new inserters.  The hole is that prior insertions by transactions
that are still in progress by the time the MVCC snapshot was taken were
ignored.

Kevin Grittner reported this as a bogus error message during vacuum with
default transaction isolation mode set to repeatable read (because the
error report mentioned a function name not being invoked during), but
the problem is larger than that.

To fix, tweak IndexBuildHeapRangeScan to have a new mode that behaves
the way we need using SnapshotAny visibility rules.  This change
simplifies the BRIN code a bit, mainly by removing large comments that
were mistaken.  Instead, rely on the SnapshotAny semantics to provide
what it needs.  (The business about a placeholder tuple needs to remain:
that covers the case that a transaction inserts a a tuple in a page that
summarization already scanned.)

Discussion: https://www.postgresql.org/message-id/20150731175700.GX2441@postgresql.org

In passing, remove a couple of unused declarations from brin.h and
reword a comment to be proper English.  This part submitted by Kevin
Grittner.

Backpatch to 9.5, where BRIN was introduced.
2015-08-05 16:20:50 -03:00
Fujii Masao dd85acf0c4 Make recovery rename tablespace_map to *.old if backup_label is not present.
If tablespace_map file is present without backup_label file, there is
no use of such file.  There is no harm in retaining it, but it is better
to get rid of the map file so that we don't have any redundant file
in data directory and it will avoid any sort of confusion. It seems
prudent though to just rename the file out of the way rather than
delete it completely, also we ignore any error that occurs in rename
operation as even if map file is present without backup_label file,
it is harmless.

Back-patch to 9.5 where tablespace_map file was introduced.

Amit Kapila, reviewed by Robert Haas, Alvaro Herrera and me.
2015-08-03 23:04:41 +09:00
Tom Lane 09cecdf285 Fix a number of places that produced XX000 errors in the regression tests.
It's against project policy to use elog() for user-facing errors, or to
omit an errcode() selection for errors that aren't supposed to be "can't
happen" cases.  Fix all the violations of this policy that result in
ERRCODE_INTERNAL_ERROR log entries during the standard regression tests,
as errors that can reliably be triggered from SQL surely should be
considered user-facing.

I also looked through all the files touched by this commit and fixed
other nearby problems of the same ilk.  I do not claim to have fixed
all violations of the policy, just the ones in these files.

In a few places I also changed existing ERRCODE choices that didn't
seem particularly appropriate; mainly replacing ERRCODE_SYNTAX_ERROR
by something more specific.

Back-patch to 9.5, but no further; changing ERRCODE assignments in
stable branches doesn't seem like a good idea.
2015-08-02 23:49:19 -04:00
Heikki Linnakangas 358cde320b Fix race condition that lead to WALInsertLock deadlock with commit_delay.
If a call to WaitForXLogInsertionsToFinish() returned a value in the middle
of a page, and another backend then started to insert a record to the same
page, and then you called WaitXLogInsertionsToFinish() again, the second
call might return a smaller value than the first call. The problem was in
GetXLogBuffer(), which always updated the insertingAt value to the
beginning of the requested page, not the actual requested location. Because
of that, the second call might return a xlog pointer to the beginning of
the page, while the first one returned a later position on the same page.
XLogFlush() performs two calls to WaitXLogInsertionsToFinish() in
succession, and holds WALWriteLock on the second call, which can deadlock
if the second call to WaitXLogInsertionsToFinish() blocks.

Reported by Spiros Ioannou. Backpatch to 9.4, where the more scalable
WALInsertLock mechanism, and this bug, was introduced.
2015-08-02 20:08:10 +03:00
Andres Freund 7039760114 Fix issues around the "variable" support in the lwlock infrastructure.
The lwlock scalability work introduced two race conditions into the
lwlock variable support provided for xlog.c. First, and harmlessly on
most platforms, it set/read the variable without the spinlock in some
places. Secondly, due to the removal of the spinlock, it was possible
that a backend missed changes to the variable's state if it changed in
the wrong moment because checking the lock's state, the variable's state
and the queuing are not protected by a single spinlock acquisition
anymore.

To fix first move resetting the variable's from LWLockAcquireWithVar to
WALInsertLockRelease, via a new function LWLockReleaseClearVar. That
prevents issues around waiting for a variable's value to change when a
new locker has acquired the lock, but not yet set the value. Secondly
re-check that the variable hasn't changed after enqueing, that prevents
the issue that the lock has been released and already re-acquired by the
time the woken up backend checks for the lock's state.

Reported-By: Jeff Janes
Analyzed-By: Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: 5592DB35.2060401@iki.fi
Backpatch: 9.5, where the lwlock scalability went in
2015-08-02 18:41:23 +02:00
Alvaro Herrera c81276241b Fix broken assertion in BRIN code
The code was assuming that any NULL value in scan keys was due to IS
NULL or IS NOT NULL, but it turns out to be possible to get them with
other operators too, if they are used in contrived-enough ways.  Easiest
way out of the problem seems to check explicitely for the IS NOT NULL
flag, instead of assuming it must be set if the IS NULL flag is not set,
when a null scan key is found; if neither flag is set, follow the lead
of other index AMs and assume that all indexable operators must be
strict, and thus the query is never satisfiable.

Also, add a comment to try and lure some future hacker into improving
analysis of scan keys in brin.

Per report from Andreas Seltenreich; diagnosis by Tom Lane.
Backpatch to 9.5.

Discussion: http://www.postgresql.org/message-id/20646.1437919632@sss.pgh.pa.us
2015-07-30 15:07:19 -03:00
Joe Conway 7b4bfc87d5 Plug RLS related information leak in pg_stats view.
The pg_stats view is supposed to be restricted to only show rows
about tables the user can read. However, it sometimes can leak
information which could not otherwise be seen when row level security
is enabled. Fix that by not showing pg_stats rows to users that would
be subject to RLS on the table the row is related to. This is done
by creating/using the newly introduced SQL visible function,
row_security_active().

Along the way, clean up three call sites of check_enable_rls(). The second
argument of that function should only be specified as other than
InvalidOid when we are checking as a different user than the current one,
as in when querying through a view. These sites were passing GetUserId()
instead of InvalidOid, which can cause the function to return incorrect
results if the current user has the BYPASSRLS privilege and row_security
has been set to OFF.

Additionally fix a bug causing RI Trigger error messages to unintentionally
leak information when RLS is enabled, and other minor cleanup and
improvements. Also add WITH (security_barrier) to the definition of pg_stats.

Bumped CATVERSION due to new SQL functions and pg_stats view definition.

Back-patch to 9.5 where RLS was introduced. Reported by Yaroslav.
Patch by Joe Conway and Dean Rasheed with review and input by
Michael Paquier and Stephen Frost.
2015-07-28 13:21:22 -07:00
Heikki Linnakangas 5e65f45c6e Another attempt at fixing memory leak in xlogreader.
max_block_id is also reset between reading records.

Michael Paquier
2015-07-28 09:09:36 +03:00
Heikki Linnakangas 820d1ced1b Don't assume that PageIsEmpty() returns true on an all-zeros page.
It does currently, and I don't see us changing that any time soon, but we
don't make that assumption anywhere else.

Per Tom Lane's suggestion. Backpatch to 9.2, like the previous patch that
added this assumption.
2015-07-27 18:54:09 +03:00
Heikki Linnakangas 61a65c53bd Fix memory leak in xlogreader facility.
XLogReaderFree failed to free the per-block data buffers, when they
happened to not be used by the latest read WAL record.

Michael Paquier. Backpatch to 9.5, where the per-block buffers were added.
2015-07-27 18:29:31 +03:00
Heikki Linnakangas 334445179c Reuse all-zero pages in GIN.
In GIN, an all-zeros page would be leaked forever, and never reused. Just
add them to the FSM in vacuum, and they will be reinitialized when grabbed
from the FSM. On master and 9.5, attempting to access the page's opaque
struct also caused an assertion failure, although that was otherwise
harmless.

Reported by Jeff Janes. Backpatch to all supported versions.
2015-07-27 12:30:26 +03:00
Heikki Linnakangas 023430abf7 Fix handling of all-zero pages in SP-GiST vacuum.
SP-GiST initialized an all-zeros page at vacuum, but that was not
WAL-logged, which is not safe. You might get a torn page write, when it gets
flushed to disk, and end-up with a half-initialized index page. To fix,
leave it in the all-zeros state, and add it to the FSM. It will be
initialized when reused. Also don't set the page-deleted flag when recycling
an empty page. That was also not WAL-logged, and a torn write of that would
cause the page to have an invalid checksum.

Backpatch to 9.2, where SP-GiST indexes were added.
2015-07-27 12:28:21 +03:00