Commit Graph

457 Commits

Author SHA1 Message Date
Tom Lane 004a9702e0 Remove AtEOXact_CatCache().
The sole useful effect of this function, to check that no catcache
entries have positive refcounts at transaction end, has really been
obsolete since we introduced ResourceOwners in PG 8.1.  We reduced the
checks to assertions years ago, so that the function was a complete
no-op in production builds.  There have been previous discussions about
removing it entirely, but consensus up to now was that it had some small
value as a cross-check for bugs in the ResourceOwner logic.

However, it now emerges that it's possible to trigger these assertions
if you hit an assert-enabled backend with SIGTERM during a call to
SearchCatCacheList, because that function temporarily increases the
refcounts of entries it's intending to add to a catcache list construct.
In a normal ERROR scenario, the extra refcounts are cleaned up by
SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a
transaction abort and exit without ever executing PG_CATCH handlers.

There's a case to be made that this is a generic hazard and we should
consider restructuring elog(FATAL) handling so that pending PG_CATCH
handlers do get run.  That's pretty scary though: it could easily create
more problems than it solves.  Preliminary stress testing by Andreas
Seltenreich suggests that there are not many live problems of this ilk,
so we rejected that idea.

There are more-localized ways to fix the problem; the most principled
one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY.
But adding cycles to SearchCatCacheList isn't very appealing.  We could
also weaken the assertions in AtEOXact_CatCache in some more or less
ad-hoc way, but that just makes its raison d'etre even less compelling.
In the end, the most reasonable solution seems to be to just remove
AtEOXact_CatCache altogether, on the grounds that it's not worth trying
to fix it.  It hasn't found any bugs for us in many years.

Per report from Jeevan Chalke.  Back-patch to all supported branches.

Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
2017-08-13 16:15:14 -04:00
Peter Eisentraut 7e174fa793 Only kill sync workers at commit time in subscription DDL
This allows a transaction abort to avoid killing those workers.

Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
2017-08-04 21:17:47 -04:00
Tom Lane 382ceffdf7 Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.

By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis.  However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent.  That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.

This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:35:54 -04:00
Tom Lane c7b8998ebb Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.

Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code.  The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there.  BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs.  So the
net result is that in about half the cases, such comments are placed
one tab stop left of before.  This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.

Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:19:25 -04:00
Alvaro Herrera e90ceeaa49 Avoid bogus TwoPhaseState locking sequences
The optimized code in 728bd991c3 contains a few invalid locking
sequences.  To wit, the original code would try to acquire an lwlock
that it already holds.  Avoid this by moving lock acquisitions to
higher-level code, and install appropriate assertions in low-level that
the correct mode is held.

Authors: Michael Paquier, Álvaro Herrera
Reported-By: chuanting wang
Bug: #14680
Discussion: https://postgr.es/m/20170531033228.1487.10124@wrigleys.postgresql.org
2017-06-14 11:29:05 -04:00
Bruce Momjian a6fd7b7a5f Post-PG 10 beta1 pgindent run
perltidy run not included.
2017-05-17 16:31:56 -04:00
Peter Eisentraut 9414e41ea7 Fix logical replication launcher wake up and reset
After the logical replication launcher was told to wake up at
commit (for example, by a CREATE SUBSCRIPTION command), the flag to wake
up was not reset, so it would be woken up at every following commit as
well.  So fix that by resetting the flag.

Also, we don't need to wake up anything if the transaction was rolled
back.  Just reset the flag in that case.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
2017-05-01 10:18:09 -04:00
Simon Riggs 49e9281549 Rework handling of subtransactions in 2PC recovery
The bug fixed by 0874d4f3e1
caused us to question and rework the handling of
subtransactions in 2PC during and at end of recovery.
Patch adds checks and tests to ensure no further bugs.

This effectively removes the temporary measure put in place
by 546c13e11b.

Author: Simon Riggs
Reviewed-by: Tom Lane, Michael Paquier
Discussion: http://postgr.es/m/CANP8+j+vvXmruL_i2buvdhMeVv5TQu0Hm2+C5N+kdVwHJuor8w@mail.gmail.com
2017-04-27 14:41:22 +02:00
Simon Riggs cd0cebaf7d Always SnapshotResetXmin() during ClearTransaction()
Avoid corner cases during 2PC with 6bad580d9e
2017-04-06 10:30:22 -04:00
Simon Riggs 6bad580d9e Avoid SnapshotResetXmin() during AtEOXact_Snapshot()
For normal commits and aborts we already reset PgXact->xmin,
so we can simply avoid running SnapshotResetXmin() twice.

During performance tests by Alexander Korotkov, diagnosis
by Andres Freund showed PgXact array as a bottleneck. After
manual analysis by me of the code paths that touch those
memory locations, I was able to identify extraneous code
in the main transaction commit path.

Avoiding touching highly contented shmem improves concurrent
performance slightly on all workloads, confirmed by tests
run by Ashutosh Sharma and Alexander Korotkov.

Simon Riggs

Discussion: CANP8+jJdXE9b+b9F8CQT-LuxxO0PBCB-SZFfMVAdp+akqo4zfg@mail.gmail.com
2017-04-06 08:31:52 -04:00
Simon Riggs 728bd991c3 Speedup 2PC recovery by skipping two phase state files in normal path
2PC state info held in shmem at PREPARE, then cleaned at COMMIT PREPARED/ABORT PREPARED,
avoiding writing/fsyncing any state information to disk in the normal path, greatly enhancing replay speed.
Prepared transactions that live past one checkpoint redo horizon will be written to disk as now.
Similar conceptually to 978b2f65aa and building upon
the infrastructure created by that commit.

Authors, in equal measure: Stas Kelvich, Nikhil Sontakke and Michael Paquier
Discussion: https://postgr.es/m/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=VjoMnRnW5s+fKST3w@mail.gmail.com
2017-04-04 15:56:56 -04:00
Simon Riggs 3428ef7911 Reverting 42b4b0b241
Buildfarm issues and other reported issues
2017-03-24 17:56:17 +00:00
Simon Riggs 42b4b0b241 Avoid SnapshotResetXmin() during AtEOXact_Snapshot()
For normal commits and aborts we already reset PgXact->xmin
Avoiding touching highly contented shmem improves concurrent
performance.

Simon Riggs

Discussion: CANP8+jJdXE9b+b9F8CQT-LuxxO0PBCB-SZFfMVAdp+akqo4zfg@mail.gmail.com
2017-03-24 14:20:59 +00:00
Simon Riggs 9b013dc238 Improve performance of replay of AccessExclusiveLocks
A hot standby replica keeps a list of Access Exclusive locks for a top
level transaction. These locks are released when the top level transaction
ends. Searching of this list is O(N^2), and each transaction had to pay the
price of searching this list for locks, even if it didn't take any AE
locks itself.

This patch optimizes this case by having the master server track which
transactions took AE locks, and passes that along to the standby server in
the commit/abort record. This allows the standby to only try to release
locks for transactions which actually took any, avoiding the majority of
the performance issue.

Refactor MyXactAccessedTempRel into MyXactFlags to allow minimal additional
cruft with this.

Analysis and initial patch by David Rowley
Author: David Rowley and Simon Riggs
2017-03-22 13:09:36 +00:00
Robert Haas 88e66d193f Rename "pg_clog" directory to "pg_xact".
Names containing the letters "log" sometimes confuse users into
believing that only non-critical data is present.  It is hoped
this renaming will discourage ill-considered removals of transaction
status data.

Michael Paquier

Discussion: http://postgr.es/m/CA+Tgmoa9xFQyjRZupbdEFuwUerFTvC6HjZq1ud6GYragGDFFgA@mail.gmail.com
2017-03-17 09:48:38 -04:00
Heikki Linnakangas 181bdb90ba Fix typos in comments.
Backpatch to all supported versions, where applicable, to make backpatching
of future fixes go more smoothly.

Josh Soref

Discussion: https://www.postgresql.org/message-id/CACZqfqCf+5qRztLPgmmosr-B0Ye4srWzzw_mo4c_8_B_mtjmJQ@mail.gmail.com
2017-02-06 11:33:58 +02:00
Peter Eisentraut 665d1fad99 Logical replication
- Add PUBLICATION catalogs and DDL
- Add SUBSCRIPTION catalog and DDL
- Define logical replication protocol and output plugin
- Add logical replication workers

From: Petr Jelinek <petr@2ndquadrant.com>
Reviewed-by: Steve Singer <steve@ssinger.info>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Erik Rijkers <er@xs4all.nl>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
2017-01-20 09:04:49 -05:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Andres Freund 6ef2eba3f5 Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
    https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
    https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 11:31:50 -08:00
Robert Haas e8ac886c24 Support condition variables.
Condition variables provide a flexible way to sleep until a
cooperating process causes an arbitrary condition to become true.  In
simple cases, this can be accomplished with a WaitLatch/ResetLatch
loop; the cooperating process can call SetLatch after performing work
that might cause the condition to be satisfied, and the waiting
process can recheck the condition each time.  However, if the process
performing the work doesn't have an easy way to identify which
processes might be waiting, this doesn't work, because it can't
identify which latches to set.  Condition variables solve that problem
by internally maintaining a list of waiters; a process that may have
caused some waiter's condition to be satisfied must "signal" or
"broadcast" on the condition variable.

Robert Haas and Thomas Munro
2016-11-22 14:27:11 -05:00
Robert Haas a43f1939d5 Remove or reduce verbosity of some debug messages.
The debug messages that merely print StartTransactionCommand,
CommitTransactionCommand, ProcessUtilty, or ProcessQuery with no
additional details seem to be useless.  Get rid of them.

The transaction status messages produced by ShowTransactionState are
occasionally useful, but they are extremely verbose, producing
multiple lines of log output every time they fire, which can happens
multiple times per transaction.  So, reduce the level to DEBUG5; avoid
emitting an extra line just to explain which debug point is at issue;
and tighten up the rest of the message so it doesn't use quite so much
horizontal space.

With these changes, it's possible to run a somewhat busy system with a
log level even as high as DEBUG4, whereas previously anything above
DEBUG2 would flood the log with output that probably wasn't really all
that useful.
2016-11-17 17:05:16 -05:00
Tom Lane ea268cdc9a Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters.  While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea.  Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.

While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.

In passing, change TopMemoryContext to use the default allocation
parameters.  Formerly it could only be extended 8K at a time.  That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there.  There seems no good reason
not to let it use growing blocks like most other contexts.

Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain.  The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.

Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 17:50:38 -04:00
Robert Haas 4bc424b968 pgindent run for 9.6 2016-06-09 18:02:36 -04:00
Andres Freund c6ff84b06a Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.

Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.

The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.

Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.

XXX:

Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
    CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
    CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-26 20:21:54 -07:00
Robert Haas 314cbfc5da Add new replication mode synchronous_commit = 'remote_apply'.
In this mode, the master waits for the transaction to be applied on
the remote side, not just written to disk.  That means that you can
count on a transaction started on the standby to see all commits
previously acknowledged by the master.

To make this work, the standby sends a reply after replaying each
commit record generated with synchronous_commit >= 'remote_apply'.
This introduces a small inefficiency: the extra replies will be sent
even by standbys that aren't the current synchronous standby.  But
previously-existing synchronous_commit levels make no attempt at all
to optimize which replies are sent based on what the primary cares
about, so this is no worse, and at least avoids any extra replies for
people not using the feature at all.

Thomas Munro, reviewed by Michael Paquier and by me.  Some additional
tweaks by me.
2016-03-29 21:29:49 -04:00
Andres Freund 1a7a43672b Don't use !! but != 0/NULL to force boolean evaluation.
I introduced several uses of !! to force bit arithmetic to be boolean,
but per discussion the project prefers != 0/NULL.

Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
2016-03-27 18:10:19 +02:00
Peter Eisentraut b555ed8102 Merge wal_level "archive" and "hot_standby" into new name "replica"
The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability.  This is now a long time ago.  We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.

Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses.  The old values
are still accepted but are converted internally.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: David Steele <david@pgmasters.net>
2016-03-18 23:56:03 +01:00
Robert Haas 481c76abf4 Fix a typo, and remove unnecessary pgstat_report_wait_end().
Per Amit Kapila.
2016-03-11 07:34:00 -05:00
Robert Haas 53be0b1add Provide much better wait information in pg_stat_activity.
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting.  Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.

Amit Kapila, Ildus Kurbangaliev, reviewed by me.  Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
2016-03-10 12:44:09 -05:00
Robert Haas b6fb6471f6 Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.

As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else.  A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting.  But this gets the basic framework in place.

Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 12:08:58 -05:00
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Robert Haas 94b4f7e2a6 Tighten up application of parallel mode checks.
Commit 924bcf4f16 failed to enforce
parallel mode checks during the commit of a parallel worker, because
we exited parallel mode prior to ending the transaction so that we
could pop the active snapshot.  Re-establish parallel mode during
parallel worker commit.  Without this, it's far too easy for unsafe
actions during the pre-commit sequence to crash the server instead of
hitting the error checks as intended.

Just to be extra paranoid, adjust a couple of the sanity checks in
xact.c to check not only IsInParallelMode() but also
IsParallelWorker().
2015-10-16 09:59:57 -04:00
Robert Haas 423ec0877f Transfer current command counter ID to parallel workers.
Commit 924bcf4f16 correctly forbade
parallel workers to modify the command counter while in parallel mode,
but it inexplicably neglected to actually transfer the current command
counter from leader to workers.  This can result in the workers seeing
a different set of tuples from the leader, which is bad.  Repair.
2015-10-16 09:58:42 -04:00
Alvaro Herrera f12e814b88 Fix commit_ts for standby
Module initialization was still not completely correct after commit
6b61955135, per crash report from Takashi Ohnishi.  To fix, instead of
trying to monkey around with the value of the GUC setting directly, add
a separate boolean flag that enables the feature on a standby, but only
for the startup (recovery) process, when it sees that its master server
has the feature enabled.
Discussion: http://www.postgresql.org/message-id/ca44c6c7f9314868bdc521aea4f77cbf@MP-MSGSS-MBX004.msg.nttdata.co.jp

Also change the deactivation routine to delete all segment files rather
than leaving the last one around.  (This doesn't need separate
WAL-logging, because on recovery we execute the same deactivation
routine anyway.)

In passing, clean up the code structure somewhat, particularly so that
xlog.c doesn't know so much about when to activate/deactivate the
feature.

Thanks to Fujii Masao for testing and Petr Jelínek for off-list discussion.

Back-patch to 9.5, where commit_ts was introduced.
2015-10-01 15:06:55 -03:00
Alvaro Herrera 6b61955135 Code review for transaction commit timestamps
There are three main changes here:

1. No longer cause a start failure in a standby if the feature is
disabled in postgresql.conf but enabled in the master.  This reverts one
part of commit 4f3924d9cd43; what we keep is the ability of the standby
to activate/deactivate the module (which includes creating and removing
segments as appropriate) during replay of such actions in the master.

2. Replay WAL records affecting commitTS even if the feature is
disabled.  This means the standby will always have the same state as the
master after replay.

3. Have COMMIT PREPARE record the transaction commit time as well.  We
were previously only applying it in the normal transaction commit path.

Author: Petr Jelínek
Discussion: http://www.postgresql.org/message-id/CAHGQGwHereDzzzmfxEBYcVQu3oZv6vZcgu1TPeERWbDc+gQ06g@mail.gmail.com
Discussion: http://www.postgresql.org/message-id/CAHGQGwFuzfO4JscM9LCAmCDCxp_MfLvN4QdB+xWsS-FijbjTYQ@mail.gmail.com

Additionally, I cleaned up nearby code related to replication origins,
which I found a bit hard to follow, and fixed a couple of typos.

Backpatch to 9.5, where this code was introduced.

Per bug reports from Fujii Masao and subsequent discussion.
2015-09-29 14:40:56 -03:00
Alvaro Herrera 17f5831c81 Fix "sesssion" typo
It was introduced alongside replication origins, by commit
5aa2350426, so backpatch to 9.5.

Pointed out by Fujii Masao
2015-09-28 19:13:42 -03:00
Heikki Linnakangas c80b5f66c6 Fix misc typos.
Oskari Saarenmaa. Backpatch to stable branches where applicable.
2015-09-05 11:35:49 +03:00
Tom Lane c5454f99c4 Fix subtransaction cleanup after an outer-subtransaction portal fails.
Formerly, we treated only portals created in the current subtransaction as
having failed during subtransaction abort.  However, if the error occurred
while running a portal created in an outer subtransaction (ie, a cursor
declared before the last savepoint), that has to be considered broken too.

To allow reliable detection of which ones those are, add a bookkeeping
field to struct Portal that tracks the innermost subtransaction in which
each portal has actually been executed.  (Without this, we'd end up
failing portals containing functions that had called the subtransaction,
thereby breaking plpgsql exception blocks completely.)

In addition, when we fail an outer-subtransaction Portal, transfer its
resources into the subtransaction's resource owner, so that they're
released early in cleanup of the subxact.  This fixes a problem reported by
Jim Nasby in which a function executed in an outer-subtransaction cursor
could cause an Assert failure or crash by referencing a relation created
within the inner subtransaction.

The proximate cause of the Assert failure is that AtEOSubXact_RelationCache
assumed it could blow away a relcache entry without first checking that the
entry had zero refcount.  That was a bad idea on its own terms, so add such
a check there, and to the similar coding in AtEOXact_RelationCache.  This
provides an independent safety measure in case there are still ways to
provoke the situation despite the Portal-level changes.

This has been broken since subtransactions were invented, so back-patch
to all supported branches.

Tom Lane and Michael Paquier
2015-09-04 13:37:14 -04:00
Bruce Momjian 807b9e0dff pgindent run for 9.5 2015-05-23 21:35:49 -04:00
Heikki Linnakangas fa60fb63e5 Fix more typos in comments.
Patch by CharSyam, plus a few more I spotted with grep.
2015-05-20 19:45:43 +03:00
Heikki Linnakangas 4fc72cc7bb Collection of typo fixes.
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.

Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.

Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
2015-05-20 16:56:22 +03:00
Magnus Hagander 3b075e9d7b Fix typos in comments
Dmitriy Olshevskiy
2015-05-17 14:58:04 +02:00
Robert Haas 924bcf4f16 Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things.  First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers.  Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes.  Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.

Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 15:02:14 -04:00
Andres Freund 5aa2350426 Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
  e.g. to avoid loops in bi-directional replication setups

The solution to these problems, as implemented here, consist out of
three parts:

1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
   replication origin, how far replay has progressed in a efficient and
   crash safe manner.
3) The ability to filter out changes performed on the behest of a
   replication origin during logical decoding; this allows complex
   replication topologies. E.g. by filtering all replayed changes out.

Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated.  We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.

This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL.  Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.

For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
    20140923182422.GA15776@alap3.anarazel.de,
    20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
Andres Freund 4f1b890b13 Merge the various forms of transaction commit & abort records.
Since 465883b0a two versions of commit records have existed. A compact
version that was used when no cache invalidations, smgr unlinks and
similar were needed, and a full version that could deal with all
that. Additionally the full version was embedded into twophase commit
records.

That resulted in a measurable reduction in the size of the logged WAL in
some workloads. But more recently additions like logical decoding, which
e.g. needs information about the database something was executed on,
made it applicable in fewer situations. The static split generally made
it hard to expand the commit record, because concerns over the size made
it hard to add anything to the compact version.

Additionally it's not particularly pretty to have twophase.c insert
RM_XACT records.

Rejigger things so that the commit and abort records only have one form
each, including the twophase equivalents. The presence of the various
optional (in the sense of not being in every record) pieces is indicated
by a bits in the 'xinfo' flag.  That flag previously was not included in
compact commit records. To prevent an increase in size due to its
presence, it's only included if necessary; signalled by a bit in the
xl_info bits available for xact.c, similar to heapam.c's
XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE.

Twophase commit/aborts are now the same as their normal
counterparts. The original transaction's xid is included in an optional
data field.

This means that commit records generally are smaller, except in the case
of a transaction with subtransactions, but no other special cases; the
increase there is four bytes, which seems acceptable given that the more
common case of not having subtransactions shrank.  The savings are
especially measurable for twophase commits, which previously always used
the full version; but will in practice only infrequently have required
that.

The motivation for this work are not the space savings and and
deduplication though; it's that it makes it easier to extend commit
records with additional information. That's just a few lines of code
now; without impacting the common case where that information is not
needed.

Discussion: 20150220152150.GD4149@awork2.anarazel.de,
    235610.92468.qm%40web29004.mail.ird.yahoo.com

Reviewed-By: Heikki Linnakangas, Simon Riggs
2015-03-15 17:37:07 +01:00
Andres Freund fd6a3f3ad4 Reconsider when to wait for WAL flushes/syncrep during commit.
Up to now RecordTransactionCommit() waited for WAL to be flushed (if
synchronous_commit != off) and to be synchronously replicated (if
enabled), even if a transaction did not have a xid assigned. The primary
reason for that is that sequence's nextval() did not assign a xid, but
are worthwhile to wait for on commit.

This can be problematic because sometimes read only transactions do
write WAL, e.g. HOT page prune records. That then could lead to read only
transactions having to wait during commit. Not something people expect
in a read only transaction.

This lead to such strange symptoms as backends being seemingly stuck
during connection establishment when all synchronous replicas are
down. Especially annoying when said stuck connection is the standby
trying to reconnect to allow syncrep again...

This behavior also is involved in a rather complicated <= 9.4 bug where
the transaction started by catchup interrupt processing waited for
syncrep using latches, but didn't get the wakeup because it was already
running inside the same overloaded signal handler. Fix the issue here
doesn't properly solve that issue, merely papers over the problems. In
9.5 catchup interrupts aren't processed out of signal handlers anymore.

To fix all this, make nextval() acquire a top level xid, and only wait for
transaction commit if a transaction both acquired a xid and emitted WAL
records.  If only a xid has been assigned we don't uselessly want to
wait just because of writes to temporary/unlogged tables; if only WAL
has been written we don't want to wait just because of HOT prunes.

The xid assignment in nextval() is unlikely to cause overhead in
real-world workloads. For one it only happens SEQ_LOG_VALS/32 values
anyway, for another only usage of nextval() without using the result in
an insert or similar is affected.

Discussion: 20150223165359.GF30784@awork2.anarazel.de,
    369698E947874884A77849D8FE3680C2@maumau,
    5CF4ABBA67674088B3941894E22A0D25@maumau

Per complaint from maumau and Thom Brown

Backpatch all the way back; 9.0 doesn't have syncrep, but it seems
better to be consistent behavior across all maintained branches.
2015-02-26 12:50:07 +01:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Alvaro Herrera 73c986adde Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData().  This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.

This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.

A new test in src/test/modules is included.

Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.

Authors: Álvaro Herrera and Petr Jelínek

Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 11:53:02 -03:00
Heikki Linnakangas 2c03216d83 Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.

There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.

This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.

For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.

The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.

Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 18:46:41 +02:00
Heikki Linnakangas 2076db2aea Move the backup-block logic from XLogInsert to a new file, xloginsert.c.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.

Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.

Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-11-06 13:55:36 +02:00
Robert Haas 6cb4afff33 Avoid setup work for invalidation messages at start-of-(sub)xact.
Instead of initializing a new TransInvalidationInfo for every
transaction or subtransaction, we can just do it for those
transactions or subtransactions that actually need to queue
invalidation messages.  That also avoids needing to free those
entries at the end of a transaction or subtransaction that does
not generate any invalidation messages, which is by far the
common case.

Patch by me.  Review by Simon Riggs and Andres Freund.
2014-10-29 12:35:19 -04:00
Heikki Linnakangas bb38fb0d43 Fix race condition in preparing a transaction for two-phase commit.
To lock a prepared transaction's shared memory entry, we used to mark it
with the XID of the backend. When the XID was no longer active according
to the proc array, the entry was implicitly considered as not locked
anymore. However, when preparing a transaction, the backend's proc array
entry was cleared before transfering the locks (and some other state) to
the prepared transaction's dummy PGPROC entry, so there was a window where
another backend could finish the transaction before it was in fact fully
prepared.

To fix, rewrite the locking mechanism of global transaction entries. Instead
of an XID, just have simple locked-or-not flag in each entry (we store the
locking backend's backend id rather than a simple boolean, but that's just
for debugging purposes). The backend is responsible for explicitly unlocking
the entry, and to make sure that that happens, install a callback to unlock
it on abort or process exit.

Backpatch to all supported versions.
2014-05-15 16:37:50 +03:00
Bruce Momjian 0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not effect backpatching.
2014-05-06 12:12:18 -04:00
Tom Lane d19bd29f07 Reset pg_stat_activity.xact_start during PREPARE TRANSACTION.
Once we've completed a PREPARE, our session is not running a transaction,
so its entry in pg_stat_activity should show xact_start as null, rather
than leaving the value as the start time of the now-prepared transaction.

I think possibly this oversight was triggered by faulty extrapolation
from the adjacent comment that says PrepareTransaction should not call
AtEOXact_PgStat, so tweak the wording of that comment.

Noted by Andres Freund while considering bug #10123 from Maxim Boguk,
although this error doesn't seem to explain that report.

Back-patch to all active branches.
2014-04-24 13:29:48 -04:00
Robert Haas b89e151054 Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables.  The output format is controlled by a
so-called "output plugin"; an example is included.  To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.

Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.

Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 16:32:18 -05:00
Bruce Momjian 7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Robert Haas e55704d8b2 Add new wal_level, logical, sufficient for logical decoding.
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983.  This makes it possible
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all.  Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.

Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values of stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.

Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.
2013-12-10 19:01:40 -05:00
Tom Lane 16e1b7a1b7 Fix assorted race conditions in the new timeout infrastructure.
Prevent handle_sig_alarm from losing control partway through due to a query
cancel (either an asynchronous SIGINT, or a cancel triggered by one of the
timeout handler functions).  That would at least result in failure to
schedule any required future interrupt, and might result in actual
corruption of timeout.c's data structures, if the interrupt happened while
we were updating those.

We could still lose control if an asynchronous SIGINT arrives just as the
function is entered.  This wouldn't break any data structures, but it would
have the same effect as if the SIGALRM interrupt had been silently lost:
we'd not fire any currently-due handlers, nor schedule any new interrupt.
To forestall that scenario, forcibly reschedule any pending timer interrupt
during AbortTransaction and AbortSubTransaction.  We can avoid any extra
kernel call in most cases by not doing that until we've allowed
LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events.

Another hazard is that some platforms (at least Linux and *BSD) block a
signal before calling its handler and then unblock it on return.  When we
longjmp out of the handler, the unblock doesn't happen, and the signal is
left blocked indefinitely.  Again, we can fix that by forcibly unblocking
signals during AbortTransaction and AbortSubTransaction.

These latter two problems do not manifest when the longjmp reaches
postgres.c, because the error recovery code there kills all pending timeout
events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal
mask is restored.  So errors thrown outside any transaction should be OK
already, and cleaning up in AbortTransaction and AbortSubTransaction should
be enough to fix these issues.  (We're assuming that any code that catches
a query cancel error and doesn't re-throw it will do at least a
subtransaction abort to clean up; but that was pretty much required already
by other subsystems.)

Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when
disabling that event: if a lock timeout interrupt happened after the lock
was granted, the ensuing query cancel is still going to happen at the next
CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user
cancel.

Per reports from Dan Wood.

Back-patch to 9.3 where the new timeout handling infrastructure was
introduced.  We may at some point decide to back-patch the signal
unblocking changes further, but I'll desist from that until we hear
actual field complaints about it.
2013-11-29 16:41:00 -05:00
Bruce Momjian a6542a4b68 Change SET LOCAL/CONSTRAINTS/TRANSACTION and ABORT behavior
Change SET LOCAL/CONSTRAINTS/TRANSACTION behavior outside of a
transaction block from error (post-9.3) to warning.  (Was nothing in <=
9.3.)  Also change ABORT outside of a transaction block from notice to
warning.
2013-11-25 19:19:40 -05:00
Robert Haas 71901ab6da Introduce InvalidCommandId.
This allows a 32-bit field to represent an *optional* command ID
without a separate flag bit.

Andres Freund
2013-09-09 16:25:29 -04:00
Bruce Momjian 9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Tom Lane 2b78d101d1 Fix SQL function execution to be safe with long-lived FmgrInfos.
fmgr_sql had been designed on the assumption that the FmgrInfo it's called
with has only query lifespan.  This is demonstrably unsafe in connection
with range types, as shown in bug #7881 from Andrew Gierth.  Fix things
so that we re-generate the function's cache data if the (sub)transaction
it was made in is no longer active.

Back-patch to 9.2.  This might be needed further back, but it's not clear
whether the case can realistically arise without range types, so for now
I'll desist from back-patching further.
2013-03-03 17:39:58 -05:00
Tom Lane fdaf44862b Invent pre-commit/pre-prepare/pre-subcommit events for xact callbacks.
Currently it's only possible for loadable modules to get control during
post-commit cleanup of a transaction.  That doesn't work too well if they
want to do something that could throw an error; for example, an FDW might
need to issue a remote commit, which could well fail.  To improve matters,
extend the existing APIs for XactCallback and SubXactCallback functions
to provide new pre-commit events for this purpose.

The release notes will need to mention that existing callback functions
should be checked to make sure they don't do something unwanted when one
of the new event types occurs.  In the examples within our source tree,
contrib/sepgsql was fine but plpgsql had been a bit too cute.
2013-02-14 20:35:08 -05:00
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Heikki Linnakangas 7bffc9b7bf Update minimum recovery point on truncation.
If a file is truncated, we must update minRecoveryPoint. Once a file is
truncated, there's no going back; it would not be safe to stop recovery
at a point earlier than that anymore.

Per report from Kyotaro HORIGUCHI. Backpatch to 8.4. Before that,
minRecoveryPoint was not updated during recovery at all.
2012-12-10 16:57:16 +02:00
Simon Riggs f21bb9cfb5 Refactor inCommit flag into generic delayChkpt flag.
Rename PGXACT->inCommit flag into delayChkpt flag,
and generalise comments to allow use in other situations,
such as the forthcoming potential use in checksum patch.
Replace wait loop to look for VXIDs with delayChkpt set.
No user visible changes, not behaviour changes at present.

Simon Riggs, reviewed and rebased by Jeff Davis
2012-12-03 13:13:53 +00:00
Alvaro Herrera 1577b46b7c Split out rmgr rm_desc functions into their own files
This is necessary (but not sufficient) to have them compilable outside
of a backend environment.
2012-11-28 13:01:15 -03:00
Simon Riggs d9fad1076d Skip searching for subxact locks at commit.
At commit all standby locks are released
for the top-level transaction, so searching
for locks for each subtransaction is both
pointless and costly (N^2) in the presence
of many AccessExclusiveLocks.
2012-11-13 16:00:19 -03:00
Tom Lane ff3f9c8de5 Close un-owned SMgrRelations at transaction end.
If an SMgrRelation is not "owned" by a relcache entry, don't allow it to
live past transaction end.  This design allows the same SMgrRelation to be
used for blind writes of multiple blocks during a transaction, but ensures
that we don't hold onto such an SMgrRelation indefinitely.  Because an
SMgrRelation typically corresponds to open file descriptors at the fd.c
level, leaving it open when there's no corresponding relcache entry can
mean that we prevent the kernel from reclaiming deleted disk space.
(While CacheInvalidateSmgr messages usually fix that, there are cases
where they're not issued, such as DROP DATABASE.  We might want to add
some more sinval messaging for that, but I'd be inclined to keep this
type of logic anyway, since allowing VFDs to accumulate indefinitely
for blind-written relations doesn't seem like a good idea.)

This code replaces a previous attempt towards the same goal that proved
to be unreliable.  Back-patch to 9.1 where the previous patch was added.
2012-10-17 12:38:21 -04:00
Alvaro Herrera 45326c5a11 Split resowner.h
This lets files that are mere users of ResourceOwner not automatically
include the headers for stuff that is managed by the resowner mechanism.
2012-08-28 18:02:07 -04:00
Bruce Momjian 41fa3dfb0a Update C comment to NOTICE to reflect previous commit changing the error
level, per report from Tom.
2012-08-15 19:09:37 -04:00
Robert Haas f11e8be3e8 Make commit_delay much smarter.
Instead of letting every backend participating in a group commit wait
independently, have the first one that becomes ready to flush WAL wait
for the configured delay, and let all the others wait just long enough
for that first process to complete its flush.  This greatly increases
the chances of being able to configure a commit_delay setting that
actually improves performance.

As a side consequence of this change, commit_delay now affects all WAL
flushes, rather than just commits.  There was some discussion on
pgsql-hackers about whether to rename the GUC to, say, wal_flush_delay,
but in the absence of consensus I am leaving it alone for now.

Peter Geoghegan, with some changes, mostly to the documentation, by me.
2012-07-02 10:26:31 -04:00
Robert Haas f83b59997d Make walsender more responsive.
Per testing by Andres Freund, this improves replication performance
and reduces replication latency and latency jitter.  I was a bit
concerned about moving more work into XLogInsert, but testing seems
to show that it's not a problem in practice.

Along the way, improve comments for WaitLatchOrSocket.

Andres Freund.  Review and stylistic cleanup by me.
2012-07-02 09:41:01 -04:00
Alvaro Herrera 77ed0c6950 Tighten up includes in sinvaladt.h, twophase.h, proc.h
Remove proc.h from sinvaladt.h and twophase.h; also replace xlog.h in
proc.h with xlogdefs.h.
2012-06-25 18:40:40 -04:00
Heikki Linnakangas 0ab9d1c4b3 Replace XLogRecPtr struct with a 64-bit integer.
This simplifies code that needs to do arithmetic on XLogRecPtrs.

To avoid changing on-disk format of data pages, the LSN on data pages is
still stored in the old format. That should keep pg_upgrade happy. However,
we have XLogRecPtrs embedded in the control file, and in the structs that
are sent over the replication protocol, so this changes breaks compatibility
of pg_basebackup and server. I didn't do anything about this in this patch,
per discussion on -hackers, the right thing to do would to be to change the
replication protocol to be architecture-independent, so that you could use
a newer version of pg_receivexlog, for example, against an older server
version.
2012-06-24 19:19:45 +03:00
Robert Haas cd80073445 During transaction cleanup, release locks before deleting files.
There's no need to hold onto the locks until the files are needed,
and by doing it this way, we reduce the impact on other backends who
may be awaiting locks we hold.

Noah Misch
2012-06-14 10:19:33 -04:00
Bruce Momjian 927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Tom Lane ece01aae47 Scan the buffer pool just once, not once per fork, during relation drop.
This provides a speedup of about 4X when NBuffers is large enough.
There is also a useful reduction in sinval traffic, since we
only do CacheInvalidateSmgr() once not once per fork.

Simon Riggs, reviewed and somewhat revised by Tom Lane
2012-06-07 17:43:11 -04:00
Tom Lane a04dc87db1 Improve comment for GetStableLatestTransactionId(). 2012-05-31 11:20:02 -04:00
Tom Lane 2755abf386 Teach AbortOutOfAnyTransaction to clean up partially-started transactions.
AbortOutOfAnyTransaction failed to do anything if the state it saw on
entry corresponded to failing partway through StartTransaction.  I fixed
AbortCurrentTransaction to cope with that case way back in commit
60b2444cc3, but evidently overlooked that
AbortOutOfAnyTransaction should do likewise.

Back-patch to all supported branches.  It's not clear that this omission
has any more-than-cosmetic consequences, but it's also not clear that it
doesn't, so back-patching seems the least risky choice.
2012-05-28 23:57:06 -04:00
Simon Riggs 867540b49c Ensure backwards compatibility for GetStableLatestTransactionId() 2012-05-12 13:26:10 +01:00
Simon Riggs b06679e012 Ensure age() returns a stable value rather than the latest value 2012-05-11 14:36:24 +01:00
Robert Haas 53c5b869b4 Tighten up error recovery for fast-path locking.
The previous code could cause a backend crash after BEGIN; SAVEPOINT a;
LOCK TABLE foo (interrupted by ^C or statement timeout); ROLLBACK TO
SAVEPOINT a; LOCK TABLE foo, and might have leaked strong-lock counts
in other situations.

Report by Zoltán Böszörményi; patch review by Jeff Davis.
2012-04-18 11:17:30 -04:00
Heikki Linnakangas fe546f3da6 Don't wait for the commit record to be replicated if we wrote no WAL.
When using synchronous replication, we waited for the commit record to be
replicated, but if we our transaction didn't write any other WAL records,
that's not required because we don't even flush the WAL locally to disk in
that case. This lead to long waits when committing a transaction that only
modified a temporary table. Bug spotted by Thom Brown.
2012-04-17 16:28:31 +03:00
Bruce Momjian e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Tom Lane d0024cd188 Avoid crashing when we have problems unlinking files post-commit.
smgrdounlink takes care to not throw an ERROR if it fails to unlink
something, but that caution was rendered useless by commit
3396000684, which put an smgrexists call in
front of it; smgrexists *does* throw error if anything looks funny, such
as getting a permissions error from trying to open the file.  If that
happens post-commit, you get a PANIC, and what's worse the same logic
appears in the WAL replay code, so the database even fails to restart.

Restore the intended behavior by removing the smgrexists call --- it isn't
accomplishing anything that we can't do better by adjusting mdunlink's
ideas of whether it ought to warn about ENOENT or not.

Per report from Joseph Shraibman of unrecoverable crash after trying to
drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions.
Backpatch to 8.4, where the bogus coding was introduced.
2011-12-20 15:00:36 -05:00
Robert Haas ed0b409d22 Move "hot" members of PGPROC into a separate PGXACT array.
This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order.  These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.

Pavan Deolasee, Heikki Linnakangas, Robert Haas
2011-11-25 08:02:10 -05:00
Tom Lane bb446b689b Support synchronization of snapshots through an export/import procedure.
A transaction can export a snapshot with pg_export_snapshot(), and then
others can import it with SET TRANSACTION SNAPSHOT.  The data does not
leave the server so there are not security issues.  A snapshot can only
be imported while the exporting transaction is still running, and there
are some other restrictions.

I'm not totally convinced that we've covered all the bases for SSI (true
serializable) mode, but it works fine for lesser isolation modes.

Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified
by Tom Lane
2011-10-22 18:23:30 -04:00
Tom Lane 57eb009092 Allow snapshot references to still work during transaction abort.
In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do
GetTransactionSnapshot() between AbortTransaction and CleanupTransaction
failed, because GetTransactionSnapshot would recompute the transaction
snapshot (which is already wrong, given the isolation mode) and then
re-register it in the TopTransactionResourceOwner, leading to an Assert
because the TopTransactionResourceOwner should be empty of resources after
AbortTransaction.  This is the root cause of bug #6218 from Yamamoto
Takashi.  While changing plancache.c to avoid requesting a snapshot when
handling a ROLLBACK masks the problem, I think this is really a snapmgr.c
bug: it's lower-level than the resource manager mechanism and should not be
shutting itself down before we unwind resource manager resources.  However,
just postponing the release of the transaction snapshot until cleanup time
didn't work because of the circular dependency with
TopTransactionResourceOwner.  Fix by managing the internal reference to
that snapshot manually instead of depending on TopTransactionResourceOwner.
This saves a few cycles as well as making the module layering more
straightforward.  predicate.c's dependencies on TopTransactionResourceOwner
go away too.

I think this is a longstanding bug, but there's no evidence that it's more
than a latent bug, so it doesn't seem worth any risk of back-patching.
2011-09-26 22:25:28 -04:00
Tom Lane a7801b62f2 Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h.
As per my recent proposal, this refactors things so that these typedefs and
macros are available in a header that can be included in frontend-ish code.
I also changed various headers that were undesirably including
utils/timestamp.h to include datatype/timestamp.h instead.  Unsurprisingly,
this showed that half the system was getting utils/timestamp.h by way of
xlog.h.

No actual code changes here, just header refactoring.
2011-09-09 13:23:41 -04:00
Simon Riggs df383b03e6 Partially revoke attempt to improve performance with many savepoints.
Maintain difference between subtransaction release and commit introduced
by earlier patch.
2011-09-07 12:11:26 +01:00
Tom Lane 1609797c25 Clean up the #include mess a little.
walsender.h should depend on xlog.h, not vice versa.  (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.)  Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h.  Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.

This episode reinforces my feeling that pgrminclude should not be run
without adult supervision.  Inclusion changes in header files in particular
need to be reviewed with great care.  More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
2011-09-04 01:13:16 -04:00
Bruce Momjian 6416a82a62 Remove unnecessary #include references, per pgrminclude script. 2011-09-01 10:04:27 -04:00
Simon Riggs 7cb7122800 Remove O(N^2) performance issue with multiple SAVEPOINTs.
Subtransaction locks now released en masse at main commit, rather than
repeatedly re-scanning for locks as we ascend the nested transaction tree.
Split transaction state TBLOCK_SUBEND into two states, TBLOCK_SUBCOMMIT
and TBLOCK_SUBRELEASE to allow the commit path to be optimised using
the existing code in ResourceOwnerRelease() which appears to have been
intended for this usage, judging from comments therein.
2011-07-19 17:21:24 +01:00
Simon Riggs 465883b0a2 Introduce compact WAL record for the common case of commit (non-DDL).
XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes,
saving considerable space for the vast majority of transaction commits.
XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier.

Leonardo Francalanci and Simon Riggs
2011-06-28 22:58:17 +01:00
Tom Lane d2088ae949 Move RegisterPredicateLockingXid() call to a safer place.
The SSI patch inserted a call of RegisterPredicateLockingXid into
GetNewTransactionId, which was a bad idea on a couple of grounds.  First,
it's not necessary to hold XidGenLock while manipulating that shared
memory, and doing so is bad because XidGenLock is a high-contention lock
that should be held for as short a time as possible.  (Not to mention that
it adds an entirely unnecessary deadlock hazard, since we must take
SerializableXactHashLock as well.)  Second, the specific place where it was
put was between extending CLOG and advancing nextXid, which could result in
unpleasant behavior in case of a failure there.  Pull the call out to
AssignTransactionId, which is much safer and arguably better from a
modularity standpoint too.

There is more work to do to clean up the failure-before-advancing-nextXid
issue, but that is a separate change that will need to be back-patched.
So for the moment I just want to make GetNewTransactionId look the same as
it did in prior versions.
2011-05-06 12:57:28 -04:00
Bruce Momjian bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Simon Riggs 88f32b7ca2 Avoid assuming there will be only 3 states for synchronous_commit.
Also avoid hardcoding the current default state by giving it the name
"on" and replace with a meaningful name that reflects its behaviour.
Coding only, no change in behaviour.
2011-04-04 23:23:13 +01:00
Robert Haas 240067b3b0 Merge synchronous_replication setting into synchronous_commit.
This means one less thing to configure when setting up synchronous
replication, and also avoids some ambiguity around what the behavior
should be when the settings of these variables conflict.

Fujii Masao, with additional hacking by me.
2011-04-04 16:25:52 -04:00
Simon Riggs a8a8a3e096 Efficient transaction-controlled synchronous replication.
If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to takeover in case of standby failure.

This version has very strict behaviour; more relaxed options
may be added at a later date.

Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.
2011-03-06 22:49:16 +00:00