1996-07-09 08:22:35 +02:00
|
|
|
/*-------------------------------------------------------------------------
|
|
|
|
*
|
1999-02-14 00:22:53 +01:00
|
|
|
* vacuum.c
|
2001-07-14 00:55:59 +02:00
|
|
|
* The postgres vacuum cleaner.
|
|
|
|
*
|
2010-02-08 05:33:55 +01:00
|
|
|
* This file now includes only control and dispatch code for VACUUM and
|
|
|
|
* ANALYZE commands. Regular VACUUM is implemented in vacuumlazy.c,
|
|
|
|
* ANALYZE in analyze.c, and VACUUM FULL is a variant of CLUSTER, handled
|
|
|
|
* in cluster.c.
|
2001-07-14 00:55:59 +02:00
|
|
|
*
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
2017-01-03 19:48:53 +01:00
|
|
|
* Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
|
2000-01-26 06:58:53 +01:00
|
|
|
* Portions Copyright (c) 1994, Regents of the University of California
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
|
|
|
*
|
|
|
|
* IDENTIFICATION
|
2010-09-20 22:08:53 +02:00
|
|
|
* src/backend/commands/vacuum.c
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
|
|
|
*-------------------------------------------------------------------------
|
|
|
|
*/
|
2000-12-02 20:38:34 +01:00
|
|
|
#include "postgres.h"
|
|
|
|
|
Fix VACUUM so that it always updates pg_class.reltuples/relpages.
When we added the ability for vacuum to skip heap pages by consulting the
visibility map, we made it just not update the reltuples/relpages
statistics if it skipped any pages. But this could leave us with extremely
out-of-date stats for a table that contains any unchanging areas,
especially for TOAST tables which never get processed by ANALYZE. In
particular this could result in autovacuum making poor decisions about when
to process the table, as in recent report from Florian Helmberger. And in
general it's a bad idea to not update the stats at all. Instead, use the
previous values of reltuples/relpages as an estimate of the tuple density
in unvisited pages. This approach results in a "moving average" estimate
of reltuples, which should converge to the correct value over multiple
VACUUM and ANALYZE cycles even when individual measurements aren't very
good.
This new method for updating reltuples is used by both VACUUM and ANALYZE,
with the result that we no longer need the grotty interconnections that
caused ANALYZE to not update the stats depending on what had happened
in the parent VACUUM command.
Also, fix the logic for skipping all-visible pages during VACUUM so that it
looks ahead rather than behind to decide what to do, as per a suggestion
from Greg Stark. This eliminates useless scanning of all-visible pages at
the start of the relation or just after a not-all-visible page. In
particular, the first few pages of the relation will not be invariably
included in the scanned pages, which seems to help in not overweighting
them in the reltuples estimate.
Back-patch to 8.4, where the visibility map was introduced.
2011-05-30 23:05:26 +02:00
|
|
|
#include <math.h>
|
|
|
|
|
2001-08-26 18:56:03 +02:00
|
|
|
#include "access/clog.h"
|
Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData(). This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.
This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.
A new test in src/test/modules is included.
Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.
Authors: Álvaro Herrera and Petr Jelínek
Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 15:53:02 +01:00
|
|
|
#include "access/commit_ts.h"
|
1998-04-27 06:08:07 +02:00
|
|
|
#include "access/genam.h"
|
|
|
|
#include "access/heapam.h"
|
2012-08-30 22:15:44 +02:00
|
|
|
#include "access/htup_details.h"
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
#include "access/multixact.h"
|
2006-07-13 18:49:20 +02:00
|
|
|
#include "access/transam.h"
|
|
|
|
#include "access/xact.h"
|
2002-04-02 03:03:07 +02:00
|
|
|
#include "catalog/namespace.h"
|
2001-08-26 18:56:03 +02:00
|
|
|
#include "catalog/pg_database.h"
|
2017-03-02 12:48:19 +01:00
|
|
|
#include "catalog/pg_inherits_fn.h"
|
2008-02-20 15:31:35 +01:00
|
|
|
#include "catalog/pg_namespace.h"
|
2010-01-06 06:31:14 +01:00
|
|
|
#include "commands/cluster.h"
|
1998-04-27 06:08:07 +02:00
|
|
|
#include "commands/vacuum.h"
|
1999-07-16 07:00:38 +02:00
|
|
|
#include "miscadmin.h"
|
2008-03-26 22:10:39 +01:00
|
|
|
#include "pgstat.h"
|
2006-01-18 21:35:06 +01:00
|
|
|
#include "postmaster/autovacuum.h"
|
2008-05-12 02:00:54 +02:00
|
|
|
#include "storage/bufmgr.h"
|
|
|
|
#include "storage/lmgr.h"
|
2011-09-04 07:13:16 +02:00
|
|
|
#include "storage/proc.h"
|
2005-05-19 23:35:48 +02:00
|
|
|
#include "storage/procarray.h"
|
1999-11-29 05:43:15 +01:00
|
|
|
#include "utils/acl.h"
|
2000-05-28 19:56:29 +02:00
|
|
|
#include "utils/fmgroids.h"
|
2011-09-04 07:13:16 +02:00
|
|
|
#include "utils/guc.h"
|
2005-05-06 19:24:55 +02:00
|
|
|
#include "utils/memutils.h"
|
2008-03-26 19:48:59 +01:00
|
|
|
#include "utils/snapmgr.h"
|
1998-04-27 06:08:07 +02:00
|
|
|
#include "utils/syscache.h"
|
2008-03-26 22:10:39 +01:00
|
|
|
#include "utils/tqual.h"
|
2001-06-22 21:16:24 +02:00
|
|
|
|
2001-05-07 02:43:27 +02:00
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
/*
|
|
|
|
* GUC parameters
|
|
|
|
*/
|
|
|
|
int vacuum_freeze_min_age;
|
2009-01-16 14:27:24 +01:00
|
|
|
int vacuum_freeze_table_age;
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
int vacuum_multixact_freeze_min_age;
|
|
|
|
int vacuum_multixact_freeze_table_age;
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
|
2001-05-07 02:43:27 +02:00
|
|
|
|
2007-05-30 22:12:03 +02:00
|
|
|
/* A few variables that don't seem worth passing around as parameters */
|
2000-06-28 05:33:33 +02:00
|
|
|
static MemoryContext vac_context = NULL;
|
2007-05-30 22:12:03 +02:00
|
|
|
static BufferAccessStrategy vac_strategy;
|
|
|
|
|
2001-05-07 02:43:27 +02:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/* non-export function prototypes */
|
2010-11-18 09:04:02 +01:00
|
|
|
static List *get_rel_oids(Oid relid, const RangeVar *vacrel);
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
static void vac_truncate_clog(TransactionId frozenXID,
|
|
|
|
MultiXactId minMulti,
|
|
|
|
TransactionId lastSaneFrozenXid,
|
|
|
|
MultiXactId lastSaneMinMulti);
|
2015-03-18 15:52:33 +01:00
|
|
|
static bool vacuum_rel(Oid relid, RangeVar *relation, int options,
|
|
|
|
VacuumParams *params);
|
2000-06-28 05:33:33 +02:00
|
|
|
|
2015-03-18 15:52:33 +01:00
|
|
|
/*
|
|
|
|
* Primary entry point for manual VACUUM and ANALYZE commands
|
|
|
|
*
|
|
|
|
* This is mainly a preparation wrapper for the real operations that will
|
|
|
|
* happen in vacuum().
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ExecVacuum(VacuumStmt *vacstmt, bool isTopLevel)
|
|
|
|
{
|
2015-05-24 03:35:49 +02:00
|
|
|
VacuumParams params;
|
2015-03-18 15:52:33 +01:00
|
|
|
|
|
|
|
/* sanity checks on options */
|
|
|
|
Assert(vacstmt->options & (VACOPT_VACUUM | VACOPT_ANALYZE));
|
|
|
|
Assert((vacstmt->options & VACOPT_VACUUM) ||
|
|
|
|
!(vacstmt->options & (VACOPT_FULL | VACOPT_FREEZE)));
|
|
|
|
Assert((vacstmt->options & VACOPT_ANALYZE) || vacstmt->va_cols == NIL);
|
|
|
|
Assert(!(vacstmt->options & VACOPT_SKIPTOAST));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* All freeze ages are zero if the FREEZE option is given; otherwise pass
|
|
|
|
* them as -1 which means to use the default values.
|
|
|
|
*/
|
|
|
|
if (vacstmt->options & VACOPT_FREEZE)
|
|
|
|
{
|
|
|
|
params.freeze_min_age = 0;
|
|
|
|
params.freeze_table_age = 0;
|
|
|
|
params.multixact_freeze_min_age = 0;
|
|
|
|
params.multixact_freeze_table_age = 0;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
params.freeze_min_age = -1;
|
|
|
|
params.freeze_table_age = -1;
|
|
|
|
params.multixact_freeze_min_age = -1;
|
|
|
|
params.multixact_freeze_table_age = -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* user-invoked vacuum is never "for wraparound" */
|
|
|
|
params.is_wraparound = false;
|
|
|
|
|
2015-04-03 16:55:50 +02:00
|
|
|
/* user-invoked vacuum never uses this parameter */
|
|
|
|
params.log_min_duration = -1;
|
|
|
|
|
2015-03-18 15:52:33 +01:00
|
|
|
/* Now go through the common routine */
|
|
|
|
vacuum(vacstmt->options, vacstmt->relation, InvalidOid, ¶ms,
|
|
|
|
vacstmt->va_cols, NULL, isTopLevel);
|
|
|
|
}
|
2000-04-12 19:17:23 +02:00
|
|
|
|
2001-05-07 02:43:27 +02:00
|
|
|
/*
|
|
|
|
* Primary entry point for VACUUM and ANALYZE commands.
|
2005-07-14 07:13:45 +02:00
|
|
|
*
|
2015-03-18 15:52:33 +01:00
|
|
|
* options is a bitmask of VacuumOption flags, indicating what to do.
|
2005-07-14 07:13:45 +02:00
|
|
|
*
|
2015-03-18 15:52:33 +01:00
|
|
|
* relid, if not InvalidOid, indicate the relation to process; otherwise,
|
|
|
|
* the RangeVar is used. (The latter must always be passed, because it's
|
|
|
|
* used for error messages.)
|
2008-08-13 02:07:50 +02:00
|
|
|
*
|
2015-03-18 15:52:33 +01:00
|
|
|
* params contains a set of parameters that can be used to customize the
|
|
|
|
* behavior.
|
|
|
|
*
|
|
|
|
* va_cols is a list of columns to analyze, or NIL to process them all.
|
2008-03-14 18:25:59 +01:00
|
|
|
*
|
2007-05-30 22:12:03 +02:00
|
|
|
* bstrategy is normally given as NULL, but in autovacuum it can be passed
|
|
|
|
* in to use the same buffer strategy object across multiple vacuum() calls.
|
|
|
|
*
|
2007-03-13 01:33:44 +01:00
|
|
|
* isTopLevel should be passed down from ProcessUtility.
|
|
|
|
*
|
2015-03-18 15:52:33 +01:00
|
|
|
* It is the caller's responsibility that all parameters are allocated in a
|
|
|
|
* memory context that will not disappear at transaction commit.
|
2001-05-07 02:43:27 +02:00
|
|
|
*/
|
1996-07-09 08:22:35 +02:00
|
|
|
void
|
2015-03-18 15:52:33 +01:00
|
|
|
vacuum(int options, RangeVar *relation, Oid relid, VacuumParams *params,
|
|
|
|
List *va_cols, BufferAccessStrategy bstrategy, bool isTopLevel)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2009-11-16 22:32:07 +01:00
|
|
|
const char *stmttype;
|
2011-04-11 21:28:45 +02:00
|
|
|
volatile bool in_outer_xact,
|
2004-05-23 01:14:38 +02:00
|
|
|
use_own_xacts;
|
2004-05-26 06:41:50 +02:00
|
|
|
List *relations;
|
2015-01-08 04:33:58 +01:00
|
|
|
static bool in_vacuum = false;
|
2002-04-02 03:03:07 +02:00
|
|
|
|
2015-03-18 15:52:33 +01:00
|
|
|
Assert(params != NULL);
|
2009-11-16 22:32:07 +01:00
|
|
|
|
2015-03-18 15:52:33 +01:00
|
|
|
stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE";
|
2009-11-16 22:32:07 +01:00
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* We cannot run VACUUM inside a user transaction block; if we were inside
|
|
|
|
* a transaction, then our commit- and start-transaction-command calls
|
2010-02-26 03:01:40 +01:00
|
|
|
* would not have the intended effect! There are numerous other subtle
|
2010-02-08 05:33:55 +01:00
|
|
|
* dependencies on this, too.
|
2004-05-23 01:14:38 +02:00
|
|
|
*
|
|
|
|
* ANALYZE (without VACUUM) can run either way.
|
1997-09-07 07:04:48 +02:00
|
|
|
*/
|
2015-03-18 15:52:33 +01:00
|
|
|
if (options & VACOPT_VACUUM)
|
2004-05-23 01:14:38 +02:00
|
|
|
{
|
2007-03-13 01:33:44 +01:00
|
|
|
PreventTransactionChain(isTopLevel, stmttype);
|
2004-05-23 01:14:38 +02:00
|
|
|
in_outer_xact = false;
|
|
|
|
}
|
|
|
|
else
|
2007-03-13 01:33:44 +01:00
|
|
|
in_outer_xact = IsInTransactionChain(isTopLevel);
|
2002-06-13 21:52:02 +02:00
|
|
|
|
2015-01-08 04:33:58 +01:00
|
|
|
/*
|
|
|
|
* Due to static variables vac_context, anl_context and vac_strategy,
|
|
|
|
* vacuum() is not reentrant. This matters when VACUUM FULL or ANALYZE
|
|
|
|
* calls a hostile index expression that itself calls ANALYZE.
|
|
|
|
*/
|
|
|
|
if (in_vacuum)
|
2015-08-03 05:49:19 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
|
|
|
|
errmsg("%s cannot be executed from VACUUM or ANALYZE",
|
|
|
|
stmttype)));
|
2015-01-08 04:33:58 +01:00
|
|
|
|
2016-06-17 21:48:57 +02:00
|
|
|
/*
|
|
|
|
* Sanity check DISABLE_PAGE_SKIPPING option.
|
|
|
|
*/
|
|
|
|
if ((options & VACOPT_FULL) != 0 &&
|
|
|
|
(options & VACOPT_DISABLE_PAGE_SKIPPING) != 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
|
|
|
|
errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
|
|
|
|
|
2004-10-07 16:19:58 +02:00
|
|
|
/*
|
2006-10-04 02:30:14 +02:00
|
|
|
* Send info about dead objects to the statistics collector, unless we are
|
|
|
|
* in autovacuum --- autovacuum.c does this for itself.
|
2004-10-07 16:19:58 +02:00
|
|
|
*/
|
2015-03-18 15:52:33 +01:00
|
|
|
if ((options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
|
2008-05-15 02:17:41 +02:00
|
|
|
pgstat_vacuum_stat();
|
2001-06-22 21:16:24 +02:00
|
|
|
|
2000-06-28 05:33:33 +02:00
|
|
|
/*
|
|
|
|
* Create special memory context for cross-transaction storage.
|
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* Since it is a child of PortalContext, it will go away eventually even
|
|
|
|
* if we suffer an error; there's no need for special abort cleanup logic.
|
2000-06-28 05:33:33 +02:00
|
|
|
*/
|
2003-05-02 22:54:36 +02:00
|
|
|
vac_context = AllocSetContextCreate(PortalContext,
|
2000-06-28 05:33:33 +02:00
|
|
|
"Vacuum",
|
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
|
|
|
ALLOCSET_DEFAULT_SIZES);
|
2000-03-09 00:41:00 +01:00
|
|
|
|
2007-05-30 22:12:03 +02:00
|
|
|
/*
|
|
|
|
* If caller didn't give us a buffer strategy object, make one in the
|
|
|
|
* cross-transaction memory context.
|
|
|
|
*/
|
|
|
|
if (bstrategy == NULL)
|
|
|
|
{
|
|
|
|
MemoryContext old_context = MemoryContextSwitchTo(vac_context);
|
|
|
|
|
|
|
|
bstrategy = GetAccessStrategy(BAS_VACUUM);
|
|
|
|
MemoryContextSwitchTo(old_context);
|
|
|
|
}
|
|
|
|
vac_strategy = bstrategy;
|
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Build list of relations to process, unless caller gave us one. (If we
|
|
|
|
* build one, we put it in vac_context for safekeeping.)
|
2005-07-14 07:13:45 +02:00
|
|
|
*/
|
2015-03-18 15:52:33 +01:00
|
|
|
relations = get_rel_oids(relid, relation);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2004-05-23 01:14:38 +02:00
|
|
|
/*
|
|
|
|
* Decide whether we need to start/commit our own transactions.
|
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* For VACUUM (with or without ANALYZE): always do so, so that we can
|
|
|
|
* release locks as soon as possible. (We could possibly use the outer
|
2005-10-15 04:49:52 +02:00
|
|
|
* transaction for a one-table VACUUM, but handling TOAST tables would be
|
|
|
|
* problematic.)
|
2004-05-23 01:14:38 +02:00
|
|
|
*
|
|
|
|
* For ANALYZE (no VACUUM): if inside a transaction block, we cannot
|
2005-10-15 04:49:52 +02:00
|
|
|
* start/commit our own transactions. Also, there's no need to do so if
|
|
|
|
* only processing one relation. For multiple relations when not within a
|
2007-06-14 15:53:14 +02:00
|
|
|
* transaction block, and also in an autovacuum worker, use own
|
|
|
|
* transactions so we can release locks sooner.
|
2004-05-23 01:14:38 +02:00
|
|
|
*/
|
2015-03-18 15:52:33 +01:00
|
|
|
if (options & VACOPT_VACUUM)
|
2004-05-23 01:14:38 +02:00
|
|
|
use_own_xacts = true;
|
|
|
|
else
|
|
|
|
{
|
2015-03-18 15:52:33 +01:00
|
|
|
Assert(options & VACOPT_ANALYZE);
|
2007-06-14 15:53:14 +02:00
|
|
|
if (IsAutoVacuumWorkerProcess())
|
|
|
|
use_own_xacts = true;
|
|
|
|
else if (in_outer_xact)
|
2004-05-23 01:14:38 +02:00
|
|
|
use_own_xacts = false;
|
2004-05-26 06:41:50 +02:00
|
|
|
else if (list_length(relations) > 1)
|
2004-05-23 01:14:38 +02:00
|
|
|
use_own_xacts = true;
|
|
|
|
else
|
|
|
|
use_own_xacts = false;
|
|
|
|
}
|
|
|
|
|
2002-06-15 23:52:31 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* vacuum_rel expects to be entered with no transaction active; it will
|
|
|
|
* start and commit its own transaction. But we are called by an SQL
|
|
|
|
* command, and so we are executing inside a transaction already. We
|
|
|
|
* commit the transaction started in PostgresMain() here, and start
|
|
|
|
* another one before exiting to match the commit waiting for us back in
|
|
|
|
* PostgresMain().
|
2000-03-09 00:41:00 +01:00
|
|
|
*/
|
2004-05-23 01:14:38 +02:00
|
|
|
if (use_own_xacts)
|
2002-06-13 21:52:02 +02:00
|
|
|
{
|
2014-10-30 18:03:22 +01:00
|
|
|
Assert(!in_outer_xact);
|
|
|
|
|
2008-05-12 22:02:02 +02:00
|
|
|
/* ActiveSnapshot is not set by autovacuum */
|
|
|
|
if (ActiveSnapshotSet())
|
|
|
|
PopActiveSnapshot();
|
|
|
|
|
2002-06-13 21:52:02 +02:00
|
|
|
/* matches the StartTransaction in PostgresMain() */
|
2003-05-14 05:26:03 +02:00
|
|
|
CommitTransactionCommand();
|
2002-06-13 21:52:02 +02:00
|
|
|
}
|
2000-03-09 00:41:00 +01:00
|
|
|
|
2004-07-31 02:45:57 +02:00
|
|
|
/* Turn vacuum cost accounting on or off */
|
|
|
|
PG_TRY();
|
2001-05-07 02:43:27 +02:00
|
|
|
{
|
2004-07-31 02:45:57 +02:00
|
|
|
ListCell *cur;
|
2002-04-02 03:03:07 +02:00
|
|
|
|
2015-01-08 04:33:58 +01:00
|
|
|
in_vacuum = true;
|
2004-08-06 06:15:09 +02:00
|
|
|
VacuumCostActive = (VacuumCostDelay > 0);
|
2004-07-31 02:45:57 +02:00
|
|
|
VacuumCostBalance = 0;
|
2011-11-25 16:10:46 +01:00
|
|
|
VacuumPageHit = 0;
|
|
|
|
VacuumPageMiss = 0;
|
|
|
|
VacuumPageDirty = 0;
|
2004-07-31 02:45:57 +02:00
|
|
|
|
2004-10-07 16:19:58 +02:00
|
|
|
/*
|
|
|
|
* Loop to process each selected relation.
|
|
|
|
*/
|
2004-07-31 02:45:57 +02:00
|
|
|
foreach(cur, relations)
|
2002-06-13 21:52:02 +02:00
|
|
|
{
|
2004-07-31 02:45:57 +02:00
|
|
|
Oid relid = lfirst_oid(cur);
|
|
|
|
|
2015-03-18 15:52:33 +01:00
|
|
|
if (options & VACOPT_VACUUM)
|
2011-02-08 04:04:29 +01:00
|
|
|
{
|
2015-03-18 15:52:33 +01:00
|
|
|
if (!vacuum_rel(relid, relation, options, params))
|
2011-02-08 04:04:29 +01:00
|
|
|
continue;
|
|
|
|
}
|
2006-07-10 18:20:52 +02:00
|
|
|
|
2015-03-18 15:52:33 +01:00
|
|
|
if (options & VACOPT_ANALYZE)
|
2002-06-13 21:52:02 +02:00
|
|
|
{
|
2004-07-31 02:45:57 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* If using separate xacts, start one for analyze. Otherwise,
|
2009-12-29 21:11:45 +01:00
|
|
|
* we can use the outer transaction.
|
2004-07-31 02:45:57 +02:00
|
|
|
*/
|
|
|
|
if (use_own_xacts)
|
|
|
|
{
|
|
|
|
StartTransactionCommand();
|
2004-09-13 22:10:13 +02:00
|
|
|
/* functions in indexes may want a snapshot set */
|
2008-05-12 22:02:02 +02:00
|
|
|
PushActiveSnapshot(GetTransactionSnapshot());
|
2004-07-31 02:45:57 +02:00
|
|
|
}
|
|
|
|
|
2015-04-03 16:55:50 +02:00
|
|
|
analyze_rel(relid, relation, options, params,
|
2015-03-18 15:52:33 +01:00
|
|
|
va_cols, in_outer_xact, vac_strategy);
|
2004-07-31 02:45:57 +02:00
|
|
|
|
|
|
|
if (use_own_xacts)
|
2008-05-12 22:02:02 +02:00
|
|
|
{
|
|
|
|
PopActiveSnapshot();
|
2004-07-31 02:45:57 +02:00
|
|
|
CommitTransactionCommand();
|
2008-05-12 22:02:02 +02:00
|
|
|
}
|
2002-06-13 21:52:02 +02:00
|
|
|
}
|
|
|
|
}
|
2001-05-07 02:43:27 +02:00
|
|
|
}
|
2004-07-31 02:45:57 +02:00
|
|
|
PG_CATCH();
|
|
|
|
{
|
2015-01-08 04:33:58 +01:00
|
|
|
in_vacuum = false;
|
2004-07-31 02:45:57 +02:00
|
|
|
VacuumCostActive = false;
|
|
|
|
PG_RE_THROW();
|
|
|
|
}
|
|
|
|
PG_END_TRY();
|
|
|
|
|
2015-01-08 04:33:58 +01:00
|
|
|
in_vacuum = false;
|
2004-07-31 02:45:57 +02:00
|
|
|
VacuumCostActive = false;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2004-10-07 16:19:58 +02:00
|
|
|
/*
|
|
|
|
* Finish up processing.
|
|
|
|
*/
|
2004-05-23 01:14:38 +02:00
|
|
|
if (use_own_xacts)
|
2001-08-26 18:56:03 +02:00
|
|
|
{
|
2002-06-15 23:52:31 +02:00
|
|
|
/* here, we are not in a transaction */
|
1999-05-01 21:09:46 +02:00
|
|
|
|
2002-08-31 00:18:07 +02:00
|
|
|
/*
|
2002-09-04 22:31:48 +02:00
|
|
|
* This matches the CommitTransaction waiting for us in
|
2003-05-14 05:26:03 +02:00
|
|
|
* PostgresMain().
|
2002-08-31 00:18:07 +02:00
|
|
|
*/
|
2003-05-14 05:26:03 +02:00
|
|
|
StartTransactionCommand();
|
2004-05-23 01:14:38 +02:00
|
|
|
}
|
2000-06-28 05:33:33 +02:00
|
|
|
|
2015-03-18 15:52:33 +01:00
|
|
|
if ((options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
|
2004-05-23 01:14:38 +02:00
|
|
|
{
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
/*
|
2017-03-17 14:46:58 +01:00
|
|
|
* Update pg_database.datfrozenxid, and truncate pg_xact if possible.
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* (autovacuum.c does this for itself.)
|
|
|
|
*/
|
|
|
|
vac_update_datfrozenxid();
|
2001-08-26 18:56:03 +02:00
|
|
|
}
|
|
|
|
|
2000-06-28 05:33:33 +02:00
|
|
|
/*
|
|
|
|
* Clean up working storage --- note we must do this after
|
2005-10-15 04:49:52 +02:00
|
|
|
* StartTransactionCommand, else we might be trying to delete the active
|
|
|
|
* context!
|
2000-06-28 05:33:33 +02:00
|
|
|
*/
|
|
|
|
MemoryContextDelete(vac_context);
|
|
|
|
vac_context = NULL;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2002-04-02 03:03:07 +02:00
|
|
|
* Build a list of Oids for each relation to be processed
|
2001-07-12 06:11:13 +02:00
|
|
|
*
|
|
|
|
* The list is built in vac_context so that it will survive across our
|
|
|
|
* per-relation transactions.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2002-04-02 03:03:07 +02:00
|
|
|
static List *
|
2010-11-18 09:04:02 +01:00
|
|
|
get_rel_oids(Oid relid, const RangeVar *vacrel)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2004-01-06 19:07:32 +01:00
|
|
|
List *oid_list = NIL;
|
2002-04-02 03:03:07 +02:00
|
|
|
MemoryContext oldcontext;
|
|
|
|
|
2008-06-05 17:47:32 +02:00
|
|
|
/* OID supplied by VACUUM's caller? */
|
|
|
|
if (OidIsValid(relid))
|
|
|
|
{
|
|
|
|
oldcontext = MemoryContextSwitchTo(vac_context);
|
|
|
|
oid_list = lappend_oid(oid_list, relid);
|
|
|
|
MemoryContextSwitchTo(oldcontext);
|
|
|
|
}
|
|
|
|
else if (vacrel)
|
1997-09-07 07:04:48 +02:00
|
|
|
{
|
2004-01-06 19:07:32 +01:00
|
|
|
/* Process a specific relation */
|
2002-09-04 22:31:48 +02:00
|
|
|
Oid relid;
|
2017-03-02 12:48:19 +01:00
|
|
|
HeapTuple tuple;
|
|
|
|
Form_pg_class classForm;
|
|
|
|
bool include_parts;
|
2002-04-02 03:03:07 +02:00
|
|
|
|
2011-07-09 04:19:30 +02:00
|
|
|
/*
|
2012-06-10 21:20:04 +02:00
|
|
|
* Since we don't take a lock here, the relation might be gone, or the
|
|
|
|
* RangeVar might no longer refer to the OID we look up here. In the
|
|
|
|
* former case, VACUUM will do nothing; in the latter case, it will
|
2013-05-29 22:58:43 +02:00
|
|
|
* process the OID we looked up here, rather than the new one. Neither
|
|
|
|
* is ideal, but there's little practical alternative, since we're
|
|
|
|
* going to commit this transaction and begin a new one between now
|
|
|
|
* and then.
|
2011-07-09 04:19:30 +02:00
|
|
|
*/
|
Improve table locking behavior in the face of current DDL.
In the previous coding, callers were faced with an awkward choice:
look up the name, do permissions checks, and then lock the table; or
look up the name, lock the table, and then do permissions checks.
The first choice was wrong because the results of the name lookup
and permissions checks might be out-of-date by the time the table
lock was acquired, while the second allowed a user with no privileges
to interfere with access to a table by users who do have privileges
(e.g. if a malicious backend queues up for an AccessExclusiveLock on
a table on which AccessShareLock is already held, further attempts
to access the table will be blocked until the AccessExclusiveLock
is obtained and the malicious backend's transaction rolls back).
To fix, allow callers of RangeVarGetRelid() to pass a callback which
gets executed after performing the name lookup but before acquiring
the relation lock. If the name lookup is retried (because
invalidation messages are received), the callback will be re-executed
as well, so we get the best of both worlds. RangeVarGetRelid() is
renamed to RangeVarGetRelidExtended(); callers not wishing to supply
a callback can continue to invoke it as RangeVarGetRelid(), which is
now a macro. Since the only one caller that uses nowait = true now
passes a callback anyway, the RangeVarGetRelid() macro defaults nowait
as well. The callback can also be used for supplemental locking - for
example, REINDEX INDEX needs to acquire the table lock before the index
lock to reduce deadlock possibilities.
There's a lot more work to be done here to fix all the cases where this
can be a problem, but this commit provides the general infrastructure
and fixes the following specific cases: REINDEX INDEX, REINDEX TABLE,
LOCK TABLE, and and DROP TABLE/INDEX/SEQUENCE/VIEW/FOREIGN TABLE.
Per discussion with Noah Misch and Alvaro Herrera.
2011-11-30 16:12:27 +01:00
|
|
|
relid = RangeVarGetRelid(vacrel, NoLock, false);
|
2002-04-02 03:03:07 +02:00
|
|
|
|
2017-03-02 12:48:19 +01:00
|
|
|
/*
|
|
|
|
* To check whether the relation is a partitioned table, fetch its
|
|
|
|
* syscache entry.
|
|
|
|
*/
|
|
|
|
tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
|
|
|
|
if (!HeapTupleIsValid(tuple))
|
|
|
|
elog(ERROR, "cache lookup failed for relation %u", relid);
|
|
|
|
classForm = (Form_pg_class) GETSTRUCT(tuple);
|
|
|
|
include_parts = (classForm->relkind == RELKIND_PARTITIONED_TABLE);
|
|
|
|
ReleaseSysCache(tuple);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make relation list entries for this guy and its partitions, if any.
|
|
|
|
* Note that the list returned by find_all_inheritors() include the
|
|
|
|
* passed-in OID at its head. Also note that we did not request a
|
|
|
|
* lock to be taken to match what would be done otherwise.
|
|
|
|
*/
|
2002-04-02 03:03:07 +02:00
|
|
|
oldcontext = MemoryContextSwitchTo(vac_context);
|
2017-03-02 12:48:19 +01:00
|
|
|
if (include_parts)
|
|
|
|
oid_list = list_concat(oid_list,
|
|
|
|
find_all_inheritors(relid, NoLock, NULL));
|
|
|
|
else
|
|
|
|
oid_list = lappend_oid(oid_list, relid);
|
2002-04-02 03:03:07 +02:00
|
|
|
MemoryContextSwitchTo(oldcontext);
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2013-03-04 01:23:31 +01:00
|
|
|
/*
|
|
|
|
* Process all plain relations and materialized views listed in
|
|
|
|
* pg_class
|
|
|
|
*/
|
2002-04-02 03:03:07 +02:00
|
|
|
Relation pgclass;
|
|
|
|
HeapScanDesc scan;
|
|
|
|
HeapTuple tuple;
|
1996-10-03 22:11:41 +02:00
|
|
|
|
2005-04-14 22:03:27 +02:00
|
|
|
pgclass = heap_open(RelationRelationId, AccessShareLock);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
2013-07-02 15:47:01 +02:00
|
|
|
scan = heap_beginscan_catalog(pgclass, 0, NULL);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2002-05-21 01:51:44 +02:00
|
|
|
while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
|
1997-09-07 07:04:48 +02:00
|
|
|
{
|
2013-03-04 01:23:31 +01:00
|
|
|
Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
|
|
|
|
|
2017-03-02 12:48:19 +01:00
|
|
|
/*
|
|
|
|
* We include partitioned tables here; depending on which
|
|
|
|
* operation is to be performed, caller will decide whether to
|
|
|
|
* process or ignore them.
|
|
|
|
*/
|
2013-03-04 01:23:31 +01:00
|
|
|
if (classForm->relkind != RELKIND_RELATION &&
|
2017-03-02 12:48:19 +01:00
|
|
|
classForm->relkind != RELKIND_MATVIEW &&
|
|
|
|
classForm->relkind != RELKIND_PARTITIONED_TABLE)
|
2013-03-04 01:23:31 +01:00
|
|
|
continue;
|
|
|
|
|
2002-04-02 03:03:07 +02:00
|
|
|
/* Make a relation list entry for this guy */
|
|
|
|
oldcontext = MemoryContextSwitchTo(vac_context);
|
2004-05-26 06:41:50 +02:00
|
|
|
oid_list = lappend_oid(oid_list, HeapTupleGetOid(tuple));
|
2002-04-02 03:03:07 +02:00
|
|
|
MemoryContextSwitchTo(oldcontext);
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2002-04-02 03:03:07 +02:00
|
|
|
heap_endscan(scan);
|
|
|
|
heap_close(pgclass, AccessShareLock);
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
|
|
|
|
2004-01-06 19:07:32 +01:00
|
|
|
return oid_list;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
2001-08-26 18:56:03 +02:00
|
|
|
/*
|
|
|
|
* vacuum_set_xid_limits() -- compute oldest-Xmin and freeze cutoff points
|
2013-11-28 20:52:54 +01:00
|
|
|
*
|
|
|
|
* The output parameters are:
|
|
|
|
* - oldestXmin is the cutoff value used to distinguish whether tuples are
|
2014-05-06 18:12:18 +02:00
|
|
|
* DEAD or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
|
2013-11-28 20:52:54 +01:00
|
|
|
* - freezeLimit is the Xid below which all Xids are replaced by
|
2014-05-06 18:12:18 +02:00
|
|
|
* FrozenTransactionId during vacuum.
|
2013-12-13 18:58:48 +01:00
|
|
|
* - xidFullScanLimit (computed from table_freeze_age parameter)
|
2014-05-06 18:12:18 +02:00
|
|
|
* represents a minimum Xid value; a table whose relfrozenxid is older than
|
|
|
|
* this will have a full-table vacuum applied to it, to freeze tuples across
|
|
|
|
* the whole table. Vacuuming a table younger than this value can use a
|
|
|
|
* partial scan.
|
2013-11-28 20:52:54 +01:00
|
|
|
* - multiXactCutoff is the value below which all MultiXactIds are removed from
|
2014-05-06 18:12:18 +02:00
|
|
|
* Xmax.
|
2013-11-28 20:52:54 +01:00
|
|
|
* - mxactFullScanLimit is a value against which a table's relminmxid value is
|
2014-05-06 18:12:18 +02:00
|
|
|
* compared to produce a full-table vacuum, as with xidFullScanLimit.
|
2013-11-28 20:52:54 +01:00
|
|
|
*
|
|
|
|
* xidFullScanLimit and mxactFullScanLimit can be passed as NULL if caller is
|
|
|
|
* not interested.
|
2001-08-26 18:56:03 +02:00
|
|
|
*/
|
|
|
|
void
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
vacuum_set_xid_limits(Relation rel,
|
|
|
|
int freeze_min_age,
|
2009-01-16 14:27:24 +01:00
|
|
|
int freeze_table_age,
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
int multixact_freeze_min_age,
|
|
|
|
int multixact_freeze_table_age,
|
2001-08-26 18:56:03 +02:00
|
|
|
TransactionId *oldestXmin,
|
2009-01-16 14:27:24 +01:00
|
|
|
TransactionId *freezeLimit,
|
2013-11-28 20:52:54 +01:00
|
|
|
TransactionId *xidFullScanLimit,
|
|
|
|
MultiXactId *multiXactCutoff,
|
|
|
|
MultiXactId *mxactFullScanLimit)
|
2001-08-26 18:56:03 +02:00
|
|
|
{
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
int freezemin;
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
int mxid_freezemin;
|
2015-05-08 18:09:14 +02:00
|
|
|
int effective_multixact_freeze_max_age;
|
2001-08-26 18:56:03 +02:00
|
|
|
TransactionId limit;
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
TransactionId safeLimit;
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
MultiXactId mxactLimit;
|
|
|
|
MultiXactId safeMxactLimit;
|
2001-08-26 18:56:03 +02:00
|
|
|
|
2006-07-30 04:07:18 +02:00
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* We can always ignore processes running lazy vacuum. This is because we
|
2006-07-30 04:07:18 +02:00
|
|
|
* use these values only for deciding which tuples we must keep in the
|
2014-05-06 18:12:18 +02:00
|
|
|
* tables. Since lazy vacuum doesn't write its XID anywhere, it's safe to
|
2010-02-08 05:33:55 +01:00
|
|
|
* ignore it. In theory it could be problematic to ignore lazy vacuums in
|
2007-11-15 22:14:46 +01:00
|
|
|
* a full vacuum, but keep in mind that only one vacuum process can be
|
|
|
|
* working on a particular table at any time, and that each vacuum is
|
|
|
|
* always an independent transaction.
|
2006-07-30 04:07:18 +02:00
|
|
|
*/
|
2016-04-08 21:36:30 +02:00
|
|
|
*oldestXmin =
|
2017-03-22 17:51:01 +01:00
|
|
|
TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
|
2001-08-26 18:56:03 +02:00
|
|
|
|
|
|
|
Assert(TransactionIdIsNormal(*oldestXmin));
|
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* Determine the minimum freeze age to use: as specified by the caller, or
|
|
|
|
* vacuum_freeze_min_age, but in any case not more than half
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* autovacuum_freeze_max_age, so that autovacuums to prevent XID
|
|
|
|
* wraparound won't occur too frequently.
|
|
|
|
*/
|
2007-05-17 17:28:29 +02:00
|
|
|
freezemin = freeze_min_age;
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
if (freezemin < 0)
|
|
|
|
freezemin = vacuum_freeze_min_age;
|
|
|
|
freezemin = Min(freezemin, autovacuum_freeze_max_age / 2);
|
|
|
|
Assert(freezemin >= 0);
|
2001-08-26 18:56:03 +02:00
|
|
|
|
2004-10-07 16:19:58 +02:00
|
|
|
/*
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* Compute the cutoff XID, being careful not to generate a "permanent" XID
|
2004-10-07 16:19:58 +02:00
|
|
|
*/
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
limit = *oldestXmin - freezemin;
|
2001-08-26 18:56:03 +02:00
|
|
|
if (!TransactionIdIsNormal(limit))
|
|
|
|
limit = FirstNormalTransactionId;
|
|
|
|
|
2004-10-07 16:19:58 +02:00
|
|
|
/*
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* If oldestXmin is very far back (in practice, more than
|
2007-11-15 22:14:46 +01:00
|
|
|
* autovacuum_freeze_max_age / 2 XIDs old), complain and force a minimum
|
|
|
|
* freeze age of zero.
|
2004-10-07 16:19:58 +02:00
|
|
|
*/
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
safeLimit = ReadNewTransactionId() - autovacuum_freeze_max_age;
|
|
|
|
if (!TransactionIdIsNormal(safeLimit))
|
|
|
|
safeLimit = FirstNormalTransactionId;
|
|
|
|
|
|
|
|
if (TransactionIdPrecedes(limit, safeLimit))
|
2001-08-26 18:56:03 +02:00
|
|
|
{
|
2003-07-20 23:56:35 +02:00
|
|
|
ereport(WARNING,
|
2003-09-25 08:58:07 +02:00
|
|
|
(errmsg("oldest xmin is far in the past"),
|
2003-07-20 23:56:35 +02:00
|
|
|
errhint("Close open transactions soon to avoid wraparound problems.")));
|
2001-08-26 18:56:03 +02:00
|
|
|
limit = *oldestXmin;
|
|
|
|
}
|
|
|
|
|
|
|
|
*freezeLimit = limit;
|
2009-01-16 14:27:24 +01:00
|
|
|
|
2015-05-08 18:09:14 +02:00
|
|
|
/*
|
|
|
|
* Compute the multixact age for which freezing is urgent. This is
|
2015-05-24 03:35:49 +02:00
|
|
|
* normally autovacuum_multixact_freeze_max_age, but may be less if we are
|
|
|
|
* short of multixact member space.
|
2015-05-08 18:09:14 +02:00
|
|
|
*/
|
|
|
|
effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
|
|
|
|
|
2013-11-28 20:52:54 +01:00
|
|
|
/*
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
* Determine the minimum multixact freeze age to use: as specified by
|
|
|
|
* caller, or vacuum_multixact_freeze_min_age, but in any case not more
|
2015-05-08 18:09:14 +02:00
|
|
|
* than half effective_multixact_freeze_max_age, so that autovacuums to
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
* prevent MultiXact wraparound won't occur too frequently.
|
2013-11-28 20:52:54 +01:00
|
|
|
*/
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
mxid_freezemin = multixact_freeze_min_age;
|
|
|
|
if (mxid_freezemin < 0)
|
|
|
|
mxid_freezemin = vacuum_multixact_freeze_min_age;
|
|
|
|
mxid_freezemin = Min(mxid_freezemin,
|
2015-05-08 18:09:14 +02:00
|
|
|
effective_multixact_freeze_max_age / 2);
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
Assert(mxid_freezemin >= 0);
|
|
|
|
|
|
|
|
/* compute the cutoff multi, being careful to generate a valid value */
|
|
|
|
mxactLimit = GetOldestMultiXactId() - mxid_freezemin;
|
2013-11-28 20:52:54 +01:00
|
|
|
if (mxactLimit < FirstMultiXactId)
|
|
|
|
mxactLimit = FirstMultiXactId;
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
|
|
|
|
safeMxactLimit =
|
2015-05-08 18:09:14 +02:00
|
|
|
ReadNextMultiXactId() - effective_multixact_freeze_max_age;
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
if (safeMxactLimit < FirstMultiXactId)
|
|
|
|
safeMxactLimit = FirstMultiXactId;
|
|
|
|
|
|
|
|
if (MultiXactIdPrecedes(mxactLimit, safeMxactLimit))
|
|
|
|
{
|
|
|
|
ereport(WARNING,
|
|
|
|
(errmsg("oldest multixact is far in the past"),
|
|
|
|
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
|
|
|
|
mxactLimit = safeMxactLimit;
|
|
|
|
}
|
|
|
|
|
2013-11-28 20:52:54 +01:00
|
|
|
*multiXactCutoff = mxactLimit;
|
|
|
|
|
|
|
|
if (xidFullScanLimit != NULL)
|
2009-01-16 14:27:24 +01:00
|
|
|
{
|
2009-06-11 16:49:15 +02:00
|
|
|
int freezetable;
|
2009-01-16 14:27:24 +01:00
|
|
|
|
2013-11-28 20:52:54 +01:00
|
|
|
Assert(mxactFullScanLimit != NULL);
|
|
|
|
|
2009-01-16 14:27:24 +01:00
|
|
|
/*
|
|
|
|
* Determine the table freeze age to use: as specified by the caller,
|
|
|
|
* or vacuum_freeze_table_age, but in any case not more than
|
|
|
|
* autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
|
|
|
|
* VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
|
|
|
|
* before anti-wraparound autovacuum is launched.
|
|
|
|
*/
|
2013-02-01 16:00:40 +01:00
|
|
|
freezetable = freeze_table_age;
|
2009-01-16 14:27:24 +01:00
|
|
|
if (freezetable < 0)
|
|
|
|
freezetable = vacuum_freeze_table_age;
|
|
|
|
freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
|
|
|
|
Assert(freezetable >= 0);
|
|
|
|
|
|
|
|
/*
|
2013-11-28 20:52:54 +01:00
|
|
|
* Compute XID limit causing a full-table vacuum, being careful not to
|
|
|
|
* generate a "permanent" XID.
|
2009-01-16 14:27:24 +01:00
|
|
|
*/
|
|
|
|
limit = ReadNewTransactionId() - freezetable;
|
|
|
|
if (!TransactionIdIsNormal(limit))
|
|
|
|
limit = FirstNormalTransactionId;
|
|
|
|
|
2013-11-28 20:52:54 +01:00
|
|
|
*xidFullScanLimit = limit;
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
|
|
|
|
/*
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
* Similar to the above, determine the table freeze age to use for
|
|
|
|
* multixacts: as specified by the caller, or
|
|
|
|
* vacuum_multixact_freeze_table_age, but in any case not more than
|
|
|
|
* autovacuum_multixact_freeze_table_age * 0.95, so that if you have
|
|
|
|
* e.g. nightly VACUUM schedule, the nightly VACUUM gets a chance to
|
|
|
|
* freeze multixacts before anti-wraparound autovacuum is launched.
|
|
|
|
*/
|
|
|
|
freezetable = multixact_freeze_table_age;
|
|
|
|
if (freezetable < 0)
|
|
|
|
freezetable = vacuum_multixact_freeze_table_age;
|
|
|
|
freezetable = Min(freezetable,
|
2015-05-08 18:09:14 +02:00
|
|
|
effective_multixact_freeze_max_age * 0.95);
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
Assert(freezetable >= 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute MultiXact limit causing a full-table vacuum, being careful
|
|
|
|
* to generate a valid MultiXact value.
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
*/
|
2013-11-28 20:52:54 +01:00
|
|
|
mxactLimit = ReadNextMultiXactId() - freezetable;
|
|
|
|
if (mxactLimit < FirstMultiXactId)
|
|
|
|
mxactLimit = FirstMultiXactId;
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
|
2013-11-28 20:52:54 +01:00
|
|
|
*mxactFullScanLimit = mxactLimit;
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
}
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
else
|
|
|
|
{
|
|
|
|
Assert(mxactFullScanLimit == NULL);
|
|
|
|
}
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
}
|
2001-07-12 06:11:13 +02:00
|
|
|
|
Fix VACUUM so that it always updates pg_class.reltuples/relpages.
When we added the ability for vacuum to skip heap pages by consulting the
visibility map, we made it just not update the reltuples/relpages
statistics if it skipped any pages. But this could leave us with extremely
out-of-date stats for a table that contains any unchanging areas,
especially for TOAST tables which never get processed by ANALYZE. In
particular this could result in autovacuum making poor decisions about when
to process the table, as in recent report from Florian Helmberger. And in
general it's a bad idea to not update the stats at all. Instead, use the
previous values of reltuples/relpages as an estimate of the tuple density
in unvisited pages. This approach results in a "moving average" estimate
of reltuples, which should converge to the correct value over multiple
VACUUM and ANALYZE cycles even when individual measurements aren't very
good.
This new method for updating reltuples is used by both VACUUM and ANALYZE,
with the result that we no longer need the grotty interconnections that
caused ANALYZE to not update the stats depending on what had happened
in the parent VACUUM command.
Also, fix the logic for skipping all-visible pages during VACUUM so that it
looks ahead rather than behind to decide what to do, as per a suggestion
from Greg Stark. This eliminates useless scanning of all-visible pages at
the start of the relation or just after a not-all-visible page. In
particular, the first few pages of the relation will not be invariably
included in the scanned pages, which seems to help in not overweighting
them in the reltuples estimate.
Back-patch to 8.4, where the visibility map was introduced.
2011-05-30 23:05:26 +02:00
|
|
|
/*
|
|
|
|
* vac_estimate_reltuples() -- estimate the new value for pg_class.reltuples
|
|
|
|
*
|
|
|
|
* If we scanned the whole relation then we should just use the count of
|
|
|
|
* live tuples seen; but if we did not, we should not trust the count
|
|
|
|
* unreservedly, especially not in VACUUM, which may have scanned a quite
|
2014-05-06 18:12:18 +02:00
|
|
|
* nonrandom subset of the table. When we have only partial information,
|
Fix VACUUM so that it always updates pg_class.reltuples/relpages.
When we added the ability for vacuum to skip heap pages by consulting the
visibility map, we made it just not update the reltuples/relpages
statistics if it skipped any pages. But this could leave us with extremely
out-of-date stats for a table that contains any unchanging areas,
especially for TOAST tables which never get processed by ANALYZE. In
particular this could result in autovacuum making poor decisions about when
to process the table, as in recent report from Florian Helmberger. And in
general it's a bad idea to not update the stats at all. Instead, use the
previous values of reltuples/relpages as an estimate of the tuple density
in unvisited pages. This approach results in a "moving average" estimate
of reltuples, which should converge to the correct value over multiple
VACUUM and ANALYZE cycles even when individual measurements aren't very
good.
This new method for updating reltuples is used by both VACUUM and ANALYZE,
with the result that we no longer need the grotty interconnections that
caused ANALYZE to not update the stats depending on what had happened
in the parent VACUUM command.
Also, fix the logic for skipping all-visible pages during VACUUM so that it
looks ahead rather than behind to decide what to do, as per a suggestion
from Greg Stark. This eliminates useless scanning of all-visible pages at
the start of the relation or just after a not-all-visible page. In
particular, the first few pages of the relation will not be invariably
included in the scanned pages, which seems to help in not overweighting
them in the reltuples estimate.
Back-patch to 8.4, where the visibility map was introduced.
2011-05-30 23:05:26 +02:00
|
|
|
* we take the old value of pg_class.reltuples as a measurement of the
|
|
|
|
* tuple density in the unscanned pages.
|
|
|
|
*
|
|
|
|
* This routine is shared by VACUUM and ANALYZE.
|
|
|
|
*/
|
|
|
|
double
|
|
|
|
vac_estimate_reltuples(Relation relation, bool is_analyze,
|
|
|
|
BlockNumber total_pages,
|
|
|
|
BlockNumber scanned_pages,
|
|
|
|
double scanned_tuples)
|
|
|
|
{
|
2011-06-09 20:32:50 +02:00
|
|
|
BlockNumber old_rel_pages = relation->rd_rel->relpages;
|
Fix VACUUM so that it always updates pg_class.reltuples/relpages.
When we added the ability for vacuum to skip heap pages by consulting the
visibility map, we made it just not update the reltuples/relpages
statistics if it skipped any pages. But this could leave us with extremely
out-of-date stats for a table that contains any unchanging areas,
especially for TOAST tables which never get processed by ANALYZE. In
particular this could result in autovacuum making poor decisions about when
to process the table, as in recent report from Florian Helmberger. And in
general it's a bad idea to not update the stats at all. Instead, use the
previous values of reltuples/relpages as an estimate of the tuple density
in unvisited pages. This approach results in a "moving average" estimate
of reltuples, which should converge to the correct value over multiple
VACUUM and ANALYZE cycles even when individual measurements aren't very
good.
This new method for updating reltuples is used by both VACUUM and ANALYZE,
with the result that we no longer need the grotty interconnections that
caused ANALYZE to not update the stats depending on what had happened
in the parent VACUUM command.
Also, fix the logic for skipping all-visible pages during VACUUM so that it
looks ahead rather than behind to decide what to do, as per a suggestion
from Greg Stark. This eliminates useless scanning of all-visible pages at
the start of the relation or just after a not-all-visible page. In
particular, the first few pages of the relation will not be invariably
included in the scanned pages, which seems to help in not overweighting
them in the reltuples estimate.
Back-patch to 8.4, where the visibility map was introduced.
2011-05-30 23:05:26 +02:00
|
|
|
double old_rel_tuples = relation->rd_rel->reltuples;
|
|
|
|
double old_density;
|
|
|
|
double new_density;
|
|
|
|
double multiplier;
|
|
|
|
double updated_density;
|
|
|
|
|
|
|
|
/* If we did scan the whole table, just use the count as-is */
|
|
|
|
if (scanned_pages >= total_pages)
|
|
|
|
return scanned_tuples;
|
|
|
|
|
|
|
|
/*
|
2011-06-09 20:32:50 +02:00
|
|
|
* If scanned_pages is zero but total_pages isn't, keep the existing value
|
Fix a missed case in code for "moving average" estimate of reltuples.
It is possible for VACUUM to scan no pages at all, if the visibility map
shows that all pages are all-visible. In this situation VACUUM has no new
information to report about the relation's tuple density, so it wasn't
changing pg_class.reltuples ... but it updated pg_class.relpages anyway.
That's wrong in general, since there is no evidence to justify changing the
density ratio reltuples/relpages, but it's particularly bad if the previous
state was relpages=reltuples=0, which means "unknown tuple density".
We just replaced "unknown" with "zero". ANALYZE would eventually recover
from this, but it could take a lot of repetitions of ANALYZE to do so if
the relation size is much larger than the maximum number of pages ANALYZE
will scan, because of the moving-average behavior introduced by commit
b4b6923e03f4d29636a94f6f4cc2f5cf6298b8c8.
The only known situation where we could have relpages=reltuples=0 and yet
the visibility map asserts everything's visible is immediately following
a pg_upgrade. It might be advisable for pg_upgrade to try to preserve the
relpages/reltuples statistics; but in any case this code is wrong on its
own terms, so fix it. Per report from Sergey Koposov.
Back-patch to 8.4, where the visibility map was introduced, same as the
previous change.
2011-08-30 20:49:45 +02:00
|
|
|
* of reltuples. (Note: callers should avoid updating the pg_class
|
|
|
|
* statistics in this situation, since no new information has been
|
|
|
|
* provided.)
|
Fix VACUUM so that it always updates pg_class.reltuples/relpages.
When we added the ability for vacuum to skip heap pages by consulting the
visibility map, we made it just not update the reltuples/relpages
statistics if it skipped any pages. But this could leave us with extremely
out-of-date stats for a table that contains any unchanging areas,
especially for TOAST tables which never get processed by ANALYZE. In
particular this could result in autovacuum making poor decisions about when
to process the table, as in recent report from Florian Helmberger. And in
general it's a bad idea to not update the stats at all. Instead, use the
previous values of reltuples/relpages as an estimate of the tuple density
in unvisited pages. This approach results in a "moving average" estimate
of reltuples, which should converge to the correct value over multiple
VACUUM and ANALYZE cycles even when individual measurements aren't very
good.
This new method for updating reltuples is used by both VACUUM and ANALYZE,
with the result that we no longer need the grotty interconnections that
caused ANALYZE to not update the stats depending on what had happened
in the parent VACUUM command.
Also, fix the logic for skipping all-visible pages during VACUUM so that it
looks ahead rather than behind to decide what to do, as per a suggestion
from Greg Stark. This eliminates useless scanning of all-visible pages at
the start of the relation or just after a not-all-visible page. In
particular, the first few pages of the relation will not be invariably
included in the scanned pages, which seems to help in not overweighting
them in the reltuples estimate.
Back-patch to 8.4, where the visibility map was introduced.
2011-05-30 23:05:26 +02:00
|
|
|
*/
|
|
|
|
if (scanned_pages == 0)
|
|
|
|
return old_rel_tuples;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If old value of relpages is zero, old density is indeterminate; we
|
|
|
|
* can't do much except scale up scanned_tuples to match total_pages.
|
|
|
|
*/
|
|
|
|
if (old_rel_pages == 0)
|
|
|
|
return floor((scanned_tuples / scanned_pages) * total_pages + 0.5);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Okay, we've covered the corner cases. The normal calculation is to
|
2011-06-09 20:32:50 +02:00
|
|
|
* convert the old measurement to a density (tuples per page), then update
|
|
|
|
* the density using an exponential-moving-average approach, and finally
|
|
|
|
* compute reltuples as updated_density * total_pages.
|
Fix VACUUM so that it always updates pg_class.reltuples/relpages.
When we added the ability for vacuum to skip heap pages by consulting the
visibility map, we made it just not update the reltuples/relpages
statistics if it skipped any pages. But this could leave us with extremely
out-of-date stats for a table that contains any unchanging areas,
especially for TOAST tables which never get processed by ANALYZE. In
particular this could result in autovacuum making poor decisions about when
to process the table, as in recent report from Florian Helmberger. And in
general it's a bad idea to not update the stats at all. Instead, use the
previous values of reltuples/relpages as an estimate of the tuple density
in unvisited pages. This approach results in a "moving average" estimate
of reltuples, which should converge to the correct value over multiple
VACUUM and ANALYZE cycles even when individual measurements aren't very
good.
This new method for updating reltuples is used by both VACUUM and ANALYZE,
with the result that we no longer need the grotty interconnections that
caused ANALYZE to not update the stats depending on what had happened
in the parent VACUUM command.
Also, fix the logic for skipping all-visible pages during VACUUM so that it
looks ahead rather than behind to decide what to do, as per a suggestion
from Greg Stark. This eliminates useless scanning of all-visible pages at
the start of the relation or just after a not-all-visible page. In
particular, the first few pages of the relation will not be invariably
included in the scanned pages, which seems to help in not overweighting
them in the reltuples estimate.
Back-patch to 8.4, where the visibility map was introduced.
2011-05-30 23:05:26 +02:00
|
|
|
*
|
2011-06-09 20:32:50 +02:00
|
|
|
* For ANALYZE, the moving average multiplier is just the fraction of the
|
|
|
|
* table's pages we scanned. This is equivalent to assuming that the
|
|
|
|
* tuple density in the unscanned pages didn't change. Of course, it
|
|
|
|
* probably did, if the new density measurement is different. But over
|
|
|
|
* repeated cycles, the value of reltuples will converge towards the
|
|
|
|
* correct value, if repeated measurements show the same new density.
|
Fix VACUUM so that it always updates pg_class.reltuples/relpages.
When we added the ability for vacuum to skip heap pages by consulting the
visibility map, we made it just not update the reltuples/relpages
statistics if it skipped any pages. But this could leave us with extremely
out-of-date stats for a table that contains any unchanging areas,
especially for TOAST tables which never get processed by ANALYZE. In
particular this could result in autovacuum making poor decisions about when
to process the table, as in recent report from Florian Helmberger. And in
general it's a bad idea to not update the stats at all. Instead, use the
previous values of reltuples/relpages as an estimate of the tuple density
in unvisited pages. This approach results in a "moving average" estimate
of reltuples, which should converge to the correct value over multiple
VACUUM and ANALYZE cycles even when individual measurements aren't very
good.
This new method for updating reltuples is used by both VACUUM and ANALYZE,
with the result that we no longer need the grotty interconnections that
caused ANALYZE to not update the stats depending on what had happened
in the parent VACUUM command.
Also, fix the logic for skipping all-visible pages during VACUUM so that it
looks ahead rather than behind to decide what to do, as per a suggestion
from Greg Stark. This eliminates useless scanning of all-visible pages at
the start of the relation or just after a not-all-visible page. In
particular, the first few pages of the relation will not be invariably
included in the scanned pages, which seems to help in not overweighting
them in the reltuples estimate.
Back-patch to 8.4, where the visibility map was introduced.
2011-05-30 23:05:26 +02:00
|
|
|
*
|
|
|
|
* For VACUUM, the situation is a bit different: we have looked at a
|
|
|
|
* nonrandom sample of pages, but we know for certain that the pages we
|
|
|
|
* didn't look at are precisely the ones that haven't changed lately.
|
|
|
|
* Thus, there is a reasonable argument for doing exactly the same thing
|
2011-06-09 20:32:50 +02:00
|
|
|
* as for the ANALYZE case, that is use the old density measurement as the
|
|
|
|
* value for the unscanned pages.
|
Fix VACUUM so that it always updates pg_class.reltuples/relpages.
When we added the ability for vacuum to skip heap pages by consulting the
visibility map, we made it just not update the reltuples/relpages
statistics if it skipped any pages. But this could leave us with extremely
out-of-date stats for a table that contains any unchanging areas,
especially for TOAST tables which never get processed by ANALYZE. In
particular this could result in autovacuum making poor decisions about when
to process the table, as in recent report from Florian Helmberger. And in
general it's a bad idea to not update the stats at all. Instead, use the
previous values of reltuples/relpages as an estimate of the tuple density
in unvisited pages. This approach results in a "moving average" estimate
of reltuples, which should converge to the correct value over multiple
VACUUM and ANALYZE cycles even when individual measurements aren't very
good.
This new method for updating reltuples is used by both VACUUM and ANALYZE,
with the result that we no longer need the grotty interconnections that
caused ANALYZE to not update the stats depending on what had happened
in the parent VACUUM command.
Also, fix the logic for skipping all-visible pages during VACUUM so that it
looks ahead rather than behind to decide what to do, as per a suggestion
from Greg Stark. This eliminates useless scanning of all-visible pages at
the start of the relation or just after a not-all-visible page. In
particular, the first few pages of the relation will not be invariably
included in the scanned pages, which seems to help in not overweighting
them in the reltuples estimate.
Back-patch to 8.4, where the visibility map was introduced.
2011-05-30 23:05:26 +02:00
|
|
|
*
|
|
|
|
* This logic could probably use further refinement.
|
|
|
|
*/
|
|
|
|
old_density = old_rel_tuples / old_rel_pages;
|
|
|
|
new_density = scanned_tuples / scanned_pages;
|
|
|
|
multiplier = (double) scanned_pages / (double) total_pages;
|
|
|
|
updated_density = old_density + (new_density - old_density) * multiplier;
|
|
|
|
return floor(updated_density * total_pages + 0.5);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2001-07-12 06:11:13 +02:00
|
|
|
* vac_update_relstats() -- update statistics for one relation
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
2001-07-12 06:11:13 +02:00
|
|
|
* Update the whole-relation statistics that are kept in its pg_class
|
|
|
|
* row. There are additional stats that will be updated if we are
|
|
|
|
* doing ANALYZE, but we always update these stats. This routine works
|
|
|
|
* for both index and heap relation entries in pg_class.
|
|
|
|
*
|
2006-05-11 01:18:39 +02:00
|
|
|
* We violate transaction semantics here by overwriting the rel's
|
|
|
|
* existing pg_class tuple with the new values. This is reasonably
|
Avoid corrupting tables when ANALYZE inside a transaction is rolled back.
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally. This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow. It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back. This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table. However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.
To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al. There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back. In any case we probably don't want
to change this behavior in back branches.
Per bug #11638 from Casey Shobe. This has been broken for a good long
time, so back-patch to all supported branches.
Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
2014-10-29 23:12:02 +01:00
|
|
|
* safe as long as we're sure that the new values are correct whether or
|
|
|
|
* not this transaction commits. The reason for doing this is that if
|
|
|
|
* we updated these tuples in the usual way, vacuuming pg_class itself
|
|
|
|
* wouldn't work very well --- by the time we got done with a vacuum
|
|
|
|
* cycle, most of the tuples in pg_class would've been obsoleted. Of
|
|
|
|
* course, this only works for fixed-size not-null columns, but these are.
|
2008-11-10 01:49:37 +01:00
|
|
|
*
|
2006-07-30 04:07:18 +02:00
|
|
|
* Another reason for doing it this way is that when we are in a lazy
|
Avoid corrupting tables when ANALYZE inside a transaction is rolled back.
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally. This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow. It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back. This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table. However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.
To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al. There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back. In any case we probably don't want
to change this behavior in back branches.
Per bug #11638 from Casey Shobe. This has been broken for a good long
time, so back-patch to all supported branches.
Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
2014-10-29 23:12:02 +01:00
|
|
|
* VACUUM and have PROC_IN_VACUUM set, we mustn't do any regular updates.
|
|
|
|
* Somebody vacuuming pg_class might think they could delete a tuple
|
2007-10-24 22:55:36 +02:00
|
|
|
* marked with xmin = our xid.
|
2006-07-30 04:07:18 +02:00
|
|
|
*
|
Avoid corrupting tables when ANALYZE inside a transaction is rolled back.
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally. This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow. It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back. This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table. However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.
To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al. There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back. In any case we probably don't want
to change this behavior in back branches.
Per bug #11638 from Casey Shobe. This has been broken for a good long
time, so back-patch to all supported branches.
Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
2014-10-29 23:12:02 +01:00
|
|
|
* In addition to fundamentally nontransactional statistics such as
|
|
|
|
* relpages and relallvisible, we try to maintain certain lazily-updated
|
|
|
|
* DDL flags such as relhasindex, by clearing them if no longer correct.
|
|
|
|
* It's safe to do this in VACUUM, which can't run in parallel with
|
|
|
|
* CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
|
2014-10-30 18:03:22 +01:00
|
|
|
* However, it's *not* safe to do it in an ANALYZE that's within an
|
|
|
|
* outer transaction, because for example the current transaction might
|
Avoid corrupting tables when ANALYZE inside a transaction is rolled back.
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally. This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow. It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back. This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table. However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.
To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al. There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back. In any case we probably don't want
to change this behavior in back branches.
Per bug #11638 from Casey Shobe. This has been broken for a good long
time, so back-patch to all supported branches.
Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
2014-10-29 23:12:02 +01:00
|
|
|
* have dropped the last index; then we'd think relhasindex should be
|
|
|
|
* cleared, but if the transaction later rolls back this would be wrong.
|
2014-10-30 18:03:22 +01:00
|
|
|
* So we refrain from updating the DDL flags if we're inside an outer
|
|
|
|
* transaction. This is OK since postponing the flag maintenance is
|
|
|
|
* always allowable.
|
Avoid corrupting tables when ANALYZE inside a transaction is rolled back.
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally. This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow. It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back. This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table. However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.
To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al. There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back. In any case we probably don't want
to change this behavior in back branches.
Per bug #11638 from Casey Shobe. This has been broken for a good long
time, so back-patch to all supported branches.
Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
2014-10-29 23:12:02 +01:00
|
|
|
*
|
Fix VACUUM so that it always updates pg_class.reltuples/relpages.
When we added the ability for vacuum to skip heap pages by consulting the
visibility map, we made it just not update the reltuples/relpages
statistics if it skipped any pages. But this could leave us with extremely
out-of-date stats for a table that contains any unchanging areas,
especially for TOAST tables which never get processed by ANALYZE. In
particular this could result in autovacuum making poor decisions about when
to process the table, as in recent report from Florian Helmberger. And in
general it's a bad idea to not update the stats at all. Instead, use the
previous values of reltuples/relpages as an estimate of the tuple density
in unvisited pages. This approach results in a "moving average" estimate
of reltuples, which should converge to the correct value over multiple
VACUUM and ANALYZE cycles even when individual measurements aren't very
good.
This new method for updating reltuples is used by both VACUUM and ANALYZE,
with the result that we no longer need the grotty interconnections that
caused ANALYZE to not update the stats depending on what had happened
in the parent VACUUM command.
Also, fix the logic for skipping all-visible pages during VACUUM so that it
looks ahead rather than behind to decide what to do, as per a suggestion
from Greg Stark. This eliminates useless scanning of all-visible pages at
the start of the relation or just after a not-all-visible page. In
particular, the first few pages of the relation will not be invariably
included in the scanned pages, which seems to help in not overweighting
them in the reltuples estimate.
Back-patch to 8.4, where the visibility map was introduced.
2011-05-30 23:05:26 +02:00
|
|
|
* This routine is shared by VACUUM and ANALYZE.
|
2001-07-12 06:11:13 +02:00
|
|
|
*/
|
|
|
|
void
|
2008-11-10 01:49:37 +01:00
|
|
|
vac_update_relstats(Relation relation,
|
|
|
|
BlockNumber num_pages, double num_tuples,
|
2011-10-14 23:23:01 +02:00
|
|
|
BlockNumber num_all_visible_pages,
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
bool hasindex, TransactionId frozenxid,
|
2014-10-30 18:03:22 +01:00
|
|
|
MultiXactId minmulti,
|
|
|
|
bool in_outer_xact)
|
2001-07-12 06:11:13 +02:00
|
|
|
{
|
2008-11-10 01:49:37 +01:00
|
|
|
Oid relid = RelationGetRelid(relation);
|
2001-07-12 06:11:13 +02:00
|
|
|
Relation rd;
|
|
|
|
HeapTuple ctup;
|
|
|
|
Form_pg_class pgcform;
|
2006-05-11 01:18:39 +02:00
|
|
|
bool dirty;
|
2001-07-12 06:11:13 +02:00
|
|
|
|
2005-04-14 22:03:27 +02:00
|
|
|
rd = heap_open(RelationRelationId, RowExclusiveLock);
|
2001-07-12 06:11:13 +02:00
|
|
|
|
2006-05-11 01:18:39 +02:00
|
|
|
/* Fetch a copy of the tuple to scribble on */
|
2010-02-14 19:42:19 +01:00
|
|
|
ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
|
2001-07-12 06:11:13 +02:00
|
|
|
if (!HeapTupleIsValid(ctup))
|
|
|
|
elog(ERROR, "pg_class entry for relid %u vanished during vacuuming",
|
|
|
|
relid);
|
2006-05-11 01:18:39 +02:00
|
|
|
pgcform = (Form_pg_class) GETSTRUCT(ctup);
|
2001-07-12 06:11:13 +02:00
|
|
|
|
Avoid corrupting tables when ANALYZE inside a transaction is rolled back.
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally. This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow. It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back. This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table. However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.
To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al. There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back. In any case we probably don't want
to change this behavior in back branches.
Per bug #11638 from Casey Shobe. This has been broken for a good long
time, so back-patch to all supported branches.
Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
2014-10-29 23:12:02 +01:00
|
|
|
/* Apply statistical updates, if any, to copied tuple */
|
2001-10-25 07:50:21 +02:00
|
|
|
|
2006-05-11 01:18:39 +02:00
|
|
|
dirty = false;
|
|
|
|
if (pgcform->relpages != (int32) num_pages)
|
|
|
|
{
|
|
|
|
pgcform->relpages = (int32) num_pages;
|
|
|
|
dirty = true;
|
|
|
|
}
|
|
|
|
if (pgcform->reltuples != (float4) num_tuples)
|
|
|
|
{
|
|
|
|
pgcform->reltuples = (float4) num_tuples;
|
|
|
|
dirty = true;
|
|
|
|
}
|
2011-10-14 23:23:01 +02:00
|
|
|
if (pgcform->relallvisible != (int32) num_all_visible_pages)
|
|
|
|
{
|
|
|
|
pgcform->relallvisible = (int32) num_all_visible_pages;
|
|
|
|
dirty = true;
|
|
|
|
}
|
2006-10-04 02:30:14 +02:00
|
|
|
|
2014-10-30 18:03:22 +01:00
|
|
|
/* Apply DDL updates, but not inside an outer transaction (see above) */
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
|
2014-10-30 18:03:22 +01:00
|
|
|
if (!in_outer_xact)
|
2008-11-10 01:49:37 +01:00
|
|
|
{
|
Avoid corrupting tables when ANALYZE inside a transaction is rolled back.
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally. This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow. It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back. This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table. However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.
To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al. There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back. In any case we probably don't want
to change this behavior in back branches.
Per bug #11638 from Casey Shobe. This has been broken for a good long
time, so back-patch to all supported branches.
Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
2014-10-29 23:12:02 +01:00
|
|
|
/*
|
|
|
|
* If we didn't find any indexes, reset relhasindex.
|
|
|
|
*/
|
|
|
|
if (pgcform->relhasindex && !hasindex)
|
|
|
|
{
|
|
|
|
pgcform->relhasindex = false;
|
|
|
|
dirty = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have discovered that there are no indexes, then there's no
|
|
|
|
* primary key either. This could be done more thoroughly...
|
|
|
|
*/
|
|
|
|
if (pgcform->relhaspkey && !hasindex)
|
|
|
|
{
|
|
|
|
pgcform->relhaspkey = false;
|
|
|
|
dirty = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* We also clear relhasrules and relhastriggers if needed */
|
|
|
|
if (pgcform->relhasrules && relation->rd_rules == NULL)
|
|
|
|
{
|
|
|
|
pgcform->relhasrules = false;
|
|
|
|
dirty = true;
|
|
|
|
}
|
|
|
|
if (pgcform->relhastriggers && relation->trigdesc == NULL)
|
|
|
|
{
|
|
|
|
pgcform->relhastriggers = false;
|
|
|
|
dirty = true;
|
|
|
|
}
|
2008-11-10 01:49:37 +01:00
|
|
|
}
|
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
/*
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
* Update relfrozenxid, unless caller passed InvalidTransactionId
|
|
|
|
* indicating it has no new data.
|
|
|
|
*
|
|
|
|
* Ordinarily, we don't let relfrozenxid go backwards: if things are
|
|
|
|
* working correctly, the only way the new frozenxid could be older would
|
|
|
|
* be if a previous VACUUM was done with a tighter freeze_min_age, in
|
|
|
|
* which case we don't want to forget the work it already did. However,
|
|
|
|
* if the stored relfrozenxid is "in the future", then it must be corrupt
|
|
|
|
* and it seems best to overwrite it with the cutoff we used this time.
|
2014-07-21 18:58:41 +02:00
|
|
|
* This should match vac_update_datfrozenxid() concerning what we consider
|
|
|
|
* to be "in the future".
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
*/
|
|
|
|
if (TransactionIdIsNormal(frozenxid) &&
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
pgcform->relfrozenxid != frozenxid &&
|
|
|
|
(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
|
2014-07-21 18:58:41 +02:00
|
|
|
TransactionIdPrecedes(ReadNewTransactionId(),
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
pgcform->relfrozenxid)))
|
2006-07-10 18:20:52 +02:00
|
|
|
{
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
pgcform->relfrozenxid = frozenxid;
|
2006-07-10 18:20:52 +02:00
|
|
|
dirty = true;
|
|
|
|
}
|
2003-10-03 01:19:44 +02:00
|
|
|
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
/* Similarly for relminmxid */
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
if (MultiXactIdIsValid(minmulti) &&
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
pgcform->relminmxid != minmulti &&
|
|
|
|
(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
|
2014-07-21 18:58:41 +02:00
|
|
|
MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
{
|
|
|
|
pgcform->relminmxid = minmulti;
|
|
|
|
dirty = true;
|
|
|
|
}
|
|
|
|
|
2008-12-17 10:15:03 +01:00
|
|
|
/* If anything changed, write out the tuple. */
|
2006-05-11 01:18:39 +02:00
|
|
|
if (dirty)
|
|
|
|
heap_inplace_update(rd, ctup);
|
2001-07-12 06:11:13 +02:00
|
|
|
|
|
|
|
heap_close(rd, RowExclusiveLock);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2001-08-26 18:56:03 +02:00
|
|
|
/*
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* vac_update_datfrozenxid() -- update pg_database.datfrozenxid for our DB
|
2001-08-26 18:56:03 +02:00
|
|
|
*
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* Update pg_database's datfrozenxid entry for our database to be the
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
* minimum of the pg_class.relfrozenxid values.
|
|
|
|
*
|
2013-09-16 20:45:00 +02:00
|
|
|
* Similarly, update our datminmxid to be the minimum of the
|
|
|
|
* pg_class.relminmxid values.
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
*
|
|
|
|
* If we are able to advance either pg_database value, also try to
|
2017-03-17 14:46:58 +01:00
|
|
|
* truncate pg_xact and pg_multixact.
|
2001-08-26 18:56:03 +02:00
|
|
|
*
|
2006-05-11 01:18:39 +02:00
|
|
|
* We violate transaction semantics here by overwriting the database's
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
* existing pg_database tuple with the new values. This is reasonably
|
|
|
|
* safe since the new values are correct whether or not this transaction
|
2006-05-11 01:18:39 +02:00
|
|
|
* commits. As with vac_update_relstats, this avoids leaving dead tuples
|
|
|
|
* behind after a VACUUM.
|
2001-08-26 18:56:03 +02:00
|
|
|
*/
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
void
|
|
|
|
vac_update_datfrozenxid(void)
|
2001-08-26 18:56:03 +02:00
|
|
|
{
|
|
|
|
HeapTuple tuple;
|
|
|
|
Form_pg_database dbform;
|
2006-07-10 18:20:52 +02:00
|
|
|
Relation relation;
|
2006-10-04 02:30:14 +02:00
|
|
|
SysScanDesc scan;
|
2006-07-10 18:20:52 +02:00
|
|
|
HeapTuple classTup;
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
TransactionId newFrozenXid;
|
2013-09-16 20:45:00 +02:00
|
|
|
MultiXactId newMinMulti;
|
2014-07-21 18:58:41 +02:00
|
|
|
TransactionId lastSaneFrozenXid;
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
MultiXactId lastSaneMinMulti;
|
|
|
|
bool bogus = false;
|
2006-07-10 18:20:52 +02:00
|
|
|
bool dirty = false;
|
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
/*
|
2008-09-11 16:01:10 +02:00
|
|
|
* Initialize the "min" calculation with GetOldestXmin, which is a
|
|
|
|
* reasonable approximation to the minimum relfrozenxid for not-yet-
|
|
|
|
* committed pg_class entries for new tables; see AddNewRelationTuple().
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
* So we cannot produce a wrong minimum by starting with this.
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
*/
|
2017-03-22 17:51:01 +01:00
|
|
|
newFrozenXid = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
/*
|
2013-05-29 22:58:43 +02:00
|
|
|
* Similarly, initialize the MultiXact "min" with the value that would be
|
|
|
|
* used on pg_class for new tables. See AddNewRelationTuple().
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
*/
|
2014-07-21 18:58:41 +02:00
|
|
|
newMinMulti = GetOldestMultiXactId();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Identify the latest relfrozenxid and relminmxid values that we could
|
|
|
|
* validly see during the scan. These are conservative values, but it's
|
|
|
|
* not really worth trying to be more exact.
|
|
|
|
*/
|
|
|
|
lastSaneFrozenXid = ReadNewTransactionId();
|
|
|
|
lastSaneMinMulti = ReadNextMultiXactId();
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
|
2006-10-04 02:30:14 +02:00
|
|
|
/*
|
|
|
|
* We must seqscan pg_class to find the minimum Xid, because there is no
|
|
|
|
* index that can help us here.
|
2006-07-10 18:20:52 +02:00
|
|
|
*/
|
|
|
|
relation = heap_open(RelationRelationId, AccessShareLock);
|
|
|
|
|
|
|
|
scan = systable_beginscan(relation, InvalidOid, false,
|
Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
2013-07-02 15:47:01 +02:00
|
|
|
NULL, 0, NULL);
|
2006-07-10 18:20:52 +02:00
|
|
|
|
|
|
|
while ((classTup = systable_getnext(scan)) != NULL)
|
|
|
|
{
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
Form_pg_class classForm = (Form_pg_class) GETSTRUCT(classTup);
|
2006-07-10 18:20:52 +02:00
|
|
|
|
|
|
|
/*
|
2013-07-05 21:25:51 +02:00
|
|
|
* Only consider relations able to hold unfrozen XIDs (anything else
|
|
|
|
* should have InvalidTransactionId in relfrozenxid anyway.)
|
2006-07-10 18:20:52 +02:00
|
|
|
*/
|
|
|
|
if (classForm->relkind != RELKIND_RELATION &&
|
2013-03-04 01:23:31 +01:00
|
|
|
classForm->relkind != RELKIND_MATVIEW &&
|
2006-07-10 18:20:52 +02:00
|
|
|
classForm->relkind != RELKIND_TOASTVALUE)
|
|
|
|
continue;
|
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
Assert(TransactionIdIsNormal(classForm->relfrozenxid));
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
Assert(MultiXactIdIsValid(classForm->relminmxid));
|
2006-07-10 18:20:52 +02:00
|
|
|
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
/*
|
|
|
|
* If things are working properly, no relation should have a
|
|
|
|
* relfrozenxid or relminmxid that is "in the future". However, such
|
|
|
|
* cases have been known to arise due to bugs in pg_upgrade. If we
|
|
|
|
* see any entries that are "in the future", chicken out and don't do
|
|
|
|
* anything. This ensures we won't truncate clog before those
|
|
|
|
* relations have been scanned and cleaned up.
|
|
|
|
*/
|
|
|
|
if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
|
|
|
|
MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
|
|
|
|
{
|
|
|
|
bogus = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
if (TransactionIdPrecedes(classForm->relfrozenxid, newFrozenXid))
|
|
|
|
newFrozenXid = classForm->relfrozenxid;
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
|
2013-09-16 20:45:00 +02:00
|
|
|
if (MultiXactIdPrecedes(classForm->relminmxid, newMinMulti))
|
|
|
|
newMinMulti = classForm->relminmxid;
|
2006-07-10 18:20:52 +02:00
|
|
|
}
|
2001-08-26 18:56:03 +02:00
|
|
|
|
2006-07-10 18:20:52 +02:00
|
|
|
/* we're done with pg_class */
|
|
|
|
systable_endscan(scan);
|
|
|
|
heap_close(relation, AccessShareLock);
|
|
|
|
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
/* chicken out if bogus data found */
|
|
|
|
if (bogus)
|
|
|
|
return;
|
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
Assert(TransactionIdIsNormal(newFrozenXid));
|
2013-09-16 20:45:00 +02:00
|
|
|
Assert(MultiXactIdIsValid(newMinMulti));
|
2006-07-10 18:20:52 +02:00
|
|
|
|
|
|
|
/* Now fetch the pg_database tuple we need to update. */
|
2005-04-14 22:03:27 +02:00
|
|
|
relation = heap_open(DatabaseRelationId, RowExclusiveLock);
|
2001-08-26 18:56:03 +02:00
|
|
|
|
2006-05-11 01:18:39 +02:00
|
|
|
/* Fetch a copy of the tuple to scribble on */
|
2010-02-14 19:42:19 +01:00
|
|
|
tuple = SearchSysCacheCopy1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
|
2001-08-26 18:56:03 +02:00
|
|
|
if (!HeapTupleIsValid(tuple))
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
elog(ERROR, "could not find tuple for database %u", MyDatabaseId);
|
2001-08-26 18:56:03 +02:00
|
|
|
dbform = (Form_pg_database) GETSTRUCT(tuple);
|
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
/*
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
* As in vac_update_relstats(), we ordinarily don't want to let
|
|
|
|
* datfrozenxid go backward; but if it's "in the future" then it must be
|
|
|
|
* corrupt and it seems best to overwrite it.
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
*/
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
if (dbform->datfrozenxid != newFrozenXid &&
|
|
|
|
(TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid) ||
|
|
|
|
TransactionIdPrecedes(lastSaneFrozenXid, dbform->datfrozenxid)))
|
2006-07-10 18:20:52 +02:00
|
|
|
{
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
dbform->datfrozenxid = newFrozenXid;
|
2006-07-10 18:20:52 +02:00
|
|
|
dirty = true;
|
|
|
|
}
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
else
|
|
|
|
newFrozenXid = dbform->datfrozenxid;
|
2001-08-26 18:56:03 +02:00
|
|
|
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
/* Ditto for datminmxid */
|
|
|
|
if (dbform->datminmxid != newMinMulti &&
|
|
|
|
(MultiXactIdPrecedes(dbform->datminmxid, newMinMulti) ||
|
|
|
|
MultiXactIdPrecedes(lastSaneMinMulti, dbform->datminmxid)))
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
{
|
2013-09-16 20:45:00 +02:00
|
|
|
dbform->datminmxid = newMinMulti;
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
dirty = true;
|
|
|
|
}
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
else
|
|
|
|
newMinMulti = dbform->datminmxid;
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
|
2006-07-10 18:20:52 +02:00
|
|
|
if (dirty)
|
|
|
|
heap_inplace_update(relation, tuple);
|
2001-08-26 18:56:03 +02:00
|
|
|
|
2006-07-10 18:20:52 +02:00
|
|
|
heap_freetuple(tuple);
|
2001-08-26 18:56:03 +02:00
|
|
|
heap_close(relation, RowExclusiveLock);
|
2005-07-29 21:30:09 +02:00
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
/*
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
* If we were able to advance datfrozenxid or datminmxid, see if we can
|
2017-03-17 14:46:58 +01:00
|
|
|
* truncate pg_xact and/or pg_multixact. Also do it if the shared
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
* XID-wrap-limit info is stale, since this action will update that too.
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
*/
|
2009-09-01 06:46:49 +02:00
|
|
|
if (dirty || ForceTransactionIdLimitUpdate())
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
vac_truncate_clog(newFrozenXid, newMinMulti,
|
|
|
|
lastSaneFrozenXid, lastSaneMinMulti);
|
2001-08-26 18:56:03 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* vac_truncate_clog() -- attempt to truncate the commit log
|
|
|
|
*
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* Scan pg_database to determine the system-wide oldest datfrozenxid,
|
2017-03-17 14:46:58 +01:00
|
|
|
* and use it to truncate the transaction commit log (pg_xact).
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* Also update the XID wrap limit info maintained by varsup.c.
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
* Likewise for datminmxid.
|
2001-08-26 18:56:03 +02:00
|
|
|
*
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
* The passed frozenXID and minMulti are the updated values for my own
|
|
|
|
* pg_database entry. They're used to initialize the "min" calculations.
|
|
|
|
* The caller also passes the "last sane" XID and MXID, since it has
|
|
|
|
* those at hand already.
|
2001-08-26 18:56:03 +02:00
|
|
|
*
|
2010-02-15 17:10:34 +01:00
|
|
|
* This routine is only invoked when we've managed to change our
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
* DB's datfrozenxid/datminmxid values, or we found that the shared
|
|
|
|
* XID-wrap-limit info is stale.
|
2001-08-26 18:56:03 +02:00
|
|
|
*/
|
|
|
|
static void
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
vac_truncate_clog(TransactionId frozenXID,
|
|
|
|
MultiXactId minMulti,
|
|
|
|
TransactionId lastSaneFrozenXid,
|
|
|
|
MultiXactId lastSaneMinMulti)
|
2001-08-26 18:56:03 +02:00
|
|
|
{
|
2016-05-24 21:20:12 +02:00
|
|
|
TransactionId nextXID = ReadNewTransactionId();
|
2001-08-26 18:56:03 +02:00
|
|
|
Relation relation;
|
|
|
|
HeapScanDesc scan;
|
|
|
|
HeapTuple tuple;
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
Oid oldestxid_datoid;
|
2013-09-16 20:45:00 +02:00
|
|
|
Oid minmulti_datoid;
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
bool bogus = false;
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
bool frozenAlreadyWrapped = false;
|
2002-04-02 07:11:55 +02:00
|
|
|
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
/* init oldest datoids to sync with my frozenXID/minMulti values */
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
oldestxid_datoid = MyDatabaseId;
|
2013-09-16 20:45:00 +02:00
|
|
|
minmulti_datoid = MyDatabaseId;
|
2001-08-26 18:56:03 +02:00
|
|
|
|
2005-02-20 03:22:07 +01:00
|
|
|
/*
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
* Scan pg_database to compute the minimum datfrozenxid/datminmxid
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
*
|
2016-05-24 21:47:51 +02:00
|
|
|
* Since vac_update_datfrozenxid updates datfrozenxid/datminmxid in-place,
|
|
|
|
* the values could change while we look at them. Fetch each one just
|
|
|
|
* once to ensure sane behavior of the comparison logic. (Here, as in
|
|
|
|
* many other places, we assume that fetching or updating an XID in shared
|
|
|
|
* storage is atomic.)
|
|
|
|
*
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* Note: we need not worry about a race condition with new entries being
|
|
|
|
* inserted by CREATE DATABASE. Any such entry will have a copy of some
|
|
|
|
* existing DB's datfrozenxid, and that source DB cannot be ours because
|
|
|
|
* of the interlock against copying a DB containing an active backend.
|
2007-11-15 22:14:46 +01:00
|
|
|
* Hence the new entry will not reduce the minimum. Also, if two VACUUMs
|
|
|
|
* concurrently modify the datfrozenxid's of different databases, the
|
2017-03-17 14:46:58 +01:00
|
|
|
* worst possible outcome is that pg_xact is not truncated as aggressively
|
2007-11-15 22:14:46 +01:00
|
|
|
* as it could be.
|
2005-02-20 03:22:07 +01:00
|
|
|
*/
|
2005-04-14 22:03:27 +02:00
|
|
|
relation = heap_open(DatabaseRelationId, AccessShareLock);
|
2001-08-26 18:56:03 +02:00
|
|
|
|
Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
2013-07-02 15:47:01 +02:00
|
|
|
scan = heap_beginscan_catalog(relation, 0, NULL);
|
2001-08-26 18:56:03 +02:00
|
|
|
|
2002-05-21 01:51:44 +02:00
|
|
|
while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
|
2001-08-26 18:56:03 +02:00
|
|
|
{
|
2016-05-24 21:47:51 +02:00
|
|
|
volatile FormData_pg_database *dbform = (Form_pg_database) GETSTRUCT(tuple);
|
|
|
|
TransactionId datfrozenxid = dbform->datfrozenxid;
|
|
|
|
TransactionId datminmxid = dbform->datminmxid;
|
2001-08-26 18:56:03 +02:00
|
|
|
|
2016-05-24 21:47:51 +02:00
|
|
|
Assert(TransactionIdIsNormal(datfrozenxid));
|
|
|
|
Assert(MultiXactIdIsValid(datminmxid));
|
2006-07-10 18:20:52 +02:00
|
|
|
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
/*
|
|
|
|
* If things are working properly, no database should have a
|
|
|
|
* datfrozenxid or datminmxid that is "in the future". However, such
|
|
|
|
* cases have been known to arise due to bugs in pg_upgrade. If we
|
|
|
|
* see any entries that are "in the future", chicken out and don't do
|
|
|
|
* anything. This ensures we won't truncate clog before those
|
|
|
|
* databases have been scanned and cleaned up. (We will issue the
|
|
|
|
* "already wrapped" warning if appropriate, though.)
|
|
|
|
*/
|
2016-05-24 21:47:51 +02:00
|
|
|
if (TransactionIdPrecedes(lastSaneFrozenXid, datfrozenxid) ||
|
|
|
|
MultiXactIdPrecedes(lastSaneMinMulti, datminmxid))
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
bogus = true;
|
|
|
|
|
2016-05-24 21:47:51 +02:00
|
|
|
if (TransactionIdPrecedes(nextXID, datfrozenxid))
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
frozenAlreadyWrapped = true;
|
2016-05-24 21:47:51 +02:00
|
|
|
else if (TransactionIdPrecedes(datfrozenxid, frozenXID))
|
2002-04-02 07:11:55 +02:00
|
|
|
{
|
2016-05-24 21:47:51 +02:00
|
|
|
frozenXID = datfrozenxid;
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
oldestxid_datoid = HeapTupleGetOid(tuple);
|
|
|
|
}
|
|
|
|
|
2016-05-24 21:47:51 +02:00
|
|
|
if (MultiXactIdPrecedes(datminmxid, minMulti))
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
{
|
2016-05-24 21:47:51 +02:00
|
|
|
minMulti = datminmxid;
|
2013-09-16 20:45:00 +02:00
|
|
|
minmulti_datoid = HeapTupleGetOid(tuple);
|
2002-04-02 07:11:55 +02:00
|
|
|
}
|
2001-08-26 18:56:03 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
heap_endscan(scan);
|
|
|
|
|
|
|
|
heap_close(relation, AccessShareLock);
|
|
|
|
|
2002-04-02 07:11:55 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Do not truncate CLOG if we seem to have suffered wraparound already;
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* the computed minimum XID might be bogus. This case should now be
|
|
|
|
* impossible due to the defenses in GetNewTransactionId, but we keep the
|
|
|
|
* test anyway.
|
2002-04-02 07:11:55 +02:00
|
|
|
*/
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
if (frozenAlreadyWrapped)
|
2002-04-02 07:11:55 +02:00
|
|
|
{
|
2003-07-20 23:56:35 +02:00
|
|
|
ereport(WARNING,
|
|
|
|
(errmsg("some databases have not been vacuumed in over 2 billion transactions"),
|
Wording cleanup for error messages. Also change can't -> cannot.
Standard English uses "may", "can", and "might" in different ways:
may - permission, "You may borrow my rake."
can - ability, "I can lift that log."
might - possibility, "It might rain today."
Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice. Similarly, "It may crash" is better stated, "It might crash".
2007-02-01 20:10:30 +01:00
|
|
|
errdetail("You might have already suffered transaction-wraparound data loss.")));
|
2002-04-02 07:11:55 +02:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
Defend against bad relfrozenxid/relminmxid/datfrozenxid/datminmxid values.
In commit a61daa14d56867e90dc011bbba52ef771cea6770, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
2014-07-21 17:41:27 +02:00
|
|
|
/* chicken out if data is bogus in any other way */
|
|
|
|
if (bogus)
|
|
|
|
return;
|
|
|
|
|
Fix race condition in reading commit timestamps
If a user requests the commit timestamp for a transaction old enough
that its data is concurrently being truncated away by vacuum at just the
right time, they would receive an ugly internal file-not-found error
message from slru.c rather than the expected NULL return value.
In a primary server, the window for the race is very small: the lookup
has to occur exactly between the two calls by vacuum, and there's not a
lot that happens between them (mostly just a multixact truncate). In a
standby server, however, the window is larger because the truncation is
executed as soon as the WAL record for it is replayed, but the advance
of the oldest-Xid is not executed until the next checkpoint record.
To fix in the primary, simply reverse the order of operations in
vac_truncate_clog. To fix in the standby, augment the WAL truncation
record so that the standby is aware of the new oldest-XID value and can
apply the update immediately. WAL version bumped because of this.
No backpatch, because of the low importance of the bug and its rarity.
Author: Craig Ringer
Reviewed-By: Petr Jelínek, Peter Eisentraut
Discussion: https://postgr.es/m/CAMsr+YFhVtRQT1VAwC+WGbbxZZRzNou=N9Ed-FrCqkwQ8H8oJQ@mail.gmail.com
2017-01-19 22:23:09 +01:00
|
|
|
/*
|
|
|
|
* Advance the oldest value for commit timestamps before truncating, so
|
|
|
|
* that if a user requests a timestamp for a transaction we're truncating
|
|
|
|
* away right after this point, they get NULL instead of an ugly "file not
|
|
|
|
* found" error from slru.c. This doesn't matter for xact/multixact
|
|
|
|
* because they are not subject to arbitrary lookups from users.
|
|
|
|
*/
|
|
|
|
AdvanceOldestCommitTsXid(frozenXID);
|
|
|
|
|
2014-06-27 20:43:53 +02:00
|
|
|
/*
|
Rework the way multixact truncations work.
The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Amongst others it requires to do computations during
recovery while the database is not in a consistent state, delaying
truncations till checkpoints, and handling members being truncated, but
offset not.
We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for multixact truncations.
This allows:
1) to perform the truncation directly during VACUUM, instead of delaying it
to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery,
we can just use the master's values.
3) simplify a fair amount of logic to keep in memory limits straight,
this has gotten much easier
During the course of fixing this a bunch of additional bugs had to be
fixed:
1) Data was not purged from memory the member's SLRU before deleting
segments. This happened to be hard or impossible to hit due to the
interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but
that doesn't work for offsets that haven't yet been flushed to
disk. Add code to flush the SLRUs to fix. Not pretty, but it feels
slightly safer to only make decisions based on actual on-disk state.
3) find_multixact_start() could be called concurrently with a truncation
and thus fail. Via SetOffsetVacuumLimit() that could lead to a round
of emergency vacuuming. The problem remains in
pg_get_multixact_members(), but that's quite harmless.
For now this is going to only get applied to 9.5+, leaving the issues in
the older branches in place. It is quite possible that we need to
backpatch at a later point though.
For the case this gets backpatched we need to handle that an updated
standby may be replaying WAL from a not-yet upgraded primary. We have to
recognize that situation and use "old style" truncation (i.e. looking at
the SLRUs) during WAL replay. In contrast to before, this now happens in
the startup process, when replaying a checkpoint record, instead of the
checkpointer. Doing truncation in the restartpoint is incorrect, they
can happen much later than the original checkpoint, thereby leading to
wraparound. To avoid "multixact_redo: unknown op code 48" errors
standbys would have to be upgraded before primaries.
A later patch will bump the WAL page magic, and remove the legacy
truncation codepaths. Legacy truncation support is just included to make
a possible future backpatch easier.
Discussion: 20150621192409.GA4797@alap3.anarazel.de
Reviewed-By: Robert Haas, Alvaro Herrera, Thomas Munro
Backpatch: 9.5 for now
2015-09-26 19:04:25 +02:00
|
|
|
* Truncate CLOG, multixact and CommitTs to the oldest computed value.
|
2014-06-27 20:43:53 +02:00
|
|
|
*/
|
2017-03-23 19:08:23 +01:00
|
|
|
TruncateCLOG(frozenXID, oldestxid_datoid);
|
2015-10-27 19:06:50 +01:00
|
|
|
TruncateCommitTs(frozenXID);
|
2015-09-26 19:04:25 +02:00
|
|
|
TruncateMultiXact(minMulti, minmulti_datoid);
|
2006-07-10 18:20:52 +02:00
|
|
|
|
|
|
|
/*
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
* Update the wrap limit for GetNewTransactionId and creation of new
|
2013-05-29 22:58:43 +02:00
|
|
|
* MultiXactIds. Note: these functions will also signal the postmaster
|
2014-05-06 18:12:18 +02:00
|
|
|
* for an(other) autovac cycle if needed. XXX should we avoid possibly
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
* signalling twice?
|
2006-07-10 18:20:52 +02:00
|
|
|
*/
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
SetTransactionIdLimit(frozenXID, oldestxid_datoid);
|
2017-03-14 17:47:46 +01:00
|
|
|
SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
|
2001-08-26 18:56:03 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2001-07-12 06:11:13 +02:00
|
|
|
/*
|
|
|
|
* vacuum_rel() -- vacuum one heap relation
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
1997-09-07 07:04:48 +02:00
|
|
|
* Doing one heap at a time incurs extra overhead, since we need to
|
2014-05-06 18:12:18 +02:00
|
|
|
* check that the heap exists again just before we vacuum it. The
|
1997-09-07 07:04:48 +02:00
|
|
|
* reason that we do this is so that vacuuming can be spread across
|
|
|
|
* many small transactions. Otherwise, two-phase locking would require
|
|
|
|
* us to lock the entire database during one pass of the vacuum cleaner.
|
2000-12-22 01:51:54 +01:00
|
|
|
*
|
|
|
|
* At entry and exit, we are not inside a transaction.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2011-02-08 04:04:29 +01:00
|
|
|
static bool
|
2015-03-18 15:52:33 +01:00
|
|
|
vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2001-07-12 06:11:13 +02:00
|
|
|
LOCKMODE lmode;
|
1997-09-08 04:41:22 +02:00
|
|
|
Relation onerel;
|
2000-12-22 01:51:54 +01:00
|
|
|
LockRelId onerelid;
|
2000-07-05 18:17:43 +02:00
|
|
|
Oid toast_relid;
|
2008-01-03 22:23:15 +01:00
|
|
|
Oid save_userid;
|
Prevent indirect security attacks via changing session-local state within
an allegedly immutable index function. It was previously recognized that
we had to prevent such a function from executing SET/RESET ROLE/SESSION
AUTHORIZATION, or it could trivially obtain the privileges of the session
user. However, since there is in general no privilege checking for changes
of session-local state, it is also possible for such a function to change
settings in a way that might subvert later operations in the same session.
Examples include changing search_path to cause an unexpected function to
be called, or replacing an existing prepared statement with another one
that will execute a function of the attacker's choosing.
The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against
these threats, which are the same places previously deemed to need protection
against the SET ROLE issue. GUC changes are still allowed, since there are
many useful cases for that, but we prevent security problems by forcing a
rollback of any GUC change after completing the operation. Other cases are
handled by throwing an error if any change is attempted; these include temp
table creation, closing a cursor, and creating or deleting a prepared
statement. (In 7.4, the infrastructure to roll back GUC changes doesn't
exist, so we settle for rejecting changes of "search_path" in these contexts.)
Original report and patch by Gurjeet Singh, additional analysis by
Tom Lane.
Security: CVE-2009-4136
2009-12-09 22:57:51 +01:00
|
|
|
int save_sec_context;
|
|
|
|
int save_nestlevel;
|
1998-09-01 06:40:42 +02:00
|
|
|
|
2015-03-18 15:52:33 +01:00
|
|
|
Assert(params != NULL);
|
|
|
|
|
2000-12-22 01:51:54 +01:00
|
|
|
/* Begin a transaction for vacuuming this relation */
|
2003-05-14 05:26:03 +02:00
|
|
|
StartTransactionCommand();
|
2006-07-30 04:07:18 +02:00
|
|
|
|
2008-09-11 16:01:10 +02:00
|
|
|
/*
|
2009-06-11 16:49:15 +02:00
|
|
|
* Functions in indexes may want a snapshot set. Also, setting a snapshot
|
|
|
|
* ensures that RecentGlobalXmin is kept truly recent.
|
2008-09-11 16:01:10 +02:00
|
|
|
*/
|
|
|
|
PushActiveSnapshot(GetTransactionSnapshot());
|
|
|
|
|
2015-03-18 15:52:33 +01:00
|
|
|
if (!(options & VACOPT_FULL))
|
2006-07-30 04:07:18 +02:00
|
|
|
{
|
|
|
|
/*
|
2009-06-11 16:49:15 +02:00
|
|
|
* In lazy vacuum, we can set the PROC_IN_VACUUM flag, which lets
|
|
|
|
* other concurrent VACUUMs know that they can ignore this one while
|
2007-11-15 22:14:46 +01:00
|
|
|
* determining their OldestXmin. (The reason we don't set it during a
|
2010-02-08 05:33:55 +01:00
|
|
|
* full VACUUM is exactly that we may have to run user-defined
|
2007-11-15 22:14:46 +01:00
|
|
|
* functions for functional indexes, and we want to make sure that if
|
|
|
|
* they use the snapshot set above, any tuples it requires can't get
|
|
|
|
* removed from other tables. An index function that depends on the
|
|
|
|
* contents of other tables is arguably broken, but we won't break it
|
|
|
|
* here by violating transaction semantics.)
|
2006-07-30 04:07:18 +02:00
|
|
|
*
|
2009-06-11 16:49:15 +02:00
|
|
|
* We also set the VACUUM_FOR_WRAPAROUND flag, which is passed down by
|
2011-06-29 08:26:14 +02:00
|
|
|
* autovacuum; it's used to avoid canceling a vacuum that was invoked
|
2009-06-11 16:49:15 +02:00
|
|
|
* in an emergency.
|
2008-03-14 18:25:59 +01:00
|
|
|
*
|
2008-08-13 02:07:50 +02:00
|
|
|
* Note: these flags remain set until CommitTransaction or
|
|
|
|
* AbortTransaction. We don't want to clear them until we reset
|
2012-05-14 09:22:44 +02:00
|
|
|
* MyPgXact->xid/xmin, else OldestXmin might appear to go backwards,
|
2006-07-30 04:07:18 +02:00
|
|
|
* which is probably Not Good.
|
|
|
|
*/
|
2007-10-24 22:55:36 +02:00
|
|
|
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
|
2011-11-25 14:02:10 +01:00
|
|
|
MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
|
2015-03-18 15:52:33 +01:00
|
|
|
if (params->is_wraparound)
|
2011-11-25 14:02:10 +01:00
|
|
|
MyPgXact->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
|
2007-10-24 22:55:36 +02:00
|
|
|
LWLockRelease(ProcArrayLock);
|
2006-07-30 04:07:18 +02:00
|
|
|
}
|
1997-09-07 07:04:48 +02:00
|
|
|
|
1999-11-28 03:10:01 +01:00
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* Check for user-requested abort. Note we want this to be inside a
|
2002-03-06 07:10:59 +01:00
|
|
|
* transaction, so xact.c doesn't issue useless WARNING.
|
1999-11-28 03:10:01 +01:00
|
|
|
*/
|
2001-01-14 06:08:17 +01:00
|
|
|
CHECK_FOR_INTERRUPTS();
|
1999-11-28 03:10:01 +01:00
|
|
|
|
1999-09-18 21:08:25 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Determine the type of lock we want --- hard exclusive lock for a FULL
|
2005-11-22 19:17:34 +01:00
|
|
|
* vacuum, but just ShareUpdateExclusiveLock for concurrent vacuum. Either
|
|
|
|
* way, we can be sure that no other backend is vacuuming the same table.
|
2001-07-12 06:11:13 +02:00
|
|
|
*/
|
2015-03-18 15:52:33 +01:00
|
|
|
lmode = (options & VACOPT_FULL) ? AccessExclusiveLock : ShareUpdateExclusiveLock;
|
2001-07-12 06:11:13 +02:00
|
|
|
|
|
|
|
/*
|
2006-08-18 18:09:13 +02:00
|
|
|
* Open the relation and get the appropriate lock on it.
|
|
|
|
*
|
2006-10-04 02:30:14 +02:00
|
|
|
* There's a race condition here: the rel may have gone away since the
|
|
|
|
* last time we saw it. If so, we don't need to vacuum it.
|
2011-02-08 04:04:29 +01:00
|
|
|
*
|
2011-04-10 17:42:00 +02:00
|
|
|
* If we've been asked not to wait for the relation lock, acquire it first
|
|
|
|
* in non-blocking mode, before calling try_relation_open().
|
2006-08-18 18:09:13 +02:00
|
|
|
*/
|
2015-03-18 15:52:33 +01:00
|
|
|
if (!(options & VACOPT_NOWAIT))
|
2011-02-08 04:04:29 +01:00
|
|
|
onerel = try_relation_open(relid, lmode);
|
|
|
|
else if (ConditionalLockRelationOid(relid, lmode))
|
|
|
|
onerel = try_relation_open(relid, NoLock);
|
|
|
|
else
|
|
|
|
{
|
|
|
|
onerel = NULL;
|
2015-04-03 16:55:50 +02:00
|
|
|
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
|
2011-02-08 04:04:29 +01:00
|
|
|
ereport(LOG,
|
|
|
|
(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
errmsg("skipping vacuum of \"%s\" --- lock not available",
|
|
|
|
relation->relname)));
|
2011-02-08 04:04:29 +01:00
|
|
|
}
|
2006-08-18 18:09:13 +02:00
|
|
|
|
|
|
|
if (!onerel)
|
|
|
|
{
|
2008-09-11 16:01:10 +02:00
|
|
|
PopActiveSnapshot();
|
2006-08-18 18:09:13 +02:00
|
|
|
CommitTransactionCommand();
|
2011-02-08 04:04:29 +01:00
|
|
|
return false;
|
2006-08-18 18:09:13 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check permissions.
|
1999-11-29 05:43:15 +01:00
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* We allow the user to vacuum a table if he is superuser, the table
|
|
|
|
* owner, or the database owner (but in the latter case, only if it's not
|
2014-05-06 18:12:18 +02:00
|
|
|
* a shared relation). pg_class_ownercheck includes the superuser case.
|
2001-06-13 23:44:41 +02:00
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* Note we choose to treat permissions failure as a WARNING and keep
|
|
|
|
* trying to vacuum the rest of the DB --- is this appropriate?
|
1999-09-18 21:08:25 +02:00
|
|
|
*/
|
2002-03-22 00:27:25 +01:00
|
|
|
if (!(pg_class_ownercheck(RelationGetRelid(onerel), GetUserId()) ||
|
2003-06-27 16:45:32 +02:00
|
|
|
(pg_database_ownercheck(MyDatabaseId, GetUserId()) && !onerel->rd_rel->relisshared)))
|
1999-11-29 05:43:15 +01:00
|
|
|
{
|
2008-02-20 15:31:35 +01:00
|
|
|
if (onerel->rd_rel->relisshared)
|
|
|
|
ereport(WARNING,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
(errmsg("skipping \"%s\" --- only superuser can vacuum it",
|
|
|
|
RelationGetRelationName(onerel))));
|
2008-02-20 15:31:35 +01:00
|
|
|
else if (onerel->rd_rel->relnamespace == PG_CATALOG_NAMESPACE)
|
|
|
|
ereport(WARNING,
|
|
|
|
(errmsg("skipping \"%s\" --- only superuser or database owner can vacuum it",
|
|
|
|
RelationGetRelationName(onerel))));
|
|
|
|
else
|
|
|
|
ereport(WARNING,
|
|
|
|
(errmsg("skipping \"%s\" --- only table or database owner can vacuum it",
|
|
|
|
RelationGetRelationName(onerel))));
|
2002-04-02 03:03:07 +02:00
|
|
|
relation_close(onerel, lmode);
|
2008-09-11 16:01:10 +02:00
|
|
|
PopActiveSnapshot();
|
2003-05-14 05:26:03 +02:00
|
|
|
CommitTransactionCommand();
|
2011-02-08 04:04:29 +01:00
|
|
|
return false;
|
2002-04-02 03:03:07 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2013-07-05 21:25:51 +02:00
|
|
|
* Check that it's a vacuumable relation; we used to do this in
|
2009-06-11 16:49:15 +02:00
|
|
|
* get_rel_oids() but seems safer to check after we've locked the
|
|
|
|
* relation.
|
2002-04-02 03:03:07 +02:00
|
|
|
*/
|
2008-08-13 02:07:50 +02:00
|
|
|
if (onerel->rd_rel->relkind != RELKIND_RELATION &&
|
2013-03-04 01:23:31 +01:00
|
|
|
onerel->rd_rel->relkind != RELKIND_MATVIEW &&
|
Implement table partitioning.
Table partitioning is like table inheritance and reuses much of the
existing infrastructure, but there are some important differences.
The parent is called a partitioned table and is always empty; it may
not have indexes or non-inherited constraints, since those make no
sense for a relation with no data of its own. The children are called
partitions and contain all of the actual data. Each partition has an
implicit partitioning constraint. Multiple inheritance is not
allowed, and partitioning and inheritance can't be mixed. Partitions
can't have extra columns and may not allow nulls unless the parent
does. Tuples inserted into the parent are automatically routed to the
correct partition, so tuple-routing ON INSERT triggers are not needed.
Tuple routing isn't yet supported for partitions which are foreign
tables, and it doesn't handle updates that cross partition boundaries.
Currently, tables can be range-partitioned or list-partitioned. List
partitioning is limited to a single column, but range partitioning can
involve multiple columns. A partitioning "column" can be an
expression.
Because table partitioning is less general than table inheritance, it
is hoped that it will be easier to reason about properties of
partitions, and therefore that this will serve as a better foundation
for a variety of possible optimizations, including query planner
optimizations. The tuple routing based which this patch does based on
the implicit partitioning constraints is an example of this, but it
seems likely that many other useful optimizations are also possible.
Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat,
Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova,
Rushabh Lathia, Erik Rijkers, among others. Minor revisions by me.
2016-12-07 19:17:43 +01:00
|
|
|
onerel->rd_rel->relkind != RELKIND_TOASTVALUE &&
|
|
|
|
onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
|
2002-04-02 03:03:07 +02:00
|
|
|
{
|
2003-07-20 23:56:35 +02:00
|
|
|
ereport(WARNING,
|
2011-01-02 13:26:10 +01:00
|
|
|
(errmsg("skipping \"%s\" --- cannot vacuum non-tables or special system tables",
|
2003-07-20 23:56:35 +02:00
|
|
|
RelationGetRelationName(onerel))));
|
2002-04-02 03:03:07 +02:00
|
|
|
relation_close(onerel, lmode);
|
2008-09-11 16:01:10 +02:00
|
|
|
PopActiveSnapshot();
|
2003-05-14 05:26:03 +02:00
|
|
|
CommitTransactionCommand();
|
2011-02-08 04:04:29 +01:00
|
|
|
return false;
|
1999-11-29 05:43:15 +01:00
|
|
|
}
|
|
|
|
|
2002-09-23 22:43:41 +02:00
|
|
|
/*
|
|
|
|
* Silently ignore tables that are temp tables of other backends ---
|
|
|
|
* trying to vacuum these will lead to great unhappiness, since their
|
|
|
|
* contents are probably not up-to-date on disk. (We don't throw a
|
|
|
|
* warning here; it would just lead to chatter during a database-wide
|
|
|
|
* VACUUM.)
|
|
|
|
*/
|
2009-04-01 00:12:48 +02:00
|
|
|
if (RELATION_IS_OTHER_TEMP(onerel))
|
2002-09-23 22:43:41 +02:00
|
|
|
{
|
|
|
|
relation_close(onerel, lmode);
|
2008-09-11 16:01:10 +02:00
|
|
|
PopActiveSnapshot();
|
2003-05-14 05:26:03 +02:00
|
|
|
CommitTransactionCommand();
|
2011-02-08 04:04:29 +01:00
|
|
|
return false;
|
2002-09-23 22:43:41 +02:00
|
|
|
}
|
|
|
|
|
2017-03-02 12:48:19 +01:00
|
|
|
/*
|
|
|
|
* Ignore partitioned tables as there is no work to be done. Since we
|
|
|
|
* release the lock here, it's possible that any partitions added from
|
|
|
|
* this point on will not get processed, but that seems harmless.
|
|
|
|
*/
|
|
|
|
if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
|
|
|
|
{
|
|
|
|
relation_close(onerel, lmode);
|
|
|
|
PopActiveSnapshot();
|
|
|
|
CommitTransactionCommand();
|
|
|
|
|
|
|
|
/* It's OK for other commands to look at this table */
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2000-09-19 21:30:03 +02:00
|
|
|
/*
|
2001-07-12 06:11:13 +02:00
|
|
|
* Get a session-level lock too. This will protect our access to the
|
|
|
|
* relation across multiple transactions, so that we can vacuum the
|
2005-10-15 04:49:52 +02:00
|
|
|
* relation's TOAST table (if any) secure in the knowledge that no one is
|
|
|
|
* deleting the parent relation.
|
2000-12-22 01:51:54 +01:00
|
|
|
*
|
|
|
|
* NOTE: this cannot block, even if someone else is waiting for access,
|
|
|
|
* because the lock manager knows that both lock requests are from the
|
|
|
|
* same process.
|
|
|
|
*/
|
|
|
|
onerelid = onerel->rd_lockInfo.lockRelId;
|
2006-07-31 22:09:10 +02:00
|
|
|
LockRelationIdForSession(&onerelid, lmode);
|
2000-12-22 01:51:54 +01:00
|
|
|
|
2004-10-07 16:19:58 +02:00
|
|
|
/*
|
2008-08-13 02:07:50 +02:00
|
|
|
* Remember the relation's TOAST relation for later, if the caller asked
|
2010-02-08 17:50:21 +01:00
|
|
|
* us to process it. In VACUUM FULL, though, the toast table is
|
|
|
|
* automatically rebuilt by cluster_rel so we shouldn't recurse to it.
|
2004-10-07 16:19:58 +02:00
|
|
|
*/
|
2015-03-18 15:52:33 +01:00
|
|
|
if (!(options & VACOPT_SKIPTOAST) && !(options & VACOPT_FULL))
|
2008-08-13 02:07:50 +02:00
|
|
|
toast_relid = onerel->rd_rel->reltoastrelid;
|
|
|
|
else
|
|
|
|
toast_relid = InvalidOid;
|
2000-09-19 21:30:03 +02:00
|
|
|
|
2008-01-03 22:23:15 +01:00
|
|
|
/*
|
2009-06-11 16:49:15 +02:00
|
|
|
* Switch to the table owner's userid, so that any index functions are run
|
Prevent indirect security attacks via changing session-local state within
an allegedly immutable index function. It was previously recognized that
we had to prevent such a function from executing SET/RESET ROLE/SESSION
AUTHORIZATION, or it could trivially obtain the privileges of the session
user. However, since there is in general no privilege checking for changes
of session-local state, it is also possible for such a function to change
settings in a way that might subvert later operations in the same session.
Examples include changing search_path to cause an unexpected function to
be called, or replacing an existing prepared statement with another one
that will execute a function of the attacker's choosing.
The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against
these threats, which are the same places previously deemed to need protection
against the SET ROLE issue. GUC changes are still allowed, since there are
many useful cases for that, but we prevent security problems by forcing a
rollback of any GUC change after completing the operation. Other cases are
handled by throwing an error if any change is attempted; these include temp
table creation, closing a cursor, and creating or deleting a prepared
statement. (In 7.4, the infrastructure to roll back GUC changes doesn't
exist, so we settle for rejecting changes of "search_path" in these contexts.)
Original report and patch by Gurjeet Singh, additional analysis by
Tom Lane.
Security: CVE-2009-4136
2009-12-09 22:57:51 +01:00
|
|
|
* as that user. Also lock down security-restricted operations and
|
2010-02-26 03:01:40 +01:00
|
|
|
* arrange to make GUC variable changes local to this command. (This is
|
|
|
|
* unnecessary, but harmless, for lazy VACUUM.)
|
2008-01-03 22:23:15 +01:00
|
|
|
*/
|
Prevent indirect security attacks via changing session-local state within
an allegedly immutable index function. It was previously recognized that
we had to prevent such a function from executing SET/RESET ROLE/SESSION
AUTHORIZATION, or it could trivially obtain the privileges of the session
user. However, since there is in general no privilege checking for changes
of session-local state, it is also possible for such a function to change
settings in a way that might subvert later operations in the same session.
Examples include changing search_path to cause an unexpected function to
be called, or replacing an existing prepared statement with another one
that will execute a function of the attacker's choosing.
The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against
these threats, which are the same places previously deemed to need protection
against the SET ROLE issue. GUC changes are still allowed, since there are
many useful cases for that, but we prevent security problems by forcing a
rollback of any GUC change after completing the operation. Other cases are
handled by throwing an error if any change is attempted; these include temp
table creation, closing a cursor, and creating or deleting a prepared
statement. (In 7.4, the infrastructure to roll back GUC changes doesn't
exist, so we settle for rejecting changes of "search_path" in these contexts.)
Original report and patch by Gurjeet Singh, additional analysis by
Tom Lane.
Security: CVE-2009-4136
2009-12-09 22:57:51 +01:00
|
|
|
GetUserIdAndSecContext(&save_userid, &save_sec_context);
|
|
|
|
SetUserIdAndSecContext(onerel->rd_rel->relowner,
|
|
|
|
save_sec_context | SECURITY_RESTRICTED_OPERATION);
|
|
|
|
save_nestlevel = NewGUCNestLevel();
|
2008-01-03 22:23:15 +01:00
|
|
|
|
2004-10-07 16:19:58 +02:00
|
|
|
/*
|
2010-02-08 05:33:55 +01:00
|
|
|
* Do the actual work --- either FULL or "lazy" vacuum
|
2004-10-07 16:19:58 +02:00
|
|
|
*/
|
2015-03-18 15:52:33 +01:00
|
|
|
if (options & VACOPT_FULL)
|
2010-01-06 06:31:14 +01:00
|
|
|
{
|
2010-02-08 05:33:55 +01:00
|
|
|
/* close relation before vacuuming, but hold lock until commit */
|
2010-01-06 06:31:14 +01:00
|
|
|
relation_close(onerel, NoLock);
|
|
|
|
onerel = NULL;
|
|
|
|
|
2010-02-08 05:33:55 +01:00
|
|
|
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
|
2010-01-06 06:31:14 +01:00
|
|
|
cluster_rel(relid, InvalidOid, false,
|
2015-03-18 15:52:33 +01:00
|
|
|
(options & VACOPT_VERBOSE) != 0);
|
2010-01-06 06:31:14 +01:00
|
|
|
}
|
2010-02-08 05:33:55 +01:00
|
|
|
else
|
2015-03-18 15:52:33 +01:00
|
|
|
lazy_vacuum_rel(onerel, options, params, vac_strategy);
|
2006-08-18 18:09:13 +02:00
|
|
|
|
Prevent indirect security attacks via changing session-local state within
an allegedly immutable index function. It was previously recognized that
we had to prevent such a function from executing SET/RESET ROLE/SESSION
AUTHORIZATION, or it could trivially obtain the privileges of the session
user. However, since there is in general no privilege checking for changes
of session-local state, it is also possible for such a function to change
settings in a way that might subvert later operations in the same session.
Examples include changing search_path to cause an unexpected function to
be called, or replacing an existing prepared statement with another one
that will execute a function of the attacker's choosing.
The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against
these threats, which are the same places previously deemed to need protection
against the SET ROLE issue. GUC changes are still allowed, since there are
many useful cases for that, but we prevent security problems by forcing a
rollback of any GUC change after completing the operation. Other cases are
handled by throwing an error if any change is attempted; these include temp
table creation, closing a cursor, and creating or deleting a prepared
statement. (In 7.4, the infrastructure to roll back GUC changes doesn't
exist, so we settle for rejecting changes of "search_path" in these contexts.)
Original report and patch by Gurjeet Singh, additional analysis by
Tom Lane.
Security: CVE-2009-4136
2009-12-09 22:57:51 +01:00
|
|
|
/* Roll back any GUC changes executed by index functions */
|
|
|
|
AtEOXact_GUC(false, save_nestlevel);
|
|
|
|
|
|
|
|
/* Restore userid and security context */
|
|
|
|
SetUserIdAndSecContext(save_userid, save_sec_context);
|
2008-01-03 22:23:15 +01:00
|
|
|
|
2001-07-12 06:11:13 +02:00
|
|
|
/* all done with this class, but hold lock until commit */
|
2010-01-06 06:31:14 +01:00
|
|
|
if (onerel)
|
|
|
|
relation_close(onerel, NoLock);
|
2001-07-12 06:11:13 +02:00
|
|
|
|
2004-10-07 16:19:58 +02:00
|
|
|
/*
|
|
|
|
* Complete the transaction and free all temporary memory used.
|
|
|
|
*/
|
2008-09-11 16:01:10 +02:00
|
|
|
PopActiveSnapshot();
|
2003-05-14 05:26:03 +02:00
|
|
|
CommitTransactionCommand();
|
2001-07-12 06:11:13 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the relation has a secondary toast rel, vacuum that too while we
|
|
|
|
* still hold the session lock on the master table. Note however that
|
2014-05-06 18:12:18 +02:00
|
|
|
* "analyze" will not get done on the toast table. This is good, because
|
2005-10-15 04:49:52 +02:00
|
|
|
* the toaster always uses hardcoded index access and statistics are
|
|
|
|
* totally unimportant for toast relations.
|
2001-07-12 06:11:13 +02:00
|
|
|
*/
|
|
|
|
if (toast_relid != InvalidOid)
|
2015-03-18 15:52:33 +01:00
|
|
|
vacuum_rel(toast_relid, relation, options, params);
|
2001-07-12 06:11:13 +02:00
|
|
|
|
2004-10-07 16:19:58 +02:00
|
|
|
/*
|
|
|
|
* Now release the session-level lock on the master table.
|
|
|
|
*/
|
2006-07-31 22:09:10 +02:00
|
|
|
UnlockRelationIdForSession(&onerelid, lmode);
|
2011-02-08 04:04:29 +01:00
|
|
|
|
|
|
|
/* Report that we really did it. */
|
|
|
|
return true;
|
2001-07-12 06:11:13 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-29 03:25:27 +01:00
|
|
|
* Open all the vacuumable indexes of the given relation, obtaining the
|
2014-05-06 18:12:18 +02:00
|
|
|
* specified kind of lock on each. Return an array of Relation pointers for
|
Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-29 03:25:27 +01:00
|
|
|
* the indexes into *Irel, and the number of indexes into *nindexes.
|
|
|
|
*
|
|
|
|
* We consider an index vacuumable if it is marked insertable (IndexIsReady).
|
|
|
|
* If it isn't, probably a CREATE INDEX CONCURRENTLY command failed early in
|
|
|
|
* execution, and what we have is too corrupt to be processable. We will
|
|
|
|
* vacuum even if the index isn't indisvalid; this is important because in a
|
|
|
|
* unique index, uniqueness checks will be performed anyway and had better not
|
|
|
|
* hit dangling index pointers.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2010-02-08 05:33:55 +01:00
|
|
|
void
|
|
|
|
vac_open_indexes(Relation relation, LOCKMODE lockmode,
|
|
|
|
int *nindexes, Relation **Irel)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2010-02-08 05:33:55 +01:00
|
|
|
List *indexoidlist;
|
|
|
|
ListCell *indexoidscan;
|
|
|
|
int i;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2010-02-08 05:33:55 +01:00
|
|
|
Assert(lockmode != NoLock);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2010-02-08 05:33:55 +01:00
|
|
|
indexoidlist = RelationGetIndexList(relation);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-29 03:25:27 +01:00
|
|
|
/* allocate enough memory for all indexes */
|
|
|
|
i = list_length(indexoidlist);
|
2004-06-08 15:59:36 +02:00
|
|
|
|
Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-29 03:25:27 +01:00
|
|
|
if (i > 0)
|
|
|
|
*Irel = (Relation *) palloc(i * sizeof(Relation));
|
2001-06-29 22:14:27 +02:00
|
|
|
else
|
2010-02-08 05:33:55 +01:00
|
|
|
*Irel = NULL;
|
1996-10-18 10:13:36 +02:00
|
|
|
|
Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-29 03:25:27 +01:00
|
|
|
/* collect just the ready indexes */
|
2010-02-08 05:33:55 +01:00
|
|
|
i = 0;
|
|
|
|
foreach(indexoidscan, indexoidlist)
|
1999-03-28 22:32:42 +02:00
|
|
|
{
|
2010-02-08 05:33:55 +01:00
|
|
|
Oid indexoid = lfirst_oid(indexoidscan);
|
Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-29 03:25:27 +01:00
|
|
|
Relation indrel;
|
2010-02-08 05:33:55 +01:00
|
|
|
|
Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-29 03:25:27 +01:00
|
|
|
indrel = index_open(indexoid, lockmode);
|
|
|
|
if (IndexIsReady(indrel->rd_index))
|
|
|
|
(*Irel)[i++] = indrel;
|
|
|
|
else
|
|
|
|
index_close(indrel, lockmode);
|
1999-03-28 22:32:42 +02:00
|
|
|
}
|
|
|
|
|
Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-29 03:25:27 +01:00
|
|
|
*nindexes = i;
|
|
|
|
|
2010-02-08 05:33:55 +01:00
|
|
|
list_free(indexoidlist);
|
2000-05-29 03:46:00 +02:00
|
|
|
}
|
1996-11-27 08:27:20 +01:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* Release the resources acquired by vac_open_indexes. Optionally release
|
2010-02-08 05:33:55 +01:00
|
|
|
* the locks (say NoLock to keep 'em).
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2010-02-08 05:33:55 +01:00
|
|
|
void
|
|
|
|
vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2010-02-08 05:33:55 +01:00
|
|
|
if (Irel == NULL)
|
|
|
|
return;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2010-02-08 05:33:55 +01:00
|
|
|
while (nindexes--)
|
1996-11-27 08:27:20 +01:00
|
|
|
{
|
2010-02-08 05:33:55 +01:00
|
|
|
Relation ind = Irel[nindexes];
|
2004-10-01 01:21:26 +02:00
|
|
|
|
2006-07-31 22:09:10 +02:00
|
|
|
index_close(ind, lockmode);
|
2004-10-01 01:21:26 +02:00
|
|
|
}
|
1997-09-07 07:04:48 +02:00
|
|
|
pfree(Irel);
|
2000-05-29 03:46:00 +02:00
|
|
|
}
|
1996-11-27 08:27:20 +01:00
|
|
|
|
2004-02-10 04:42:45 +01:00
|
|
|
/*
|
|
|
|
* vacuum_delay_point --- check for interrupts and cost-based delay.
|
|
|
|
*
|
|
|
|
* This should be called in each major loop of VACUUM processing,
|
|
|
|
* typically once per page processed.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
vacuum_delay_point(void)
|
|
|
|
{
|
|
|
|
/* Always check for interrupts */
|
|
|
|
CHECK_FOR_INTERRUPTS();
|
|
|
|
|
|
|
|
/* Nap if appropriate */
|
|
|
|
if (VacuumCostActive && !InterruptPending &&
|
|
|
|
VacuumCostBalance >= VacuumCostLimit)
|
|
|
|
{
|
2004-08-29 07:07:03 +02:00
|
|
|
int msec;
|
2004-02-10 04:42:45 +01:00
|
|
|
|
2004-08-06 06:15:09 +02:00
|
|
|
msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
|
|
|
|
if (msec > VacuumCostDelay * 4)
|
|
|
|
msec = VacuumCostDelay * 4;
|
2004-02-10 04:42:45 +01:00
|
|
|
|
|
|
|
pg_usleep(msec * 1000L);
|
|
|
|
|
|
|
|
VacuumCostBalance = 0;
|
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
/* update balance values for workers */
|
|
|
|
AutoVacuumUpdateDelay();
|
|
|
|
|
2004-02-10 04:42:45 +01:00
|
|
|
/* Might have gotten an interrupt while sleeping */
|
|
|
|
CHECK_FOR_INTERRUPTS();
|
|
|
|
}
|
|
|
|
}
|