postgresql/src/backend/commands/sequence.c

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

1903 lines
53 KiB
C
Raw Normal View History

1997-04-02 05:51:23 +02:00
/*-------------------------------------------------------------------------
*
* sequence.c
1997-04-02 05:51:23 +02:00
* PostgreSQL sequences support code.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*
* IDENTIFICATION
2010-09-20 22:08:53 +02:00
* src/backend/commands/sequence.c
*
1997-04-02 05:51:23 +02:00
*-------------------------------------------------------------------------
*/
#include "postgres.h"
1997-04-02 05:51:23 +02:00
#include "access/bufmask.h"
#include "access/htup_details.h"
Improve concurrency of foreign key locking This patch introduces two additional lock modes for tuples: "SELECT FOR KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each other, in contrast with already existing "SELECT FOR SHARE" and "SELECT FOR UPDATE". UPDATE commands that do not modify the values stored in the columns that are part of the key of the tuple now grab a SELECT FOR NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently with tuple locks of the FOR KEY SHARE variety. Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this means the concurrency improvement applies to them, which is the whole point of this patch. The added tuple lock semantics require some rejiggering of the multixact module, so that the locking level that each transaction is holding can be stored alongside its Xid. Also, multixacts now need to persist across server restarts and crashes, because they can now represent not only tuple locks, but also tuple updates. This means we need more careful tracking of lifetime of pg_multixact SLRU files; since they now persist longer, we require more infrastructure to figure out when they can be removed. pg_upgrade also needs to be careful to copy pg_multixact files over from the old server to the new, or at least part of multixact.c state, depending on the versions of the old and new servers. Tuple time qualification rules (HeapTupleSatisfies routines) need to be careful not to consider tuples with the "is multi" infomask bit set as being only locked; they might need to look up MultiXact values (i.e. possibly do pg_multixact I/O) to find out the Xid that updated a tuple, whereas they previously were assured to only use information readily available from the tuple header. This is considered acceptable, because the extra I/O would involve cases that would previously cause some commands to block waiting for concurrent transactions to finish. Another important change is the fact that locking tuples that have previously been updated causes the future versions to be marked as locked, too; this is essential for correctness of foreign key checks. This causes additional WAL-logging, also (there was previously a single WAL record for a locked tuple; now there are as many as updated copies of the tuple there exist.) With all this in place, contention related to tuples being checked by foreign key rules should be much reduced. As a bonus, the old behavior that a subtransaction grabbing a stronger tuple lock than the parent (sub)transaction held on a given tuple and later aborting caused the weaker lock to be lost, has been fixed. Many new spec files were added for isolation tester framework, to ensure overall behavior is sane. There's probably room for several more tests. There were several reviewers of this patch; in particular, Noah Misch and Andres Freund spent considerable time in it. Original idea for the patch came from Simon Riggs, after a problem report by Joel Jacobson. Most code is from me, with contributions from Marti Raudsepp, Alexander Shulgin, Noah Misch and Andres Freund. This patch was discussed in several pgsql-hackers threads; the most important start at the following message-ids: AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com 1290721684-sup-3951@alvh.no-ip.org 1294953201-sup-2099@alvh.no-ip.org 1320343602-sup-2290@alvh.no-ip.org 1339690386-sup-8927@alvh.no-ip.org 4FE5FF020200002500048A3D@gw.wicourts.gov 4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
#include "access/multixact.h"
#include "access/relation.h"
#include "access/table.h"
Improve concurrency of foreign key locking This patch introduces two additional lock modes for tuples: "SELECT FOR KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each other, in contrast with already existing "SELECT FOR SHARE" and "SELECT FOR UPDATE". UPDATE commands that do not modify the values stored in the columns that are part of the key of the tuple now grab a SELECT FOR NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently with tuple locks of the FOR KEY SHARE variety. Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this means the concurrency improvement applies to them, which is the whole point of this patch. The added tuple lock semantics require some rejiggering of the multixact module, so that the locking level that each transaction is holding can be stored alongside its Xid. Also, multixacts now need to persist across server restarts and crashes, because they can now represent not only tuple locks, but also tuple updates. This means we need more careful tracking of lifetime of pg_multixact SLRU files; since they now persist longer, we require more infrastructure to figure out when they can be removed. pg_upgrade also needs to be careful to copy pg_multixact files over from the old server to the new, or at least part of multixact.c state, depending on the versions of the old and new servers. Tuple time qualification rules (HeapTupleSatisfies routines) need to be careful not to consider tuples with the "is multi" infomask bit set as being only locked; they might need to look up MultiXact values (i.e. possibly do pg_multixact I/O) to find out the Xid that updated a tuple, whereas they previously were assured to only use information readily available from the tuple header. This is considered acceptable, because the extra I/O would involve cases that would previously cause some commands to block waiting for concurrent transactions to finish. Another important change is the fact that locking tuples that have previously been updated causes the future versions to be marked as locked, too; this is essential for correctness of foreign key checks. This causes additional WAL-logging, also (there was previously a single WAL record for a locked tuple; now there are as many as updated copies of the tuple there exist.) With all this in place, contention related to tuples being checked by foreign key rules should be much reduced. As a bonus, the old behavior that a subtransaction grabbing a stronger tuple lock than the parent (sub)transaction held on a given tuple and later aborting caused the weaker lock to be lost, has been fixed. Many new spec files were added for isolation tester framework, to ensure overall behavior is sane. There's probably room for several more tests. There were several reviewers of this patch; in particular, Noah Misch and Andres Freund spent considerable time in it. Original idea for the patch came from Simon Riggs, after a problem report by Joel Jacobson. Most code is from me, with contributions from Marti Raudsepp, Alexander Shulgin, Noah Misch and Andres Freund. This patch was discussed in several pgsql-hackers threads; the most important start at the following message-ids: AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com 1290721684-sup-3951@alvh.no-ip.org 1294953201-sup-2099@alvh.no-ip.org 1320343602-sup-2290@alvh.no-ip.org 1339690386-sup-8927@alvh.no-ip.org 4FE5FF020200002500048A3D@gw.wicourts.gov 4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
#include "access/transam.h"
Reconsider when to wait for WAL flushes/syncrep during commit. Up to now RecordTransactionCommit() waited for WAL to be flushed (if synchronous_commit != off) and to be synchronously replicated (if enabled), even if a transaction did not have a xid assigned. The primary reason for that is that sequence's nextval() did not assign a xid, but are worthwhile to wait for on commit. This can be problematic because sometimes read only transactions do write WAL, e.g. HOT page prune records. That then could lead to read only transactions having to wait during commit. Not something people expect in a read only transaction. This lead to such strange symptoms as backends being seemingly stuck during connection establishment when all synchronous replicas are down. Especially annoying when said stuck connection is the standby trying to reconnect to allow syncrep again... This behavior also is involved in a rather complicated <= 9.4 bug where the transaction started by catchup interrupt processing waited for syncrep using latches, but didn't get the wakeup because it was already running inside the same overloaded signal handler. Fix the issue here doesn't properly solve that issue, merely papers over the problems. In 9.5 catchup interrupts aren't processed out of signal handlers anymore. To fix all this, make nextval() acquire a top level xid, and only wait for transaction commit if a transaction both acquired a xid and emitted WAL records. If only a xid has been assigned we don't uselessly want to wait just because of writes to temporary/unlogged tables; if only WAL has been written we don't want to wait just because of HOT prunes. The xid assignment in nextval() is unlikely to cause overhead in real-world workloads. For one it only happens SEQ_LOG_VALS/32 values anyway, for another only usage of nextval() without using the result in an insert or similar is affected. Discussion: 20150223165359.GF30784@awork2.anarazel.de, 369698E947874884A77849D8FE3680C2@maumau, 5CF4ABBA67674088B3941894E22A0D25@maumau Per complaint from maumau and Thom Brown Backpatch all the way back; 9.0 doesn't have syncrep, but it seems better to be consistent behavior across all maintained branches.
2015-02-26 12:50:07 +01:00
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/dependency.h"
#include "catalog/indexing.h"
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_sequence.h"
#include "catalog/pg_type.h"
#include "catalog/storage_xlog.h"
#include "commands/defrem.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "funcapi.h"
1999-07-16 07:00:38 +02:00
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "parser/parse_type.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/smgr.h"
#include "utils/acl.h"
1999-07-16 07:00:38 +02:00
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/resowner.h"
#include "utils/syscache.h"
#include "utils/varlena.h"
/*
* We don't want to log each fetching of a value from a sequence,
* so we pre-log a few fetches in advance. In the event of
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
* crash we can lose (skip over) as many values as we pre-logged.
*/
#define SEQ_LOG_VALS 32
/*
* The "special area" of a sequence's buffer page looks like this.
*/
#define SEQ_MAGIC 0x1717
1997-04-02 05:51:23 +02:00
typedef struct sequence_magic
{
uint32 magic;
} sequence_magic;
/*
* We store a SeqTable item for every sequence we have touched in the current
* session. This is needed to hold onto nextval/currval state. (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
*/
1997-04-02 05:51:23 +02:00
typedef struct SeqTableData
{
Oid relid; /* pg_class OID of this sequence (hash key) */
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
RelFileNumber filenumber; /* last seen relfilenumber of this sequence */
LocalTransactionId lxid; /* xact in which we last did a seq op */
bool last_valid; /* do we have a valid "last" value? */
int64 last; /* value last returned by nextval */
int64 cached; /* last value already cached for nextval */
/* if last != cached, we have not used up all the cached values */
int64 increment; /* copy of sequence's increment field */
/* note that increment is zero until we first do nextval_internal() */
1997-04-02 05:51:23 +02:00
} SeqTableData;
typedef SeqTableData *SeqTable;
static HTAB *seqhashtab = NULL; /* hash table for SeqTable items */
1997-04-02 05:51:23 +02:00
/*
* last_used_seq is updated by nextval() to point to the last used
* sequence.
*/
static SeqTableData *last_used_seq = NULL;
2010-11-17 22:42:18 +01:00
static void fill_seq_with_data(Relation rel, HeapTuple tuple);
static void fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum);
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
static Relation lock_and_open_sequence(SeqTable seq);
static void create_seq_hashtable(void);
static void init_sequence(Oid relid, SeqTable *p_elm, Relation *p_rel);
static Form_pg_sequence_data read_seq_tuple(Relation rel,
Buffer *buf, HeapTuple seqdatatuple);
static void init_params(ParseState *pstate, List *options, bool for_identity,
bool isInit,
Form_pg_sequence seqform,
Form_pg_sequence_data seqdataform,
bool *need_seq_rewrite,
List **owned_by);
static void do_setval(Oid relid, int64 next, bool iscalled);
static void process_owned_by(Relation seqrel, List *owned_by, bool for_identity);
1997-04-02 05:51:23 +02:00
/*
* DefineSequence
1997-04-02 05:51:23 +02:00
* Creates a new sequence relation
*/
ObjectAddress
DefineSequence(ParseState *pstate, CreateSeqStmt *seq)
1997-04-02 05:51:23 +02:00
{
FormData_pg_sequence seqform;
FormData_pg_sequence_data seqdataform;
bool need_seq_rewrite;
List *owned_by;
1997-04-02 05:51:23 +02:00
CreateStmt *stmt = makeNode(CreateStmt);
Oid seqoid;
ObjectAddress address;
1997-04-02 05:51:23 +02:00
Relation rel;
HeapTuple tuple;
TupleDesc tupDesc;
Datum value[SEQ_COL_LASTCOL];
bool null[SEQ_COL_LASTCOL];
Datum pgs_values[Natts_pg_sequence];
bool pgs_nulls[Natts_pg_sequence];
1997-04-02 05:51:23 +02:00
int i;
/*
* If if_not_exists was given and a relation with the same name already
* exists, bail out. (Note: we needn't check this when not if_not_exists,
* because DefineRelation will complain anyway.)
*/
if (seq->if_not_exists)
{
RangeVarGetAndCheckCreationNamespace(seq->sequence, NoLock, &seqoid);
if (OidIsValid(seqoid))
{
/*
* If we are in an extension script, insist that the pre-existing
* object be a member of the extension, to avoid security risks.
*/
ObjectAddressSet(address, RelationRelationId, seqoid);
checkMembershipInCurrentExtension(&address);
/* OK to skip */
ereport(NOTICE,
(errcode(ERRCODE_DUPLICATE_TABLE),
errmsg("relation \"%s\" already exists, skipping",
seq->sequence->relname)));
return InvalidObjectAddress;
}
}
/* Check and set all option values */
init_params(pstate, seq->options, seq->for_identity, true,
&seqform, &seqdataform,
&need_seq_rewrite, &owned_by);
1997-04-02 05:51:23 +02:00
/*
* Create relation (and fill value[] and null[] for the tuple)
1997-04-02 05:51:23 +02:00
*/
stmt->tableElts = NIL;
for (i = SEQ_COL_FIRSTCOL; i <= SEQ_COL_LASTCOL; i++)
1997-04-02 05:51:23 +02:00
{
ColumnDef *coldef = makeNode(ColumnDef);
coldef->inhcount = 0;
coldef->is_local = true;
coldef->is_not_null = true;
Remove collation information from TypeName, where it does not belong. The initial collations patch treated a COLLATE spec as part of a TypeName, following what can only be described as brain fade on the part of the SQL committee. It's a lot more reasonable to treat COLLATE as a syntactically separate object, so that it can be added in only the productions where it actually belongs, rather than needing to reject it in a boatload of places where it doesn't belong (something the original patch mostly failed to do). In addition this change lets us meet the spec's requirement to allow COLLATE anywhere in the clauses of a ColumnDef, and it avoids unfriendly behavior for constructs such as "foo::type COLLATE collation". To do this, pull collation information out of TypeName and put it in ColumnDef instead, thus reverting most of the collation-related changes in parse_type.c's API. I made one additional structural change, which was to use a ColumnDef as an intermediate node in AT_AlterColumnType AlterTableCmd nodes. This provides enough room to get rid of the "transform" wart in AlterTableCmd too, since the ColumnDef can carry the USING expression easily enough. Also fix some other minor bugs that have crept in in the same areas, like failure to copy recently-added fields of ColumnDef in copyfuncs.c. While at it, document the formerly secret ability to specify a collation in ALTER TABLE ALTER COLUMN TYPE, ALTER TYPE ADD ATTRIBUTE, and ALTER TYPE ALTER ATTRIBUTE TYPE; and correct some misstatements about what the default collation selection will be when COLLATE is omitted. BTW, the three-parameter form of format_type() should go away too, since it just contributes to the confusion in this area; but I'll do that in a separate patch.
2011-03-10 04:38:52 +01:00
coldef->is_from_type = false;
coldef->storage = 0;
coldef->raw_default = NULL;
coldef->cooked_default = NULL;
Remove collation information from TypeName, where it does not belong. The initial collations patch treated a COLLATE spec as part of a TypeName, following what can only be described as brain fade on the part of the SQL committee. It's a lot more reasonable to treat COLLATE as a syntactically separate object, so that it can be added in only the productions where it actually belongs, rather than needing to reject it in a boatload of places where it doesn't belong (something the original patch mostly failed to do). In addition this change lets us meet the spec's requirement to allow COLLATE anywhere in the clauses of a ColumnDef, and it avoids unfriendly behavior for constructs such as "foo::type COLLATE collation". To do this, pull collation information out of TypeName and put it in ColumnDef instead, thus reverting most of the collation-related changes in parse_type.c's API. I made one additional structural change, which was to use a ColumnDef as an intermediate node in AT_AlterColumnType AlterTableCmd nodes. This provides enough room to get rid of the "transform" wart in AlterTableCmd too, since the ColumnDef can carry the USING expression easily enough. Also fix some other minor bugs that have crept in in the same areas, like failure to copy recently-added fields of ColumnDef in copyfuncs.c. While at it, document the formerly secret ability to specify a collation in ALTER TABLE ALTER COLUMN TYPE, ALTER TYPE ADD ATTRIBUTE, and ALTER TYPE ALTER ATTRIBUTE TYPE; and correct some misstatements about what the default collation selection will be when COLLATE is omitted. BTW, the three-parameter form of format_type() should go away too, since it just contributes to the confusion in this area; but I'll do that in a separate patch.
2011-03-10 04:38:52 +01:00
coldef->collClause = NULL;
coldef->collOid = InvalidOid;
coldef->constraints = NIL;
coldef->location = -1;
null[i - 1] = false;
1997-04-02 05:51:23 +02:00
switch (i)
{
1997-04-02 05:51:23 +02:00
case SEQ_COL_LASTVAL:
Remove collation information from TypeName, where it does not belong. The initial collations patch treated a COLLATE spec as part of a TypeName, following what can only be described as brain fade on the part of the SQL committee. It's a lot more reasonable to treat COLLATE as a syntactically separate object, so that it can be added in only the productions where it actually belongs, rather than needing to reject it in a boatload of places where it doesn't belong (something the original patch mostly failed to do). In addition this change lets us meet the spec's requirement to allow COLLATE anywhere in the clauses of a ColumnDef, and it avoids unfriendly behavior for constructs such as "foo::type COLLATE collation". To do this, pull collation information out of TypeName and put it in ColumnDef instead, thus reverting most of the collation-related changes in parse_type.c's API. I made one additional structural change, which was to use a ColumnDef as an intermediate node in AT_AlterColumnType AlterTableCmd nodes. This provides enough room to get rid of the "transform" wart in AlterTableCmd too, since the ColumnDef can carry the USING expression easily enough. Also fix some other minor bugs that have crept in in the same areas, like failure to copy recently-added fields of ColumnDef in copyfuncs.c. While at it, document the formerly secret ability to specify a collation in ALTER TABLE ALTER COLUMN TYPE, ALTER TYPE ADD ATTRIBUTE, and ALTER TYPE ALTER ATTRIBUTE TYPE; and correct some misstatements about what the default collation selection will be when COLLATE is omitted. BTW, the three-parameter form of format_type() should go away too, since it just contributes to the confusion in this area; but I'll do that in a separate patch.
2011-03-10 04:38:52 +01:00
coldef->typeName = makeTypeNameFromOid(INT8OID, -1);
1997-04-02 05:51:23 +02:00
coldef->colname = "last_value";
value[i - 1] = Int64GetDatumFast(seqdataform.last_value);
1997-04-02 05:51:23 +02:00
break;
case SEQ_COL_LOG:
Remove collation information from TypeName, where it does not belong. The initial collations patch treated a COLLATE spec as part of a TypeName, following what can only be described as brain fade on the part of the SQL committee. It's a lot more reasonable to treat COLLATE as a syntactically separate object, so that it can be added in only the productions where it actually belongs, rather than needing to reject it in a boatload of places where it doesn't belong (something the original patch mostly failed to do). In addition this change lets us meet the spec's requirement to allow COLLATE anywhere in the clauses of a ColumnDef, and it avoids unfriendly behavior for constructs such as "foo::type COLLATE collation". To do this, pull collation information out of TypeName and put it in ColumnDef instead, thus reverting most of the collation-related changes in parse_type.c's API. I made one additional structural change, which was to use a ColumnDef as an intermediate node in AT_AlterColumnType AlterTableCmd nodes. This provides enough room to get rid of the "transform" wart in AlterTableCmd too, since the ColumnDef can carry the USING expression easily enough. Also fix some other minor bugs that have crept in in the same areas, like failure to copy recently-added fields of ColumnDef in copyfuncs.c. While at it, document the formerly secret ability to specify a collation in ALTER TABLE ALTER COLUMN TYPE, ALTER TYPE ADD ATTRIBUTE, and ALTER TYPE ALTER ATTRIBUTE TYPE; and correct some misstatements about what the default collation selection will be when COLLATE is omitted. BTW, the three-parameter form of format_type() should go away too, since it just contributes to the confusion in this area; but I'll do that in a separate patch.
2011-03-10 04:38:52 +01:00
coldef->typeName = makeTypeNameFromOid(INT8OID, -1);
coldef->colname = "log_cnt";
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
value[i - 1] = Int64GetDatum((int64) 0);
break;
1997-04-02 05:51:23 +02:00
case SEQ_COL_CALLED:
Remove collation information from TypeName, where it does not belong. The initial collations patch treated a COLLATE spec as part of a TypeName, following what can only be described as brain fade on the part of the SQL committee. It's a lot more reasonable to treat COLLATE as a syntactically separate object, so that it can be added in only the productions where it actually belongs, rather than needing to reject it in a boatload of places where it doesn't belong (something the original patch mostly failed to do). In addition this change lets us meet the spec's requirement to allow COLLATE anywhere in the clauses of a ColumnDef, and it avoids unfriendly behavior for constructs such as "foo::type COLLATE collation". To do this, pull collation information out of TypeName and put it in ColumnDef instead, thus reverting most of the collation-related changes in parse_type.c's API. I made one additional structural change, which was to use a ColumnDef as an intermediate node in AT_AlterColumnType AlterTableCmd nodes. This provides enough room to get rid of the "transform" wart in AlterTableCmd too, since the ColumnDef can carry the USING expression easily enough. Also fix some other minor bugs that have crept in in the same areas, like failure to copy recently-added fields of ColumnDef in copyfuncs.c. While at it, document the formerly secret ability to specify a collation in ALTER TABLE ALTER COLUMN TYPE, ALTER TYPE ADD ATTRIBUTE, and ALTER TYPE ALTER ATTRIBUTE TYPE; and correct some misstatements about what the default collation selection will be when COLLATE is omitted. BTW, the three-parameter form of format_type() should go away too, since it just contributes to the confusion in this area; but I'll do that in a separate patch.
2011-03-10 04:38:52 +01:00
coldef->typeName = makeTypeNameFromOid(BOOLOID, -1);
1997-04-02 05:51:23 +02:00
coldef->colname = "is_called";
value[i - 1] = BoolGetDatum(false);
1997-04-02 05:51:23 +02:00
break;
}
stmt->tableElts = lappend(stmt->tableElts, coldef);
}
stmt->relation = seq->sequence;
stmt->inhRelations = NIL;
stmt->constraints = NIL;
stmt->options = NIL;
stmt->oncommit = ONCOMMIT_NOOP;
stmt->tablespacename = NULL;
stmt->if_not_exists = seq->if_not_exists;
Implement table partitioning. Table partitioning is like table inheritance and reuses much of the existing infrastructure, but there are some important differences. The parent is called a partitioned table and is always empty; it may not have indexes or non-inherited constraints, since those make no sense for a relation with no data of its own. The children are called partitions and contain all of the actual data. Each partition has an implicit partitioning constraint. Multiple inheritance is not allowed, and partitioning and inheritance can't be mixed. Partitions can't have extra columns and may not allow nulls unless the parent does. Tuples inserted into the parent are automatically routed to the correct partition, so tuple-routing ON INSERT triggers are not needed. Tuple routing isn't yet supported for partitions which are foreign tables, and it doesn't handle updates that cross partition boundaries. Currently, tables can be range-partitioned or list-partitioned. List partitioning is limited to a single column, but range partitioning can involve multiple columns. A partitioning "column" can be an expression. Because table partitioning is less general than table inheritance, it is hoped that it will be easier to reason about properties of partitions, and therefore that this will serve as a better foundation for a variety of possible optimizations, including query planner optimizations. The tuple routing based which this patch does based on the implicit partitioning constraints is an example of this, but it seems likely that many other useful optimizations are also possible. Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat, Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova, Rushabh Lathia, Erik Rijkers, among others. Minor revisions by me.
2016-12-07 19:17:43 +01:00
address = DefineRelation(stmt, RELKIND_SEQUENCE, seq->ownerId, NULL, NULL);
seqoid = address.objectId;
Assert(seqoid != InvalidOid);
rel = table_open(seqoid, AccessExclusiveLock);
1997-04-02 05:51:23 +02:00
tupDesc = RelationGetDescr(rel);
2010-11-17 22:42:18 +01:00
/* now initialize the sequence's data */
tuple = heap_form_tuple(tupDesc, value, null);
fill_seq_with_data(rel, tuple);
/* process OWNED BY if given */
if (owned_by)
process_owned_by(rel, owned_by, seq->for_identity);
2010-11-17 22:42:18 +01:00
table_close(rel, NoLock);
/* fill in pg_sequence */
rel = table_open(SequenceRelationId, RowExclusiveLock);
tupDesc = RelationGetDescr(rel);
memset(pgs_nulls, 0, sizeof(pgs_nulls));
pgs_values[Anum_pg_sequence_seqrelid - 1] = ObjectIdGetDatum(seqoid);
pgs_values[Anum_pg_sequence_seqtypid - 1] = ObjectIdGetDatum(seqform.seqtypid);
pgs_values[Anum_pg_sequence_seqstart - 1] = Int64GetDatumFast(seqform.seqstart);
pgs_values[Anum_pg_sequence_seqincrement - 1] = Int64GetDatumFast(seqform.seqincrement);
pgs_values[Anum_pg_sequence_seqmax - 1] = Int64GetDatumFast(seqform.seqmax);
pgs_values[Anum_pg_sequence_seqmin - 1] = Int64GetDatumFast(seqform.seqmin);
pgs_values[Anum_pg_sequence_seqcache - 1] = Int64GetDatumFast(seqform.seqcache);
pgs_values[Anum_pg_sequence_seqcycle - 1] = BoolGetDatum(seqform.seqcycle);
tuple = heap_form_tuple(tupDesc, pgs_values, pgs_nulls);
CatalogTupleInsert(rel, tuple);
heap_freetuple(tuple);
table_close(rel, RowExclusiveLock);
return address;
2010-11-17 22:42:18 +01:00
}
/*
* Reset a sequence to its initial value.
*
* The change is made transactionally, so that on failure of the current
* transaction, the sequence will be restored to its previous state.
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
* We do that by creating a whole new relfilenumber for the sequence; so this
2010-11-17 22:42:18 +01:00
* works much like the rewriting forms of ALTER TABLE.
*
* Caller is assumed to have acquired AccessExclusiveLock on the sequence,
* which must not be released until end of transaction. Caller is also
* responsible for permissions checking.
*/
void
ResetSequence(Oid seq_relid)
{
Relation seq_rel;
SeqTable elm;
Form_pg_sequence_data seq;
2010-11-17 22:42:18 +01:00
Buffer buf;
HeapTupleData seqdatatuple;
2010-11-17 22:42:18 +01:00
HeapTuple tuple;
HeapTuple pgstuple;
Form_pg_sequence pgsform;
int64 startv;
2010-11-17 22:42:18 +01:00
/*
* Read the old sequence. This does a bit more work than really
* necessary, but it's simple, and we do want to double-check that it's
* indeed a sequence.
*/
init_sequence(seq_relid, &elm, &seq_rel);
(void) read_seq_tuple(seq_rel, &buf, &seqdatatuple);
pgstuple = SearchSysCache1(SEQRELID, ObjectIdGetDatum(seq_relid));
if (!HeapTupleIsValid(pgstuple))
elog(ERROR, "cache lookup failed for sequence %u", seq_relid);
pgsform = (Form_pg_sequence) GETSTRUCT(pgstuple);
startv = pgsform->seqstart;
ReleaseSysCache(pgstuple);
2010-11-17 22:42:18 +01:00
/*
* Copy the existing sequence tuple.
*/
tuple = heap_copytuple(&seqdatatuple);
2010-11-17 22:42:18 +01:00
/* Now we're done with the old page */
UnlockReleaseBuffer(buf);
/*
* Modify the copied tuple to execute the restart (compare the RESTART
* action in AlterSequence)
*/
seq = (Form_pg_sequence_data) GETSTRUCT(tuple);
seq->last_value = startv;
2010-11-17 22:42:18 +01:00
seq->is_called = false;
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
seq->log_cnt = 0;
2010-11-17 22:42:18 +01:00
/*
* Create a new storage file for the sequence.
2010-11-17 22:42:18 +01:00
*/
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
RelationSetNewRelfilenumber(seq_rel, seq_rel->rd_rel->relpersistence);
/*
* Ensure sequence's relfrozenxid is at 0, since it won't contain any
* unfrozen XIDs. Same with relminmxid, since a sequence will never
* contain multixacts.
*/
Assert(seq_rel->rd_rel->relfrozenxid == InvalidTransactionId);
Assert(seq_rel->rd_rel->relminmxid == InvalidMultiXactId);
2010-11-17 22:42:18 +01:00
/*
* Insert the modified tuple into the new storage file.
*/
fill_seq_with_data(seq_rel, tuple);
/* Clear local cache so that we don't think we have cached numbers */
/* Note that we do not change the currval() state */
elm->cached = elm->last;
relation_close(seq_rel, NoLock);
}
/*
* Initialize a sequence's relation with the specified tuple as content
*
* This handles unlogged sequences by writing to both the main and the init
* fork as necessary.
2010-11-17 22:42:18 +01:00
*/
static void
fill_seq_with_data(Relation rel, HeapTuple tuple)
{
fill_seq_fork_with_data(rel, tuple, MAIN_FORKNUM);
if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
{
SMgrRelation srel;
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
srel = smgropen(rel->rd_locator, InvalidBackendId);
smgrcreate(srel, INIT_FORKNUM, false);
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
log_smgrcreate(&rel->rd_locator, INIT_FORKNUM);
fill_seq_fork_with_data(rel, tuple, INIT_FORKNUM);
FlushRelationBuffers(rel);
smgrclose(srel);
}
}
/*
* Initialize a sequence's relation fork with the specified tuple as content
*/
static void
fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)
2010-11-17 22:42:18 +01:00
{
Buffer buf;
Page page;
sequence_magic *sm;
OffsetNumber offnum;
2010-11-17 22:42:18 +01:00
/* Initialize first page of relation with special magic number */
buf = ExtendBufferedRel(EB_REL(rel), forkNum, NULL,
EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
Assert(BufferGetBlockNumber(buf) == 0);
page = BufferGetPage(buf);
PageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic));
1997-04-02 05:51:23 +02:00
sm = (sequence_magic *) PageGetSpecialPointer(page);
sm->magic = SEQ_MAGIC;
2010-11-17 22:42:18 +01:00
/* Now insert sequence tuple */
/*
* Since VACUUM does not process sequences, we have to force the tuple to
* have xmin = FrozenTransactionId now. Otherwise it would become
* invisible to SELECTs after 2G transactions. It is okay to do this
* because if the current transaction aborts, no other xact will ever
* examine the sequence tuple anyway.
*/
HeapTupleHeaderSetXmin(tuple->t_data, FrozenTransactionId);
HeapTupleHeaderSetXminFrozen(tuple->t_data);
HeapTupleHeaderSetCmin(tuple->t_data, FirstCommandId);
HeapTupleHeaderSetXmax(tuple->t_data, InvalidTransactionId);
tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
ItemPointerSet(&tuple->t_data->t_ctid, 0, FirstOffsetNumber);
Reconsider when to wait for WAL flushes/syncrep during commit. Up to now RecordTransactionCommit() waited for WAL to be flushed (if synchronous_commit != off) and to be synchronously replicated (if enabled), even if a transaction did not have a xid assigned. The primary reason for that is that sequence's nextval() did not assign a xid, but are worthwhile to wait for on commit. This can be problematic because sometimes read only transactions do write WAL, e.g. HOT page prune records. That then could lead to read only transactions having to wait during commit. Not something people expect in a read only transaction. This lead to such strange symptoms as backends being seemingly stuck during connection establishment when all synchronous replicas are down. Especially annoying when said stuck connection is the standby trying to reconnect to allow syncrep again... This behavior also is involved in a rather complicated <= 9.4 bug where the transaction started by catchup interrupt processing waited for syncrep using latches, but didn't get the wakeup because it was already running inside the same overloaded signal handler. Fix the issue here doesn't properly solve that issue, merely papers over the problems. In 9.5 catchup interrupts aren't processed out of signal handlers anymore. To fix all this, make nextval() acquire a top level xid, and only wait for transaction commit if a transaction both acquired a xid and emitted WAL records. If only a xid has been assigned we don't uselessly want to wait just because of writes to temporary/unlogged tables; if only WAL has been written we don't want to wait just because of HOT prunes. The xid assignment in nextval() is unlikely to cause overhead in real-world workloads. For one it only happens SEQ_LOG_VALS/32 values anyway, for another only usage of nextval() without using the result in an insert or similar is affected. Discussion: 20150223165359.GF30784@awork2.anarazel.de, 369698E947874884A77849D8FE3680C2@maumau, 5CF4ABBA67674088B3941894E22A0D25@maumau Per complaint from maumau and Thom Brown Backpatch all the way back; 9.0 doesn't have syncrep, but it seems better to be consistent behavior across all maintained branches.
2015-02-26 12:50:07 +01:00
/* check the comment above nextval_internal()'s equivalent call. */
if (RelationNeedsWAL(rel))
GetTopTransactionId();
START_CRIT_SECTION();
MarkBufferDirty(buf);
offnum = PageAddItem(page, (Item) tuple->t_data, tuple->t_len,
InvalidOffsetNumber, false, false);
if (offnum != FirstOffsetNumber)
elog(ERROR, "failed to add sequence tuple to page");
/* XLOG stuff */
if (RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM)
{
xl_seq_rec xlrec;
XLogRecPtr recptr;
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
xlrec.locator = rel->rd_locator;
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
XLogRegisterData((char *) tuple->t_data, tuple->t_len);
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
PageSetLSN(page, recptr);
}
END_CRIT_SECTION();
UnlockReleaseBuffer(buf);
1997-04-02 05:51:23 +02:00
}
/*
* AlterSequence
*
* Modify the definition of a sequence relation
*/
ObjectAddress
AlterSequence(ParseState *pstate, AlterSeqStmt *stmt)
{
Oid relid;
SeqTable elm;
Relation seqrel;
Buffer buf;
HeapTupleData datatuple;
Form_pg_sequence seqform;
Form_pg_sequence_data newdataform;
bool need_seq_rewrite;
List *owned_by;
ObjectAddress address;
Relation rel;
HeapTuple seqtuple;
HeapTuple newdatatuple;
/* Open and lock sequence, and check for ownership along the way. */
relid = RangeVarGetRelidExtended(stmt->sequence,
ShareRowExclusiveLock,
stmt->missing_ok ? RVR_MISSING_OK : 0,
RangeVarCallbackOwnsRelation,
NULL);
if (relid == InvalidOid)
{
ereport(NOTICE,
(errmsg("relation \"%s\" does not exist, skipping",
stmt->sequence->relname)));
return InvalidObjectAddress;
}
init_sequence(relid, &elm, &seqrel);
rel = table_open(SequenceRelationId, RowExclusiveLock);
seqtuple = SearchSysCacheCopy1(SEQRELID,
ObjectIdGetDatum(relid));
if (!HeapTupleIsValid(seqtuple))
elog(ERROR, "cache lookup failed for sequence %u",
relid);
seqform = (Form_pg_sequence) GETSTRUCT(seqtuple);
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
/* lock page's buffer and read tuple into new sequence structure */
(void) read_seq_tuple(seqrel, &buf, &datatuple);
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
/* copy the existing sequence data tuple, so it can be modified locally */
newdatatuple = heap_copytuple(&datatuple);
newdataform = (Form_pg_sequence_data) GETSTRUCT(newdatatuple);
UnlockReleaseBuffer(buf);
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
/* Check and set new values */
init_params(pstate, stmt->options, stmt->for_identity, false,
seqform, newdataform,
&need_seq_rewrite, &owned_by);
/* Clear local cache so that we don't think we have cached numbers */
/* Note that we do not change the currval() state */
elm->cached = elm->last;
/* If needed, rewrite the sequence relation itself */
if (need_seq_rewrite)
{
/* check the comment above nextval_internal()'s equivalent call. */
if (RelationNeedsWAL(seqrel))
GetTopTransactionId();
Reconsider when to wait for WAL flushes/syncrep during commit. Up to now RecordTransactionCommit() waited for WAL to be flushed (if synchronous_commit != off) and to be synchronously replicated (if enabled), even if a transaction did not have a xid assigned. The primary reason for that is that sequence's nextval() did not assign a xid, but are worthwhile to wait for on commit. This can be problematic because sometimes read only transactions do write WAL, e.g. HOT page prune records. That then could lead to read only transactions having to wait during commit. Not something people expect in a read only transaction. This lead to such strange symptoms as backends being seemingly stuck during connection establishment when all synchronous replicas are down. Especially annoying when said stuck connection is the standby trying to reconnect to allow syncrep again... This behavior also is involved in a rather complicated <= 9.4 bug where the transaction started by catchup interrupt processing waited for syncrep using latches, but didn't get the wakeup because it was already running inside the same overloaded signal handler. Fix the issue here doesn't properly solve that issue, merely papers over the problems. In 9.5 catchup interrupts aren't processed out of signal handlers anymore. To fix all this, make nextval() acquire a top level xid, and only wait for transaction commit if a transaction both acquired a xid and emitted WAL records. If only a xid has been assigned we don't uselessly want to wait just because of writes to temporary/unlogged tables; if only WAL has been written we don't want to wait just because of HOT prunes. The xid assignment in nextval() is unlikely to cause overhead in real-world workloads. For one it only happens SEQ_LOG_VALS/32 values anyway, for another only usage of nextval() without using the result in an insert or similar is affected. Discussion: 20150223165359.GF30784@awork2.anarazel.de, 369698E947874884A77849D8FE3680C2@maumau, 5CF4ABBA67674088B3941894E22A0D25@maumau Per complaint from maumau and Thom Brown Backpatch all the way back; 9.0 doesn't have syncrep, but it seems better to be consistent behavior across all maintained branches.
2015-02-26 12:50:07 +01:00
/*
* Create a new storage file for the sequence, making the state
* changes transactional.
*/
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence);
/*
* Ensure sequence's relfrozenxid is at 0, since it won't contain any
* unfrozen XIDs. Same with relminmxid, since a sequence will never
* contain multixacts.
*/
Assert(seqrel->rd_rel->relfrozenxid == InvalidTransactionId);
Assert(seqrel->rd_rel->relminmxid == InvalidMultiXactId);
/*
* Insert the modified tuple into the new storage file.
*/
fill_seq_with_data(seqrel, newdatatuple);
}
/* process OWNED BY if given */
if (owned_by)
process_owned_by(seqrel, owned_by, stmt->for_identity);
/* update the pg_sequence tuple (we could skip this in some cases...) */
CatalogTupleUpdate(rel, &seqtuple->t_self, seqtuple);
InvokeObjectPostAlterHook(RelationRelationId, relid, 0);
ObjectAddressSet(address, RelationRelationId, relid);
table_close(rel, RowExclusiveLock);
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
relation_close(seqrel, NoLock);
return address;
}
void
SequenceChangePersistence(Oid relid, char newrelpersistence)
{
SeqTable elm;
Relation seqrel;
Buffer buf;
HeapTupleData seqdatatuple;
init_sequence(relid, &elm, &seqrel);
/* check the comment above nextval_internal()'s equivalent call. */
if (RelationNeedsWAL(seqrel))
GetTopTransactionId();
(void) read_seq_tuple(seqrel, &buf, &seqdatatuple);
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
RelationSetNewRelfilenumber(seqrel, newrelpersistence);
fill_seq_with_data(seqrel, &seqdatatuple);
UnlockReleaseBuffer(buf);
relation_close(seqrel, NoLock);
}
void
DeleteSequenceTuple(Oid relid)
{
Relation rel;
HeapTuple tuple;
rel = table_open(SequenceRelationId, RowExclusiveLock);
tuple = SearchSysCache1(SEQRELID, ObjectIdGetDatum(relid));
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for sequence %u", relid);
CatalogTupleDelete(rel, &tuple->t_self);
ReleaseSysCache(tuple);
table_close(rel, RowExclusiveLock);
}
1997-04-02 05:51:23 +02:00
/*
* Note: nextval with a text argument is no longer exported as a pg_proc
* entry, but we keep it around to ease porting of C code that may have
* called the function directly.
*/
Datum
nextval(PG_FUNCTION_ARGS)
1997-04-02 05:51:23 +02:00
{
text *seqin = PG_GETARG_TEXT_PP(0);
RangeVar *sequence;
Oid relid;
sequence = makeRangeVarFromNameList(textToQualifiedNameList(seqin));
/*
* XXX: This is not safe in the presence of concurrent DDL, but acquiring
* a lock here is more expensive than letting nextval_internal do it,
* since the latter maintains a cache that keeps us from hitting the lock
* manager more than once per transaction. It's not clear whether the
* performance penalty is material in practice, but for now, we do it this
* way.
*/
Improve table locking behavior in the face of current DDL. In the previous coding, callers were faced with an awkward choice: look up the name, do permissions checks, and then lock the table; or look up the name, lock the table, and then do permissions checks. The first choice was wrong because the results of the name lookup and permissions checks might be out-of-date by the time the table lock was acquired, while the second allowed a user with no privileges to interfere with access to a table by users who do have privileges (e.g. if a malicious backend queues up for an AccessExclusiveLock on a table on which AccessShareLock is already held, further attempts to access the table will be blocked until the AccessExclusiveLock is obtained and the malicious backend's transaction rolls back). To fix, allow callers of RangeVarGetRelid() to pass a callback which gets executed after performing the name lookup but before acquiring the relation lock. If the name lookup is retried (because invalidation messages are received), the callback will be re-executed as well, so we get the best of both worlds. RangeVarGetRelid() is renamed to RangeVarGetRelidExtended(); callers not wishing to supply a callback can continue to invoke it as RangeVarGetRelid(), which is now a macro. Since the only one caller that uses nowait = true now passes a callback anyway, the RangeVarGetRelid() macro defaults nowait as well. The callback can also be used for supplemental locking - for example, REINDEX INDEX needs to acquire the table lock before the index lock to reduce deadlock possibilities. There's a lot more work to be done here to fix all the cases where this can be a problem, but this commit provides the general infrastructure and fixes the following specific cases: REINDEX INDEX, REINDEX TABLE, LOCK TABLE, and and DROP TABLE/INDEX/SEQUENCE/VIEW/FOREIGN TABLE. Per discussion with Noah Misch and Alvaro Herrera.
2011-11-30 16:12:27 +01:00
relid = RangeVarGetRelid(sequence, NoLock, false);
PG_RETURN_INT64(nextval_internal(relid, true));
}
Datum
nextval_oid(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
PG_RETURN_INT64(nextval_internal(relid, true));
}
int64
nextval_internal(Oid relid, bool check_permissions)
{
1997-04-02 05:51:23 +02:00
SeqTable elm;
Relation seqrel;
1997-04-02 05:51:23 +02:00
Buffer buf;
Page page;
HeapTuple pgstuple;
Form_pg_sequence pgsform;
HeapTupleData seqdatatuple;
Form_pg_sequence_data seq;
int64 incby,
1997-04-02 05:51:23 +02:00
maxv,
minv,
cache,
log,
fetch,
last;
int64 result,
1997-04-02 05:51:23 +02:00
next,
rescnt = 0;
bool cycle;
bool logit = false;
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
/* open and lock sequence */
init_sequence(relid, &elm, &seqrel);
if (check_permissions &&
pg_class_aclcheck(elm->relid, GetUserId(),
ACL_USAGE | ACL_UPDATE) != ACLCHECK_OK)
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("permission denied for sequence %s",
RelationGetRelationName(seqrel))));
/* read-only transactions may only modify temp sequences */
if (!seqrel->rd_islocaltemp)
PreventCommandIfReadOnly("nextval()");
/*
* Forbid this during parallel operation because, to make it work, the
* cooperating backends would need to share the backend-local cached
* sequence information. Currently, we don't support that.
*/
PreventCommandIfParallelMode("nextval()");
1997-04-02 05:51:23 +02:00
if (elm->last != elm->cached) /* some numbers were cached */
{
Assert(elm->last_valid);
Assert(elm->increment != 0);
1997-06-02 13:22:52 +02:00
elm->last += elm->increment;
relation_close(seqrel, NoLock);
last_used_seq = elm;
return elm->last;
1997-04-02 05:51:23 +02:00
}
pgstuple = SearchSysCache1(SEQRELID, ObjectIdGetDatum(relid));
if (!HeapTupleIsValid(pgstuple))
elog(ERROR, "cache lookup failed for sequence %u", relid);
pgsform = (Form_pg_sequence) GETSTRUCT(pgstuple);
incby = pgsform->seqincrement;
maxv = pgsform->seqmax;
minv = pgsform->seqmin;
cache = pgsform->seqcache;
cycle = pgsform->seqcycle;
ReleaseSysCache(pgstuple);
/* lock page' buffer and read tuple */
seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
page = BufferGetPage(buf);
elm->increment = incby;
last = next = result = seq->last_value;
fetch = cache;
log = seq->log_cnt;
if (!seq->is_called)
{
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
rescnt++; /* return last_value if not is_called */
fetch--;
}
/*
* Decide whether we should emit a WAL log record. If so, force up the
* fetch count to grab SEQ_LOG_VALS more values than we actually need to
* cache. (These will then be usable without logging.)
*
* If this is the first nextval after a checkpoint, we must force a new
* WAL record to be written anyway, else replay starting from the
* checkpoint would fail to advance the sequence past the logged values.
* In this case we may as well fetch extra values.
*/
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
if (log < fetch || !seq->is_called)
{
/* forced log to satisfy local demand for values */
fetch = log = fetch + SEQ_LOG_VALS;
logit = true;
}
else
{
XLogRecPtr redoptr = GetRedoRecPtr();
if (PageGetLSN(page) <= redoptr)
{
/* last update of seq was before checkpoint */
fetch = log = fetch + SEQ_LOG_VALS;
logit = true;
}
}
while (fetch) /* try to fetch cache [+ log ] numbers */
1997-04-02 05:51:23 +02:00
{
/*
1997-04-02 05:51:23 +02:00
* Check MAXVALUE for ascending sequences and MINVALUE for descending
* sequences
*/
if (incby > 0)
{
/* ascending sequence */
1997-04-02 05:51:23 +02:00
if ((maxv >= 0 && next > maxv - incby) ||
(maxv < 0 && next + incby > maxv))
{
if (rescnt > 0)
break; /* stop fetching */
if (!cycle)
ereport(ERROR,
(errcode(ERRCODE_SEQUENCE_GENERATOR_LIMIT_EXCEEDED),
errmsg("nextval: reached maximum value of sequence \"%s\" (%lld)",
RelationGetRelationName(seqrel),
(long long) maxv)));
1997-06-02 13:22:52 +02:00
next = minv;
}
else
1997-04-02 05:51:23 +02:00
next += incby;
}
else
{
/* descending sequence */
1997-04-02 05:51:23 +02:00
if ((minv < 0 && next < minv - incby) ||
(minv >= 0 && next + incby < minv))
{
1997-04-02 05:51:23 +02:00
if (rescnt > 0)
break; /* stop fetching */
if (!cycle)
ereport(ERROR,
(errcode(ERRCODE_SEQUENCE_GENERATOR_LIMIT_EXCEEDED),
errmsg("nextval: reached minimum value of sequence \"%s\" (%lld)",
RelationGetRelationName(seqrel),
(long long) minv)));
1997-04-02 05:51:23 +02:00
next = maxv;
}
else
next += incby;
}
fetch--;
if (rescnt < cache)
{
log--;
rescnt++;
last = next;
if (rescnt == 1) /* if it's first result - */
result = next; /* it's what to return */
}
1997-04-02 05:51:23 +02:00
}
log -= fetch; /* adjust for any unfetched numbers */
Assert(log >= 0);
1997-04-02 05:51:23 +02:00
/* save info in local cache */
elm->last = result; /* last returned number */
elm->cached = last; /* last fetched number */
elm->last_valid = true;
last_used_seq = elm;
Reconsider when to wait for WAL flushes/syncrep during commit. Up to now RecordTransactionCommit() waited for WAL to be flushed (if synchronous_commit != off) and to be synchronously replicated (if enabled), even if a transaction did not have a xid assigned. The primary reason for that is that sequence's nextval() did not assign a xid, but are worthwhile to wait for on commit. This can be problematic because sometimes read only transactions do write WAL, e.g. HOT page prune records. That then could lead to read only transactions having to wait during commit. Not something people expect in a read only transaction. This lead to such strange symptoms as backends being seemingly stuck during connection establishment when all synchronous replicas are down. Especially annoying when said stuck connection is the standby trying to reconnect to allow syncrep again... This behavior also is involved in a rather complicated <= 9.4 bug where the transaction started by catchup interrupt processing waited for syncrep using latches, but didn't get the wakeup because it was already running inside the same overloaded signal handler. Fix the issue here doesn't properly solve that issue, merely papers over the problems. In 9.5 catchup interrupts aren't processed out of signal handlers anymore. To fix all this, make nextval() acquire a top level xid, and only wait for transaction commit if a transaction both acquired a xid and emitted WAL records. If only a xid has been assigned we don't uselessly want to wait just because of writes to temporary/unlogged tables; if only WAL has been written we don't want to wait just because of HOT prunes. The xid assignment in nextval() is unlikely to cause overhead in real-world workloads. For one it only happens SEQ_LOG_VALS/32 values anyway, for another only usage of nextval() without using the result in an insert or similar is affected. Discussion: 20150223165359.GF30784@awork2.anarazel.de, 369698E947874884A77849D8FE3680C2@maumau, 5CF4ABBA67674088B3941894E22A0D25@maumau Per complaint from maumau and Thom Brown Backpatch all the way back; 9.0 doesn't have syncrep, but it seems better to be consistent behavior across all maintained branches.
2015-02-26 12:50:07 +01:00
/*
* If something needs to be WAL logged, acquire an xid, so this
* transaction's commit will trigger a WAL flush and wait for syncrep.
* It's sufficient to ensure the toplevel transaction has an xid, no need
* to assign xids subxacts, that'll already trigger an appropriate wait.
Reconsider when to wait for WAL flushes/syncrep during commit. Up to now RecordTransactionCommit() waited for WAL to be flushed (if synchronous_commit != off) and to be synchronously replicated (if enabled), even if a transaction did not have a xid assigned. The primary reason for that is that sequence's nextval() did not assign a xid, but are worthwhile to wait for on commit. This can be problematic because sometimes read only transactions do write WAL, e.g. HOT page prune records. That then could lead to read only transactions having to wait during commit. Not something people expect in a read only transaction. This lead to such strange symptoms as backends being seemingly stuck during connection establishment when all synchronous replicas are down. Especially annoying when said stuck connection is the standby trying to reconnect to allow syncrep again... This behavior also is involved in a rather complicated <= 9.4 bug where the transaction started by catchup interrupt processing waited for syncrep using latches, but didn't get the wakeup because it was already running inside the same overloaded signal handler. Fix the issue here doesn't properly solve that issue, merely papers over the problems. In 9.5 catchup interrupts aren't processed out of signal handlers anymore. To fix all this, make nextval() acquire a top level xid, and only wait for transaction commit if a transaction both acquired a xid and emitted WAL records. If only a xid has been assigned we don't uselessly want to wait just because of writes to temporary/unlogged tables; if only WAL has been written we don't want to wait just because of HOT prunes. The xid assignment in nextval() is unlikely to cause overhead in real-world workloads. For one it only happens SEQ_LOG_VALS/32 values anyway, for another only usage of nextval() without using the result in an insert or similar is affected. Discussion: 20150223165359.GF30784@awork2.anarazel.de, 369698E947874884A77849D8FE3680C2@maumau, 5CF4ABBA67674088B3941894E22A0D25@maumau Per complaint from maumau and Thom Brown Backpatch all the way back; 9.0 doesn't have syncrep, but it seems better to be consistent behavior across all maintained branches.
2015-02-26 12:50:07 +01:00
* (Have to do that here, so we're outside the critical section)
*/
if (logit && RelationNeedsWAL(seqrel))
GetTopTransactionId();
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
/* ready to change the on-disk (or really, in-buffer) tuple */
START_CRIT_SECTION();
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
/*
* We must mark the buffer dirty before doing XLogInsert(); see notes in
* SyncOneBuffer(). However, we don't apply the desired changes just yet.
* This looks like a violation of the buffer update protocol, but it is in
* fact safe because we hold exclusive lock on the buffer. Any other
* process, including a checkpoint, that tries to examine the buffer
* contents will block until we release the lock, and then will see the
* final state that we install below.
*/
MarkBufferDirty(buf);
/* XLOG stuff */
if (logit && RelationNeedsWAL(seqrel))
{
xl_seq_rec xlrec;
XLogRecPtr recptr;
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
/*
* We don't log the current state of the tuple, but rather the state
* as it would appear after "log" more fetches. This lets us skip
* that many future WAL records, at the cost that we lose those
* sequence values if we crash.
*/
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
/* set values that will be saved in xlog */
seq->last_value = next;
seq->is_called = true;
seq->log_cnt = 0;
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
xlrec.locator = seqrel->rd_locator;
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len);
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
PageSetLSN(page, recptr);
}
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
/* Now update sequence tuple to the intended final state */
seq->last_value = last; /* last fetched number */
seq->is_called = true;
seq->log_cnt = log; /* how much is logged */
END_CRIT_SECTION();
UnlockReleaseBuffer(buf);
relation_close(seqrel, NoLock);
return result;
1997-04-02 05:51:23 +02:00
}
Datum
currval_oid(PG_FUNCTION_ARGS)
1997-04-02 05:51:23 +02:00
{
Oid relid = PG_GETARG_OID(0);
int64 result;
1997-04-02 05:51:23 +02:00
SeqTable elm;
Relation seqrel;
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
/* open and lock sequence */
init_sequence(relid, &elm, &seqrel);
if (pg_class_aclcheck(elm->relid, GetUserId(),
ACL_SELECT | ACL_USAGE) != ACLCHECK_OK)
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("permission denied for sequence %s",
RelationGetRelationName(seqrel))));
if (!elm->last_valid)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("currval of sequence \"%s\" is not yet defined in this session",
RelationGetRelationName(seqrel))));
1997-06-02 13:22:52 +02:00
result = elm->last;
relation_close(seqrel, NoLock);
PG_RETURN_INT64(result);
1997-04-02 05:51:23 +02:00
}
Datum
lastval(PG_FUNCTION_ARGS)
{
Relation seqrel;
int64 result;
if (last_used_seq == NULL)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("lastval is not yet defined in this session")));
/* Someone may have dropped the sequence since the last nextval() */
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(last_used_seq->relid)))
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("lastval is not yet defined in this session")));
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
seqrel = lock_and_open_sequence(last_used_seq);
/* nextval() must have already been called for this sequence */
Assert(last_used_seq->last_valid);
if (pg_class_aclcheck(last_used_seq->relid, GetUserId(),
ACL_SELECT | ACL_USAGE) != ACLCHECK_OK)
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("permission denied for sequence %s",
RelationGetRelationName(seqrel))));
result = last_used_seq->last;
relation_close(seqrel, NoLock);
PG_RETURN_INT64(result);
}
/*
* Main internal procedure that handles 2 & 3 arg forms of SETVAL.
*
* Note that the 3 arg version (which sets the is_called flag) is
* only for use in pg_dump, and setting the is_called flag may not
* work if multiple users are attached to the database and referencing
* the sequence (unlikely if pg_dump is restoring it).
*
* It is necessary to have the 3 arg version so that pg_dump can
* restore the state of a sequence exactly during data-only restores -
* it is the only way to clear the is_called flag in an existing
* sequence.
*/
2000-10-16 19:08:11 +02:00
static void
do_setval(Oid relid, int64 next, bool iscalled)
{
SeqTable elm;
Relation seqrel;
Buffer buf;
HeapTupleData seqdatatuple;
Form_pg_sequence_data seq;
HeapTuple pgstuple;
Form_pg_sequence pgsform;
int64 maxv,
minv;
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
/* open and lock sequence */
init_sequence(relid, &elm, &seqrel);
if (pg_class_aclcheck(elm->relid, GetUserId(), ACL_UPDATE) != ACLCHECK_OK)
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("permission denied for sequence %s",
RelationGetRelationName(seqrel))));
pgstuple = SearchSysCache1(SEQRELID, ObjectIdGetDatum(relid));
if (!HeapTupleIsValid(pgstuple))
elog(ERROR, "cache lookup failed for sequence %u", relid);
pgsform = (Form_pg_sequence) GETSTRUCT(pgstuple);
maxv = pgsform->seqmax;
minv = pgsform->seqmin;
ReleaseSysCache(pgstuple);
/* read-only transactions may only modify temp sequences */
if (!seqrel->rd_islocaltemp)
PreventCommandIfReadOnly("setval()");
/*
* Forbid this during parallel operation because, to make it work, the
* cooperating backends would need to share the backend-local cached
* sequence information. Currently, we don't support that.
*/
PreventCommandIfParallelMode("setval()");
/* lock page' buffer and read tuple */
seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
if ((next < minv) || (next > maxv))
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
errmsg("setval: value %lld is out of bounds for sequence \"%s\" (%lld..%lld)",
(long long) next, RelationGetRelationName(seqrel),
(long long) minv, (long long) maxv)));
/* Set the currval() state only if iscalled = true */
if (iscalled)
{
elm->last = next; /* last returned number */
elm->last_valid = true;
}
/* In any case, forget any future cached numbers */
elm->cached = elm->last;
Reconsider when to wait for WAL flushes/syncrep during commit. Up to now RecordTransactionCommit() waited for WAL to be flushed (if synchronous_commit != off) and to be synchronously replicated (if enabled), even if a transaction did not have a xid assigned. The primary reason for that is that sequence's nextval() did not assign a xid, but are worthwhile to wait for on commit. This can be problematic because sometimes read only transactions do write WAL, e.g. HOT page prune records. That then could lead to read only transactions having to wait during commit. Not something people expect in a read only transaction. This lead to such strange symptoms as backends being seemingly stuck during connection establishment when all synchronous replicas are down. Especially annoying when said stuck connection is the standby trying to reconnect to allow syncrep again... This behavior also is involved in a rather complicated <= 9.4 bug where the transaction started by catchup interrupt processing waited for syncrep using latches, but didn't get the wakeup because it was already running inside the same overloaded signal handler. Fix the issue here doesn't properly solve that issue, merely papers over the problems. In 9.5 catchup interrupts aren't processed out of signal handlers anymore. To fix all this, make nextval() acquire a top level xid, and only wait for transaction commit if a transaction both acquired a xid and emitted WAL records. If only a xid has been assigned we don't uselessly want to wait just because of writes to temporary/unlogged tables; if only WAL has been written we don't want to wait just because of HOT prunes. The xid assignment in nextval() is unlikely to cause overhead in real-world workloads. For one it only happens SEQ_LOG_VALS/32 values anyway, for another only usage of nextval() without using the result in an insert or similar is affected. Discussion: 20150223165359.GF30784@awork2.anarazel.de, 369698E947874884A77849D8FE3680C2@maumau, 5CF4ABBA67674088B3941894E22A0D25@maumau Per complaint from maumau and Thom Brown Backpatch all the way back; 9.0 doesn't have syncrep, but it seems better to be consistent behavior across all maintained branches.
2015-02-26 12:50:07 +01:00
/* check the comment above nextval_internal()'s equivalent call. */
if (RelationNeedsWAL(seqrel))
GetTopTransactionId();
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
/* ready to change the on-disk (or really, in-buffer) tuple */
START_CRIT_SECTION();
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
seq->last_value = next; /* last fetched number */
seq->is_called = iscalled;
seq->log_cnt = 0;
MarkBufferDirty(buf);
/* XLOG stuff */
if (RelationNeedsWAL(seqrel))
{
xl_seq_rec xlrec;
XLogRecPtr recptr;
Page page = BufferGetPage(buf);
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
xlrec.locator = seqrel->rd_locator;
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len);
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
PageSetLSN(page, recptr);
}
END_CRIT_SECTION();
UnlockReleaseBuffer(buf);
relation_close(seqrel, NoLock);
}
/*
* Implement the 2 arg setval procedure.
* See do_setval for discussion.
*/
Datum
setval_oid(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
int64 next = PG_GETARG_INT64(1);
do_setval(relid, next, true);
PG_RETURN_INT64(next);
}
/*
* Implement the 3 arg setval procedure.
* See do_setval for discussion.
*/
Datum
setval3_oid(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
int64 next = PG_GETARG_INT64(1);
bool iscalled = PG_GETARG_BOOL(2);
do_setval(relid, next, iscalled);
PG_RETURN_INT64(next);
}
/*
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
* Open the sequence and acquire lock if needed
*
* If we haven't touched the sequence already in this transaction,
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
* we need to acquire a lock. We arrange for the lock to
* be owned by the top transaction, so that we don't need to do it
* more than once per xact.
*/
static Relation
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
lock_and_open_sequence(SeqTable seq)
{
LocalTransactionId thislxid = MyProc->lxid;
/* Get the lock if not already held in this xact */
if (seq->lxid != thislxid)
{
ResourceOwner currentOwner;
currentOwner = CurrentResourceOwner;
CurrentResourceOwner = TopTransactionResourceOwner;
LockRelationOid(seq->relid, RowExclusiveLock);
CurrentResourceOwner = currentOwner;
/* Flag that we have a lock in the current xact */
seq->lxid = thislxid;
}
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
/* We now know we have the lock, and can safely open the rel */
return relation_open(seq->relid, NoLock);
}
/*
* Creates the hash table for storing sequence data
*/
static void
create_seq_hashtable(void)
{
HASHCTL ctl;
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(SeqTableData);
seqhashtab = hash_create("Sequence values", 16, &ctl,
Improve hash_create's API for selecting simple-binary-key hash functions. Previously, if you wanted anything besides C-string hash keys, you had to specify a custom hashing function to hash_create(). Nearly all such callers were specifying tag_hash or oid_hash; which is tedious, and rather error-prone, since a caller could easily miss the opportunity to optimize by using hash_uint32 when appropriate. Replace this with a design whereby callers using simple binary-data keys just specify HASH_BLOBS and don't need to mess with specific support functions. hash_create() itself will take care of optimizing when the key size is four bytes. This nets out saving a few hundred bytes of code space, and offers a measurable performance improvement in tidbitmap.c (which was not exploiting the opportunity to use hash_uint32 for its 4-byte keys). There might be some wins elsewhere too, I didn't analyze closely. In future we could look into offering a similar optimized hashing function for 8-byte keys. Under this design that could be done in a centralized and machine-independent fashion, whereas getting it right for keys of platform-dependent sizes would've been notationally painful before. For the moment, the old way still works fine, so as not to break source code compatibility for loadable modules. Eventually we might want to remove tag_hash and friends from the exported API altogether, since there's no real need for them to be explicitly referenced from outside dynahash.c. Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
HASH_ELEM | HASH_BLOBS);
}
/*
* Given a relation OID, open and lock the sequence. p_elm and p_rel are
* output parameters.
*/
static void
init_sequence(Oid relid, SeqTable *p_elm, Relation *p_rel)
1997-04-02 05:51:23 +02:00
{
SeqTable elm;
Relation seqrel;
bool found;
/* Find or create a hash table entry for this sequence */
if (seqhashtab == NULL)
create_seq_hashtable();
elm = (SeqTable) hash_search(seqhashtab, &relid, HASH_ENTER, &found);
/*
* Initialize the new hash table entry if it did not exist already.
*
* NOTE: seqhashtab entries are stored for the life of a backend (unless
* explicitly discarded with DISCARD). If the sequence itself is deleted
* then the entry becomes wasted memory, but it's small enough that this
* should not matter.
*/
if (!found)
1997-04-02 05:51:23 +02:00
{
/* relid already filled in */
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
elm->filenumber = InvalidRelFileNumber;
elm->lxid = InvalidLocalTransactionId;
elm->last_valid = false;
elm->last = elm->cached = 0;
1997-04-02 05:51:23 +02:00
}
/*
* Open the sequence relation.
*/
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
seqrel = lock_and_open_sequence(elm);
if (seqrel->rd_rel->relkind != RELKIND_SEQUENCE)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("\"%s\" is not a sequence",
RelationGetRelationName(seqrel))));
2010-11-17 22:42:18 +01:00
/*
* If the sequence has been transactionally replaced since we last saw it,
* discard any cached-but-unissued values. We do not touch the currval()
* state, however.
*/
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
if (seqrel->rd_rel->relfilenode != elm->filenumber)
2010-11-17 22:42:18 +01:00
{
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
elm->filenumber = seqrel->rd_rel->relfilenode;
2010-11-17 22:42:18 +01:00
elm->cached = elm->last;
}
/* Return results */
*p_elm = elm;
*p_rel = seqrel;
1997-04-02 05:51:23 +02:00
}
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
/*
* Given an opened sequence relation, lock the page buffer and find the tuple
*
* *buf receives the reference to the pinned-and-ex-locked buffer
* *seqdatatuple receives the reference to the sequence tuple proper
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
* (this arg should point to a local variable of type HeapTupleData)
*
* Function's return value points to the data payload of the tuple
*/
static Form_pg_sequence_data
read_seq_tuple(Relation rel, Buffer *buf, HeapTuple seqdatatuple)
1997-04-02 05:51:23 +02:00
{
Page page;
ItemId lp;
sequence_magic *sm;
Form_pg_sequence_data seq;
*buf = ReadBuffer(rel, 0);
LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(*buf);
sm = (sequence_magic *) PageGetSpecialPointer(page);
if (sm->magic != SEQ_MAGIC)
elog(ERROR, "bad magic number in sequence \"%s\": %08X",
RelationGetRelationName(rel), sm->magic);
lp = PageGetItemId(page, FirstOffsetNumber);
Assert(ItemIdIsNormal(lp));
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
/* Note we currently only bother to set these two fields of *seqdatatuple */
seqdatatuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
seqdatatuple->t_len = ItemIdGetLength(lp);
/*
* Previous releases of Postgres neglected to prevent SELECT FOR UPDATE on
* a sequence, which would leave a non-frozen XID in the sequence tuple's
* xmax, which eventually leads to clog access failures or worse. If we
* see this has happened, clean up after it. We treat this like a hint
* bit update, ie, don't bother to WAL-log it, since we can certainly do
* this again if the update gets lost.
*/
Assert(!(seqdatatuple->t_data->t_infomask & HEAP_XMAX_IS_MULTI));
if (HeapTupleHeaderGetRawXmax(seqdatatuple->t_data) != InvalidTransactionId)
{
HeapTupleHeaderSetXmax(seqdatatuple->t_data, InvalidTransactionId);
seqdatatuple->t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
seqdatatuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
MarkBufferDirtyHint(*buf, true);
}
seq = (Form_pg_sequence_data) GETSTRUCT(seqdatatuple);
return seq;
1997-04-02 05:51:23 +02:00
}
/*
* init_params: process the options list of CREATE or ALTER SEQUENCE, and
* store the values into appropriate fields of seqform, for changes that go
* into the pg_sequence catalog, and fields of seqdataform for changes to the
* sequence relation itself. Set *need_seq_rewrite to true if we changed any
* parameters that require rewriting the sequence's relation (interesting for
* ALTER SEQUENCE). Also set *owned_by to any OWNED BY option, or to NIL if
* there is none.
*
* If isInit is true, fill any unspecified options with default values;
* otherwise, do not change existing options that aren't explicitly overridden.
*
* Note: we force a sequence rewrite whenever we change parameters that affect
* generation of future sequence values, even if the seqdataform per se is not
* changed. This allows ALTER SEQUENCE to behave transactionally. Currently,
* the only option that doesn't cause that is OWNED BY. It's *necessary* for
* ALTER SEQUENCE OWNED BY to not rewrite the sequence, because that would
Change internal RelFileNode references to RelFileNumber or RelFileLocator. We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
* break pg_upgrade by causing unwanted changes in the sequence's
* relfilenumber.
*/
1997-04-02 05:51:23 +02:00
static void
init_params(ParseState *pstate, List *options, bool for_identity,
bool isInit,
Form_pg_sequence seqform,
Form_pg_sequence_data seqdataform,
bool *need_seq_rewrite,
List **owned_by)
1997-04-02 05:51:23 +02:00
{
DefElem *as_type = NULL;
DefElem *start_value = NULL;
DefElem *restart_value = NULL;
1997-04-02 05:51:23 +02:00
DefElem *increment_by = NULL;
DefElem *max_value = NULL;
DefElem *min_value = NULL;
DefElem *cache_value = NULL;
DefElem *is_cycled = NULL;
ListCell *option;
bool reset_max_value = false;
bool reset_min_value = false;
*need_seq_rewrite = false;
*owned_by = NIL;
foreach(option, options)
1997-04-02 05:51:23 +02:00
{
DefElem *defel = (DefElem *) lfirst(option);
if (strcmp(defel->defname, "as") == 0)
{
if (as_type)
errorConflictingDefElem(defel, pstate);
as_type = defel;
*need_seq_rewrite = true;
}
else if (strcmp(defel->defname, "increment") == 0)
{
if (increment_by)
errorConflictingDefElem(defel, pstate);
1997-04-02 05:51:23 +02:00
increment_by = defel;
*need_seq_rewrite = true;
}
else if (strcmp(defel->defname, "start") == 0)
{
if (start_value)
errorConflictingDefElem(defel, pstate);
start_value = defel;
*need_seq_rewrite = true;
}
else if (strcmp(defel->defname, "restart") == 0)
{
if (restart_value)
errorConflictingDefElem(defel, pstate);
restart_value = defel;
*need_seq_rewrite = true;
}
else if (strcmp(defel->defname, "maxvalue") == 0)
{
if (max_value)
errorConflictingDefElem(defel, pstate);
1997-04-02 05:51:23 +02:00
max_value = defel;
*need_seq_rewrite = true;
}
else if (strcmp(defel->defname, "minvalue") == 0)
{
if (min_value)
errorConflictingDefElem(defel, pstate);
1997-04-02 05:51:23 +02:00
min_value = defel;
*need_seq_rewrite = true;
}
else if (strcmp(defel->defname, "cache") == 0)
{
if (cache_value)
errorConflictingDefElem(defel, pstate);
1997-04-02 05:51:23 +02:00
cache_value = defel;
*need_seq_rewrite = true;
}
else if (strcmp(defel->defname, "cycle") == 0)
{
if (is_cycled)
errorConflictingDefElem(defel, pstate);
is_cycled = defel;
*need_seq_rewrite = true;
}
else if (strcmp(defel->defname, "owned_by") == 0)
{
if (*owned_by)
errorConflictingDefElem(defel, pstate);
*owned_by = defGetQualifiedName(defel);
}
else if (strcmp(defel->defname, "sequence_name") == 0)
{
/*
* The parser allows this, but it is only for identity columns, in
* which case it is filtered out in parse_utilcmd.c. We only get
* here if someone puts it into a CREATE SEQUENCE.
*/
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("invalid sequence option SEQUENCE NAME"),
parser_errposition(pstate, defel->location)));
}
1997-04-02 05:51:23 +02:00
else
elog(ERROR, "option \"%s\" not recognized",
1997-04-02 05:51:23 +02:00
defel->defname);
}
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
/*
* We must reset log_cnt when isInit or when changing any parameters that
* would affect future nextval allocations.
*/
if (isInit)
seqdataform->log_cnt = 0;
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
/* AS type */
if (as_type != NULL)
{
Oid newtypid = typenameTypeId(pstate, defGetTypeName(as_type));
if (newtypid != INT2OID &&
newtypid != INT4OID &&
newtypid != INT8OID)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
for_identity
? errmsg("identity column type must be smallint, integer, or bigint")
: errmsg("sequence type must be smallint, integer, or bigint")));
if (!isInit)
{
/*
* When changing type and the old sequence min/max values were the
* min/max of the old type, adjust sequence min/max values to
* min/max of new type. (Otherwise, the user chose explicit
* min/max values, which we'll leave alone.)
*/
if ((seqform->seqtypid == INT2OID && seqform->seqmax == PG_INT16_MAX) ||
(seqform->seqtypid == INT4OID && seqform->seqmax == PG_INT32_MAX) ||
(seqform->seqtypid == INT8OID && seqform->seqmax == PG_INT64_MAX))
reset_max_value = true;
if ((seqform->seqtypid == INT2OID && seqform->seqmin == PG_INT16_MIN) ||
(seqform->seqtypid == INT4OID && seqform->seqmin == PG_INT32_MIN) ||
(seqform->seqtypid == INT8OID && seqform->seqmin == PG_INT64_MIN))
reset_min_value = true;
}
seqform->seqtypid = newtypid;
}
else if (isInit)
{
seqform->seqtypid = INT8OID;
}
/* INCREMENT BY */
if (increment_by != NULL)
{
seqform->seqincrement = defGetInt64(increment_by);
if (seqform->seqincrement == 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("INCREMENT must not be zero")));
seqdataform->log_cnt = 0;
}
else if (isInit)
{
seqform->seqincrement = 1;
}
/* CYCLE */
if (is_cycled != NULL)
{
seqform->seqcycle = boolVal(is_cycled->arg);
Assert(BoolIsValid(seqform->seqcycle));
seqdataform->log_cnt = 0;
}
else if (isInit)
{
seqform->seqcycle = false;
}
/* MAXVALUE (null arg means NO MAXVALUE) */
if (max_value != NULL && max_value->arg)
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
{
seqform->seqmax = defGetInt64(max_value);
seqdataform->log_cnt = 0;
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
}
else if (isInit || max_value != NULL || reset_max_value)
{
if (seqform->seqincrement > 0 || reset_max_value)
{
/* ascending seq */
if (seqform->seqtypid == INT2OID)
seqform->seqmax = PG_INT16_MAX;
else if (seqform->seqtypid == INT4OID)
seqform->seqmax = PG_INT32_MAX;
else
seqform->seqmax = PG_INT64_MAX;
}
else
seqform->seqmax = -1; /* descending seq */
seqdataform->log_cnt = 0;
}
1997-04-02 05:51:23 +02:00
/* Validate maximum value. No need to check INT8 as seqmax is an int64 */
if ((seqform->seqtypid == INT2OID && (seqform->seqmax < PG_INT16_MIN || seqform->seqmax > PG_INT16_MAX))
|| (seqform->seqtypid == INT4OID && (seqform->seqmax < PG_INT32_MIN || seqform->seqmax > PG_INT32_MAX)))
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("MAXVALUE (%lld) is out of range for sequence data type %s",
(long long) seqform->seqmax,
format_type_be(seqform->seqtypid))));
/* MINVALUE (null arg means NO MINVALUE) */
if (min_value != NULL && min_value->arg)
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
{
seqform->seqmin = defGetInt64(min_value);
seqdataform->log_cnt = 0;
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
}
else if (isInit || min_value != NULL || reset_min_value)
{
if (seqform->seqincrement < 0 || reset_min_value)
{
/* descending seq */
if (seqform->seqtypid == INT2OID)
seqform->seqmin = PG_INT16_MIN;
else if (seqform->seqtypid == INT4OID)
seqform->seqmin = PG_INT32_MIN;
else
seqform->seqmin = PG_INT64_MIN;
}
else
seqform->seqmin = 1; /* ascending seq */
seqdataform->log_cnt = 0;
}
/* Validate minimum value. No need to check INT8 as seqmin is an int64 */
if ((seqform->seqtypid == INT2OID && (seqform->seqmin < PG_INT16_MIN || seqform->seqmin > PG_INT16_MAX))
|| (seqform->seqtypid == INT4OID && (seqform->seqmin < PG_INT32_MIN || seqform->seqmin > PG_INT32_MAX)))
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("MINVALUE (%lld) is out of range for sequence data type %s",
(long long) seqform->seqmin,
format_type_be(seqform->seqtypid))));
/* crosscheck min/max */
if (seqform->seqmin >= seqform->seqmax)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("MINVALUE (%lld) must be less than MAXVALUE (%lld)",
(long long) seqform->seqmin,
(long long) seqform->seqmax)));
/* START WITH */
if (start_value != NULL)
{
seqform->seqstart = defGetInt64(start_value);
}
else if (isInit)
{
if (seqform->seqincrement > 0)
seqform->seqstart = seqform->seqmin; /* ascending seq */
else
seqform->seqstart = seqform->seqmax; /* descending seq */
}
/* crosscheck START */
if (seqform->seqstart < seqform->seqmin)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("START value (%lld) cannot be less than MINVALUE (%lld)",
(long long) seqform->seqstart,
(long long) seqform->seqmin)));
if (seqform->seqstart > seqform->seqmax)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("START value (%lld) cannot be greater than MAXVALUE (%lld)",
(long long) seqform->seqstart,
(long long) seqform->seqmax)));
/* RESTART [WITH] */
if (restart_value != NULL)
{
if (restart_value->arg != NULL)
seqdataform->last_value = defGetInt64(restart_value);
else
seqdataform->last_value = seqform->seqstart;
seqdataform->is_called = false;
seqdataform->log_cnt = 0;
}
else if (isInit)
{
seqdataform->last_value = seqform->seqstart;
seqdataform->is_called = false;
}
/* crosscheck RESTART (or current value, if changing MIN/MAX) */
if (seqdataform->last_value < seqform->seqmin)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("RESTART value (%lld) cannot be less than MINVALUE (%lld)",
(long long) seqdataform->last_value,
(long long) seqform->seqmin)));
if (seqdataform->last_value > seqform->seqmax)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("RESTART value (%lld) cannot be greater than MAXVALUE (%lld)",
(long long) seqdataform->last_value,
(long long) seqform->seqmax)));
/* CACHE */
if (cache_value != NULL)
{
seqform->seqcache = defGetInt64(cache_value);
if (seqform->seqcache <= 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("CACHE (%lld) must be greater than zero",
(long long) seqform->seqcache)));
seqdataform->log_cnt = 0;
}
else if (isInit)
{
seqform->seqcache = 1;
}
1997-04-02 05:51:23 +02:00
}
/*
* Process an OWNED BY option for CREATE/ALTER SEQUENCE
*
* Ownership permissions on the sequence are already checked,
* but if we are establishing a new owned-by dependency, we must
* enforce that the referenced table has the same owner and namespace
* as the sequence.
*/
static void
process_owned_by(Relation seqrel, List *owned_by, bool for_identity)
{
DependencyType deptype;
int nnames;
Relation tablerel;
AttrNumber attnum;
deptype = for_identity ? DEPENDENCY_INTERNAL : DEPENDENCY_AUTO;
nnames = list_length(owned_by);
Assert(nnames > 0);
if (nnames == 1)
{
/* Must be OWNED BY NONE */
if (strcmp(strVal(linitial(owned_by)), "none") != 0)
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("invalid OWNED BY option"),
errhint("Specify OWNED BY table.column or OWNED BY NONE.")));
tablerel = NULL;
attnum = 0;
}
else
{
List *relname;
char *attrname;
RangeVar *rel;
/* Separate relname and attr name */
relname = list_copy_head(owned_by, nnames - 1);
attrname = strVal(llast(owned_by));
/* Open and lock rel to ensure it won't go away meanwhile */
rel = makeRangeVarFromNameList(relname);
tablerel = relation_openrv(rel, AccessShareLock);
/* Must be a regular or foreign table */
if (!(tablerel->rd_rel->relkind == RELKIND_RELATION ||
Implement table partitioning. Table partitioning is like table inheritance and reuses much of the existing infrastructure, but there are some important differences. The parent is called a partitioned table and is always empty; it may not have indexes or non-inherited constraints, since those make no sense for a relation with no data of its own. The children are called partitions and contain all of the actual data. Each partition has an implicit partitioning constraint. Multiple inheritance is not allowed, and partitioning and inheritance can't be mixed. Partitions can't have extra columns and may not allow nulls unless the parent does. Tuples inserted into the parent are automatically routed to the correct partition, so tuple-routing ON INSERT triggers are not needed. Tuple routing isn't yet supported for partitions which are foreign tables, and it doesn't handle updates that cross partition boundaries. Currently, tables can be range-partitioned or list-partitioned. List partitioning is limited to a single column, but range partitioning can involve multiple columns. A partitioning "column" can be an expression. Because table partitioning is less general than table inheritance, it is hoped that it will be easier to reason about properties of partitions, and therefore that this will serve as a better foundation for a variety of possible optimizations, including query planner optimizations. The tuple routing based which this patch does based on the implicit partitioning constraints is an example of this, but it seems likely that many other useful optimizations are also possible. Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat, Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova, Rushabh Lathia, Erik Rijkers, among others. Minor revisions by me.
2016-12-07 19:17:43 +01:00
tablerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
tablerel->rd_rel->relkind == RELKIND_VIEW ||
Implement table partitioning. Table partitioning is like table inheritance and reuses much of the existing infrastructure, but there are some important differences. The parent is called a partitioned table and is always empty; it may not have indexes or non-inherited constraints, since those make no sense for a relation with no data of its own. The children are called partitions and contain all of the actual data. Each partition has an implicit partitioning constraint. Multiple inheritance is not allowed, and partitioning and inheritance can't be mixed. Partitions can't have extra columns and may not allow nulls unless the parent does. Tuples inserted into the parent are automatically routed to the correct partition, so tuple-routing ON INSERT triggers are not needed. Tuple routing isn't yet supported for partitions which are foreign tables, and it doesn't handle updates that cross partition boundaries. Currently, tables can be range-partitioned or list-partitioned. List partitioning is limited to a single column, but range partitioning can involve multiple columns. A partitioning "column" can be an expression. Because table partitioning is less general than table inheritance, it is hoped that it will be easier to reason about properties of partitions, and therefore that this will serve as a better foundation for a variety of possible optimizations, including query planner optimizations. The tuple routing based which this patch does based on the implicit partitioning constraints is an example of this, but it seems likely that many other useful optimizations are also possible. Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat, Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova, Rushabh Lathia, Erik Rijkers, among others. Minor revisions by me.
2016-12-07 19:17:43 +01:00
tablerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("sequence cannot be owned by relation \"%s\"",
RelationGetRelationName(tablerel)),
errdetail_relkind_not_supported(tablerel->rd_rel->relkind)));
/* We insist on same owner and schema */
if (seqrel->rd_rel->relowner != tablerel->rd_rel->relowner)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
2006-10-06 19:14:01 +02:00
errmsg("sequence must have same owner as table it is linked to")));
if (RelationGetNamespace(seqrel) != RelationGetNamespace(tablerel))
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
2006-10-06 19:14:01 +02:00
errmsg("sequence must be in same schema as table it is linked to")));
/* Now, fetch the attribute number from the system cache */
attnum = get_attnum(RelationGetRelid(tablerel), attrname);
if (attnum == InvalidAttrNumber)
ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_COLUMN),
errmsg("column \"%s\" of relation \"%s\" does not exist",
attrname, RelationGetRelationName(tablerel))));
}
/*
* Catch user explicitly running OWNED BY on identity sequence.
*/
if (deptype == DEPENDENCY_AUTO)
{
Oid tableId;
int32 colId;
if (sequenceIsOwned(RelationGetRelid(seqrel), DEPENDENCY_INTERNAL, &tableId, &colId))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot change ownership of identity sequence"),
errdetail("Sequence \"%s\" is linked to table \"%s\".",
RelationGetRelationName(seqrel),
get_rel_name(tableId))));
}
/*
* OK, we are ready to update pg_depend. First remove any existing
* dependencies for the sequence, then optionally add a new one.
*/
deleteDependencyRecordsForClass(RelationRelationId, RelationGetRelid(seqrel),
RelationRelationId, deptype);
if (tablerel)
{
ObjectAddress refobject,
depobject;
refobject.classId = RelationRelationId;
refobject.objectId = RelationGetRelid(tablerel);
refobject.objectSubId = attnum;
depobject.classId = RelationRelationId;
depobject.objectId = RelationGetRelid(seqrel);
depobject.objectSubId = 0;
recordDependencyOn(&depobject, &refobject, deptype);
}
/* Done, but hold lock until commit */
if (tablerel)
relation_close(tablerel, NoLock);
}
/*
* Return sequence parameters in a list of the form created by the parser.
*/
List *
sequence_options(Oid relid)
{
HeapTuple pgstuple;
Form_pg_sequence pgsform;
List *options = NIL;
pgstuple = SearchSysCache1(SEQRELID, relid);
if (!HeapTupleIsValid(pgstuple))
elog(ERROR, "cache lookup failed for sequence %u", relid);
pgsform = (Form_pg_sequence) GETSTRUCT(pgstuple);
/* Use makeFloat() for 64-bit integers, like gram.y does. */
options = lappend(options,
makeDefElem("cache", (Node *) makeFloat(psprintf(INT64_FORMAT, pgsform->seqcache)), -1));
options = lappend(options,
makeDefElem("cycle", (Node *) makeBoolean(pgsform->seqcycle), -1));
options = lappend(options,
makeDefElem("increment", (Node *) makeFloat(psprintf(INT64_FORMAT, pgsform->seqincrement)), -1));
options = lappend(options,
makeDefElem("maxvalue", (Node *) makeFloat(psprintf(INT64_FORMAT, pgsform->seqmax)), -1));
options = lappend(options,
makeDefElem("minvalue", (Node *) makeFloat(psprintf(INT64_FORMAT, pgsform->seqmin)), -1));
options = lappend(options,
makeDefElem("start", (Node *) makeFloat(psprintf(INT64_FORMAT, pgsform->seqstart)), -1));
ReleaseSysCache(pgstuple);
return options;
}
/*
* Return sequence parameters (formerly for use by information schema)
*/
Datum
pg_sequence_parameters(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
TupleDesc tupdesc;
Datum values[7];
bool isnull[7];
HeapTuple pgstuple;
Form_pg_sequence pgsform;
if (pg_class_aclcheck(relid, GetUserId(), ACL_SELECT | ACL_UPDATE | ACL_USAGE) != ACLCHECK_OK)
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("permission denied for sequence %s",
get_rel_name(relid))));
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
memset(isnull, 0, sizeof(isnull));
pgstuple = SearchSysCache1(SEQRELID, relid);
if (!HeapTupleIsValid(pgstuple))
elog(ERROR, "cache lookup failed for sequence %u", relid);
pgsform = (Form_pg_sequence) GETSTRUCT(pgstuple);
values[0] = Int64GetDatum(pgsform->seqstart);
values[1] = Int64GetDatum(pgsform->seqmin);
values[2] = Int64GetDatum(pgsform->seqmax);
values[3] = Int64GetDatum(pgsform->seqincrement);
values[4] = BoolGetDatum(pgsform->seqcycle);
values[5] = Int64GetDatum(pgsform->seqcache);
values[6] = ObjectIdGetDatum(pgsform->seqtypid);
ReleaseSysCache(pgstuple);
return HeapTupleGetDatum(heap_form_tuple(tupdesc, values, isnull));
}
/*
* Return the last value from the sequence
*
* Note: This has a completely different meaning than lastval().
*/
Datum
pg_sequence_last_value(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
SeqTable elm;
Relation seqrel;
Buffer buf;
HeapTupleData seqtuple;
Form_pg_sequence_data seq;
bool is_called;
int64 result;
Fix ALTER SEQUENCE locking In 1753b1b027035029c2a2a1649065762fafbf63f3, the pg_sequence system catalog was introduced. This made sequence metadata changes transactional, while the actual sequence values are still behaving nontransactionally. This requires some refinement in how ALTER SEQUENCE, which operates on both, locks the sequence and the catalog. The main problems were: - Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error, caused by updates to pg_sequence catalog. - Sequence WAL writes and catalog updates are not protected by same lock, which could lead to inconsistent recovery order. - nextval() disregarding uncommitted ALTER SEQUENCE changes. To fix, nextval() and friends now lock the sequence using RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the sequence using ShareRowExclusiveLock. This means that nextval() and ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence blocks itself. (This was already the case previously for the OWNER TO, RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the entire AlterSequence is protected by the lock on the sequence. As an exception, use reduced locking for ALTER SEQUENCE ... RESTART. Since that is basically a setval(), it does not require the full locking of other ALTER SEQUENCE actions. So check whether we are only running a RESTART and run with less locking if so. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reported-by: Jason Petersen <jason@citusdata.com> Reported-by: Andres Freund <andres@anarazel.de>
2017-05-10 05:35:31 +02:00
/* open and lock sequence */
init_sequence(relid, &elm, &seqrel);
if (pg_class_aclcheck(relid, GetUserId(), ACL_SELECT | ACL_USAGE) != ACLCHECK_OK)
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("permission denied for sequence %s",
RelationGetRelationName(seqrel))));
seq = read_seq_tuple(seqrel, &buf, &seqtuple);
is_called = seq->is_called;
result = seq->last_value;
UnlockReleaseBuffer(buf);
relation_close(seqrel, NoLock);
if (is_called)
PG_RETURN_INT64(result);
else
PG_RETURN_NULL();
}
void
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
seq_redo(XLogReaderState *record)
{
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
XLogRecPtr lsn = record->EndRecPtr;
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
Buffer buffer;
Page page;
Page localpage;
char *item;
Size itemsz;
xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
sequence_magic *sm;
if (info != XLOG_SEQ_LOG)
elog(PANIC, "seq_redo: unknown op code %u", info);
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
buffer = XLogInitBufferForRedo(record, 0);
page = (Page) BufferGetPage(buffer);
/*
* We always reinit the page. However, since this WAL record type is also
* used for updating sequences, it's possible that a hot-standby backend
* is examining the page concurrently; so we mustn't transiently trash the
* buffer. The solution is to build the correct new page contents in
* local workspace and then memcpy into the buffer. Then only bytes that
* are supposed to change will change, even transiently. We must palloc
* the local page for alignment reasons.
*/
localpage = (Page) palloc(BufferGetPageSize(buffer));
PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));
sm = (sequence_magic *) PageGetSpecialPointer(localpage);
sm->magic = SEQ_MAGIC;
item = (char *) xlrec + sizeof(xl_seq_rec);
Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-20 16:56:26 +01:00
itemsz = XLogRecGetDataLen(record) - sizeof(xl_seq_rec);
Fix longstanding crash-safety bug with newly-created-or-reset sequences. If a crash occurred immediately after the first nextval() call for a serial column, WAL replay would restore the sequence to a state in which it appeared that no nextval() had been done, thus allowing the first sequence value to be returned again by the next nextval() call; as reported in bug #6748 from Xiangming Mei. More generally, the problem would occur if an ALTER SEQUENCE was executed on a freshly created or reset sequence. (The manifestation with serial columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step to serial column creation.) The cause is that sequence creation attempted to save one WAL entry by writing out a WAL record that made it appear that the first nextval() had already happened (viz, with is_called = true), while marking the sequence's in-database state with log_cnt = 1 to show that the first nextval() need not emit a WAL record. However, ALTER SEQUENCE would emit a new WAL entry reflecting the actual in-database state (with is_called = false). Then, nextval would allocate the first sequence value and set is_called = true, but it would trust the log_cnt value and not emit any WAL record. A crash at this point would thus restore the sequence to its post-ALTER state, causing the next nextval() call to return the first sequence value again. To fix, get rid of the idea of logging an is_called status different from reality. This means that the first nextval-driven WAL record will happen at the first nextval call not the second, but the marginal cost of that is pretty negligible. In addition, make sure that ALTER SEQUENCE resets log_cnt to zero in any case where it touches sequence parameters that affect future nextval results. This will result in some user-visible changes in the contents of a sequence's log_cnt column, as reflected in the patch's regression test changes; but no application should be depending on that anyway, since it was already true that log_cnt changes rather unpredictably depending on checkpoint timing. In addition, make some basically-cosmetic improvements to get rid of sequence.c's undesirable intimacy with page layout details. It was always really trying to WAL-log the contents of the sequence tuple, so we should have it do that directly using a HeapTuple's t_data and t_len, rather than backing into it with some magic assumptions about where the tuple would be on the sequence's page. Back-patch to all supported branches.
2012-07-25 23:40:36 +02:00
if (PageAddItem(localpage, (Item) item, itemsz,
FirstOffsetNumber, false, false) == InvalidOffsetNumber)
elog(PANIC, "seq_redo: failed to add item to page");
PageSetLSN(localpage, lsn);
memcpy(page, localpage, BufferGetPageSize(buffer));
MarkBufferDirty(buffer);
UnlockReleaseBuffer(buffer);
pfree(localpage);
}
/*
* Flush cached sequence information.
*/
void
ResetSequenceCaches(void)
{
if (seqhashtab)
{
hash_destroy(seqhashtab);
seqhashtab = NULL;
}
last_used_seq = NULL;
}
/*
* Mask a Sequence page before performing consistency checks on it.
*/
void
seq_mask(char *page, BlockNumber blkno)
{
mask_page_lsn_and_checksum(page);
mask_unused_space(page);
}