2010-02-07 21:48:13 +01:00
|
|
|
/*-------------------------------------------------------------------------
|
|
|
|
*
|
|
|
|
* relmapper.c
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* Catalog-to-filenumber mapping
|
2010-02-07 21:48:13 +01:00
|
|
|
*
|
|
|
|
* For most tables, the physical file underlying the table is specified by
|
|
|
|
* pg_class.relfilenode. However, that obviously won't work for pg_class
|
|
|
|
* itself, nor for the other "nailed" catalogs for which we have to be able
|
|
|
|
* to set up working Relation entries without access to pg_class. It also
|
|
|
|
* does not work for shared catalogs, since there is no practical way to
|
|
|
|
* update other databases' pg_class entries when relocating a shared catalog.
|
|
|
|
* Therefore, for these special catalogs (henceforth referred to as "mapped
|
|
|
|
* catalogs") we rely on a separately maintained file that shows the mapping
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* from catalog OIDs to filenumbers. Each database has a map file for
|
2010-02-07 21:48:13 +01:00
|
|
|
* its local mapped catalogs, and there is a separate map file for shared
|
|
|
|
* catalogs. Mapped catalogs have zero in their pg_class.relfilenode entries.
|
|
|
|
*
|
|
|
|
* Relocation of a normal table is committed (ie, the new physical file becomes
|
|
|
|
* authoritative) when the pg_class row update commits. For mapped catalogs,
|
|
|
|
* the act of updating the map file is effectively commit of the relocation.
|
|
|
|
* We postpone the file update till just before commit of the transaction
|
|
|
|
* doing the rewrite, but there is necessarily a window between. Therefore
|
|
|
|
* mapped catalogs can only be relocated by operations such as VACUUM FULL
|
|
|
|
* and CLUSTER, which make no transactionally-significant changes: it must be
|
|
|
|
* safe for the new file to replace the old, even if the transaction itself
|
|
|
|
* aborts. An important factor here is that the indexes and toast table of
|
|
|
|
* a mapped catalog must also be mapped, so that the rewrites/relocations of
|
|
|
|
* all these files commit in a single map file update rather than being tied
|
|
|
|
* to transaction commit.
|
|
|
|
*
|
2023-01-02 21:00:37 +01:00
|
|
|
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
|
2010-02-07 21:48:13 +01:00
|
|
|
* Portions Copyright (c) 1994, Regents of the University of California
|
|
|
|
*
|
|
|
|
*
|
|
|
|
* IDENTIFICATION
|
2010-09-20 22:08:53 +02:00
|
|
|
* src/backend/utils/cache/relmapper.c
|
2010-02-07 21:48:13 +01:00
|
|
|
*
|
|
|
|
*-------------------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
#include "postgres.h"
|
|
|
|
|
|
|
|
#include <fcntl.h>
|
2010-02-07 23:00:53 +01:00
|
|
|
#include <sys/stat.h>
|
2010-02-07 21:48:13 +01:00
|
|
|
#include <unistd.h>
|
|
|
|
|
|
|
|
#include "access/xact.h"
|
2014-11-06 12:52:08 +01:00
|
|
|
#include "access/xlog.h"
|
|
|
|
#include "access/xloginsert.h"
|
2010-02-07 21:48:13 +01:00
|
|
|
#include "catalog/catalog.h"
|
|
|
|
#include "catalog/pg_tablespace.h"
|
|
|
|
#include "catalog/storage.h"
|
|
|
|
#include "miscadmin.h"
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
#include "pgstat.h"
|
2010-02-07 21:48:13 +01:00
|
|
|
#include "storage/fd.h"
|
2011-09-04 07:13:16 +02:00
|
|
|
#include "storage/lwlock.h"
|
2010-02-07 21:48:13 +01:00
|
|
|
#include "utils/inval.h"
|
|
|
|
#include "utils/relmapper.h"
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The map file is critical data: we have no automatic method for recovering
|
|
|
|
* from loss or corruption of it. We use a CRC so that we can detect
|
2022-07-26 20:56:25 +02:00
|
|
|
* corruption. Since the file might be more than one standard-size disk
|
|
|
|
* sector in size, we cannot rely on overwrite-in-place. Instead, we generate
|
|
|
|
* a new file and rename it into place, atomically replacing the original file.
|
2010-02-07 21:48:13 +01:00
|
|
|
*
|
|
|
|
* Entries in the mappings[] array are in no particular order. We could
|
|
|
|
* speed searching by insisting on OID order, but it really shouldn't be
|
|
|
|
* worth the trouble given the intended size of the mapping sets.
|
|
|
|
*/
|
|
|
|
#define RELMAPPER_FILENAME "pg_filenode.map"
|
2022-07-26 20:56:25 +02:00
|
|
|
#define RELMAPPER_TEMP_FILENAME "pg_filenode.map.tmp"
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
#define RELMAPPER_FILEMAGIC 0x592717 /* version ID value */
|
|
|
|
|
2022-07-26 20:56:25 +02:00
|
|
|
/*
|
|
|
|
* There's no need for this constant to have any particular value, and we
|
|
|
|
* can raise it as necessary if we end up with more mapped relations. For
|
|
|
|
* now, we just pick a round number that is modestly larger than the expected
|
|
|
|
* number of mappings.
|
|
|
|
*/
|
|
|
|
#define MAX_MAPPINGS 64
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
typedef struct RelMapping
|
|
|
|
{
|
|
|
|
Oid mapoid; /* OID of a catalog */
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
RelFileNumber mapfilenumber; /* its rel file number */
|
2010-02-07 21:48:13 +01:00
|
|
|
} RelMapping;
|
|
|
|
|
|
|
|
typedef struct RelMapFile
|
|
|
|
{
|
|
|
|
int32 magic; /* always RELMAPPER_FILEMAGIC */
|
|
|
|
int32 num_mappings; /* number of valid RelMapping entries */
|
|
|
|
RelMapping mappings[MAX_MAPPINGS];
|
2015-04-14 16:03:42 +02:00
|
|
|
pg_crc32c crc; /* CRC of all above */
|
2010-02-07 21:48:13 +01:00
|
|
|
} RelMapFile;
|
|
|
|
|
Handle parallel index builds on mapped relations.
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes. This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode. When this happened, the final index structure
was typically not consistent with the owning table's structure. The
final index structure could contain entries formed from both heap
relfilenodes. Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.
On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3). On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.
Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 22:01:34 +02:00
|
|
|
/*
|
|
|
|
* State for serializing local and shared relmappings for parallel workers
|
|
|
|
* (active states only). See notes on active_* and pending_* updates state.
|
|
|
|
*/
|
|
|
|
typedef struct SerializedActiveRelMaps
|
|
|
|
{
|
|
|
|
RelMapFile active_shared_updates;
|
|
|
|
RelMapFile active_local_updates;
|
|
|
|
} SerializedActiveRelMaps;
|
|
|
|
|
2010-02-07 21:48:13 +01:00
|
|
|
/*
|
|
|
|
* The currently known contents of the shared map file and our database's
|
|
|
|
* local map file are stored here. These can be reloaded from disk
|
|
|
|
* immediately whenever we receive an update sinval message.
|
|
|
|
*/
|
|
|
|
static RelMapFile shared_map;
|
|
|
|
static RelMapFile local_map;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We use the same RelMapFile data structure to track uncommitted local
|
|
|
|
* changes in the mappings (but note the magic and crc fields are not made
|
|
|
|
* valid in these variables). Currently, map updates are not allowed within
|
|
|
|
* subtransactions, so one set of transaction-level changes is sufficient.
|
|
|
|
*
|
|
|
|
* The active_xxx variables contain updates that are valid in our transaction
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* and should be honored by RelationMapOidToFilenumber. The pending_xxx
|
2010-02-07 21:48:13 +01:00
|
|
|
* variables contain updates we have been told about that aren't active yet;
|
|
|
|
* they will become active at the next CommandCounterIncrement. This setup
|
|
|
|
* lets map updates act similarly to updates of pg_class rows, ie, they
|
|
|
|
* become visible only at the next CommandCounterIncrement boundary.
|
Handle parallel index builds on mapped relations.
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes. This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode. When this happened, the final index structure
was typically not consistent with the owning table's structure. The
final index structure could contain entries formed from both heap
relfilenodes. Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.
On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3). On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.
Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 22:01:34 +02:00
|
|
|
*
|
|
|
|
* Active shared and active local updates are serialized by the parallel
|
|
|
|
* infrastructure, and deserialized within parallel workers.
|
2010-02-07 21:48:13 +01:00
|
|
|
*/
|
|
|
|
static RelMapFile active_shared_updates;
|
|
|
|
static RelMapFile active_local_updates;
|
|
|
|
static RelMapFile pending_shared_updates;
|
|
|
|
static RelMapFile pending_local_updates;
|
|
|
|
|
|
|
|
|
|
|
|
/* non-export function prototypes */
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
static void apply_map_update(RelMapFile *map, Oid relationId,
|
2022-09-20 22:09:30 +02:00
|
|
|
RelFileNumber fileNumber, bool add_okay);
|
2010-02-07 21:48:13 +01:00
|
|
|
static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
|
|
|
|
bool add_okay);
|
2021-06-24 09:45:23 +02:00
|
|
|
static void load_relmap_file(bool shared, bool lock_held);
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
static void read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
|
|
|
|
int elevel);
|
|
|
|
static void write_relmap_file(RelMapFile *newmap, bool write_wal,
|
|
|
|
bool send_sinval, bool preserve_files,
|
2010-02-07 21:48:13 +01:00
|
|
|
Oid dbid, Oid tsid, const char *dbpath);
|
|
|
|
static void perform_relmap_update(bool shared, const RelMapFile *updates);
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* RelationMapOidToFilenumber
|
2010-02-07 21:48:13 +01:00
|
|
|
*
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* The raison d' etre ... given a relation OID, look up its filenumber.
|
2010-02-07 21:48:13 +01:00
|
|
|
*
|
|
|
|
* Although shared and local relation OIDs should never overlap, the caller
|
|
|
|
* always knows which we need --- so pass that information to avoid useless
|
|
|
|
* searching.
|
|
|
|
*
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* Returns InvalidRelFileNumber if the OID is not known (which should never
|
|
|
|
* happen, but the caller is in a better position to report a meaningful
|
|
|
|
* error).
|
2010-02-07 21:48:13 +01:00
|
|
|
*/
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
RelFileNumber
|
|
|
|
RelationMapOidToFilenumber(Oid relationId, bool shared)
|
2010-02-07 21:48:13 +01:00
|
|
|
{
|
|
|
|
const RelMapFile *map;
|
|
|
|
int32 i;
|
|
|
|
|
|
|
|
/* If there are active updates, believe those over the main maps */
|
|
|
|
if (shared)
|
|
|
|
{
|
|
|
|
map = &active_shared_updates;
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
|
|
|
if (relationId == map->mappings[i].mapoid)
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
return map->mappings[i].mapfilenumber;
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
map = &shared_map;
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
|
|
|
if (relationId == map->mappings[i].mapoid)
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
return map->mappings[i].mapfilenumber;
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
map = &active_local_updates;
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
|
|
|
if (relationId == map->mappings[i].mapoid)
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
return map->mappings[i].mapfilenumber;
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
map = &local_map;
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
|
|
|
if (relationId == map->mappings[i].mapoid)
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
return map->mappings[i].mapfilenumber;
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
return InvalidRelFileNumber;
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
|
2013-07-22 16:34:34 +02:00
|
|
|
/*
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* RelationMapFilenumberToOid
|
2013-07-22 16:34:34 +02:00
|
|
|
*
|
|
|
|
* Do the reverse of the normal direction of mapping done in
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* RelationMapOidToFilenumber.
|
2013-07-22 16:34:34 +02:00
|
|
|
*
|
|
|
|
* This is not supposed to be used during normal running but rather for
|
|
|
|
* information purposes when looking at the filesystem or xlog.
|
|
|
|
*
|
|
|
|
* Returns InvalidOid if the OID is not known; this can easily happen if the
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* relfilenumber doesn't pertain to a mapped relation.
|
2013-07-22 16:34:34 +02:00
|
|
|
*/
|
|
|
|
Oid
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
RelationMapFilenumberToOid(RelFileNumber filenumber, bool shared)
|
2013-07-22 16:34:34 +02:00
|
|
|
{
|
|
|
|
const RelMapFile *map;
|
|
|
|
int32 i;
|
|
|
|
|
|
|
|
/* If there are active updates, believe those over the main maps */
|
|
|
|
if (shared)
|
|
|
|
{
|
|
|
|
map = &active_shared_updates;
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
if (filenumber == map->mappings[i].mapfilenumber)
|
2013-07-22 16:34:34 +02:00
|
|
|
return map->mappings[i].mapoid;
|
|
|
|
}
|
|
|
|
map = &shared_map;
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
if (filenumber == map->mappings[i].mapfilenumber)
|
2013-07-22 16:34:34 +02:00
|
|
|
return map->mappings[i].mapoid;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
map = &active_local_updates;
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
if (filenumber == map->mappings[i].mapfilenumber)
|
2013-07-22 16:34:34 +02:00
|
|
|
return map->mappings[i].mapoid;
|
|
|
|
}
|
|
|
|
map = &local_map;
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
if (filenumber == map->mappings[i].mapfilenumber)
|
2013-07-22 16:34:34 +02:00
|
|
|
return map->mappings[i].mapoid;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return InvalidOid;
|
|
|
|
}
|
|
|
|
|
Add new block-by-block strategy for CREATE DATABASE.
Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
Dilip Kumar, with a fairly large number of cosmetic changes by me.
Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor,
Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian,
Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro
Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.
Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
2022-03-29 17:31:43 +02:00
|
|
|
/*
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* RelationMapOidToFilenumberForDatabase
|
Add new block-by-block strategy for CREATE DATABASE.
Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
Dilip Kumar, with a fairly large number of cosmetic changes by me.
Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor,
Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian,
Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro
Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.
Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
2022-03-29 17:31:43 +02:00
|
|
|
*
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* Like RelationMapOidToFilenumber, but reads the mapping from the indicated
|
Add new block-by-block strategy for CREATE DATABASE.
Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
Dilip Kumar, with a fairly large number of cosmetic changes by me.
Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor,
Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian,
Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro
Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.
Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
2022-03-29 17:31:43 +02:00
|
|
|
* path instead of using the one for the current database.
|
|
|
|
*/
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
RelFileNumber
|
|
|
|
RelationMapOidToFilenumberForDatabase(char *dbpath, Oid relationId)
|
Add new block-by-block strategy for CREATE DATABASE.
Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
Dilip Kumar, with a fairly large number of cosmetic changes by me.
Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor,
Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian,
Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro
Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.
Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
2022-03-29 17:31:43 +02:00
|
|
|
{
|
|
|
|
RelMapFile map;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* Read the relmap file from the source database. */
|
|
|
|
read_relmap_file(&map, dbpath, false, ERROR);
|
|
|
|
|
|
|
|
/* Iterate over the relmap entries to find the input relation OID. */
|
|
|
|
for (i = 0; i < map.num_mappings; i++)
|
|
|
|
{
|
|
|
|
if (relationId == map.mappings[i].mapoid)
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
return map.mappings[i].mapfilenumber;
|
Add new block-by-block strategy for CREATE DATABASE.
Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
Dilip Kumar, with a fairly large number of cosmetic changes by me.
Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor,
Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian,
Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro
Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.
Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
2022-03-29 17:31:43 +02:00
|
|
|
}
|
|
|
|
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
return InvalidRelFileNumber;
|
Add new block-by-block strategy for CREATE DATABASE.
Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
Dilip Kumar, with a fairly large number of cosmetic changes by me.
Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor,
Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian,
Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro
Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.
Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
2022-03-29 17:31:43 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RelationMapCopy
|
|
|
|
*
|
|
|
|
* Copy relmapfile from source db path to the destination db path and WAL log
|
|
|
|
* the operation. This is intended for use in creating a new relmap file
|
|
|
|
* for a database that doesn't have one yet, not for replacing an existing
|
|
|
|
* relmap file.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
|
|
|
|
{
|
|
|
|
RelMapFile map;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the relmap file from the source database.
|
|
|
|
*/
|
|
|
|
read_relmap_file(&map, srcdbpath, false, ERROR);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Write the same data into the destination database's relmap file.
|
|
|
|
*
|
|
|
|
* No sinval is needed because no one can be connected to the destination
|
|
|
|
* database yet. For the same reason, there is no need to acquire
|
|
|
|
* RelationMappingLock.
|
|
|
|
*
|
|
|
|
* There's no point in trying to preserve files here. The new database
|
|
|
|
* isn't usable yet anyway, and won't ever be if we can't install a relmap
|
|
|
|
* file.
|
|
|
|
*/
|
|
|
|
write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
|
|
|
|
}
|
|
|
|
|
2010-02-07 21:48:13 +01:00
|
|
|
/*
|
|
|
|
* RelationMapUpdateMap
|
|
|
|
*
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* Install a new relfilenumber mapping for the specified relation.
|
2010-02-07 21:48:13 +01:00
|
|
|
*
|
|
|
|
* If immediate is true (or we're bootstrapping), the mapping is activated
|
|
|
|
* immediately. Otherwise it is made pending until CommandCounterIncrement.
|
|
|
|
*/
|
|
|
|
void
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
RelationMapUpdateMap(Oid relationId, RelFileNumber fileNumber, bool shared,
|
2010-02-07 21:48:13 +01:00
|
|
|
bool immediate)
|
|
|
|
{
|
|
|
|
RelMapFile *map;
|
|
|
|
|
|
|
|
if (IsBootstrapProcessingMode())
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* In bootstrap mode, the mapping gets installed in permanent map.
|
|
|
|
*/
|
|
|
|
if (shared)
|
|
|
|
map = &shared_map;
|
|
|
|
else
|
|
|
|
map = &local_map;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
Handle parallel index builds on mapped relations.
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes. This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode. When this happened, the final index structure
was typically not consistent with the owning table's structure. The
final index structure could contain entries formed from both heap
relfilenodes. Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.
On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3). On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.
Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 22:01:34 +02:00
|
|
|
* We don't currently support map changes within subtransactions, or
|
|
|
|
* when in parallel mode. This could be done with more bookkeeping
|
|
|
|
* infrastructure, but it doesn't presently seem worth it.
|
2010-02-07 21:48:13 +01:00
|
|
|
*/
|
|
|
|
if (GetCurrentTransactionNestLevel() > 1)
|
|
|
|
elog(ERROR, "cannot change relation mapping within subtransaction");
|
|
|
|
|
Handle parallel index builds on mapped relations.
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes. This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode. When this happened, the final index structure
was typically not consistent with the owning table's structure. The
final index structure could contain entries formed from both heap
relfilenodes. Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.
On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3). On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.
Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 22:01:34 +02:00
|
|
|
if (IsInParallelMode())
|
|
|
|
elog(ERROR, "cannot change relation mapping in parallel mode");
|
|
|
|
|
2010-02-07 21:48:13 +01:00
|
|
|
if (immediate)
|
|
|
|
{
|
|
|
|
/* Make it active, but only locally */
|
|
|
|
if (shared)
|
|
|
|
map = &active_shared_updates;
|
|
|
|
else
|
|
|
|
map = &active_local_updates;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* Make it pending */
|
|
|
|
if (shared)
|
|
|
|
map = &pending_shared_updates;
|
|
|
|
else
|
|
|
|
map = &pending_local_updates;
|
|
|
|
}
|
|
|
|
}
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
apply_map_update(map, relationId, fileNumber, true);
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* apply_map_update
|
|
|
|
*
|
|
|
|
* Insert a new mapping into the given map variable, replacing any existing
|
|
|
|
* mapping for the same relation.
|
|
|
|
*
|
|
|
|
* In some cases the caller knows there must be an existing mapping; pass
|
|
|
|
* add_okay = false to draw an error if not.
|
|
|
|
*/
|
|
|
|
static void
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
apply_map_update(RelMapFile *map, Oid relationId, RelFileNumber fileNumber,
|
|
|
|
bool add_okay)
|
2010-02-07 21:48:13 +01:00
|
|
|
{
|
|
|
|
int32 i;
|
|
|
|
|
|
|
|
/* Replace any existing mapping */
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
|
|
|
if (relationId == map->mappings[i].mapoid)
|
|
|
|
{
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
map->mappings[i].mapfilenumber = fileNumber;
|
2010-02-07 21:48:13 +01:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Nope, need to add a new mapping */
|
|
|
|
if (!add_okay)
|
|
|
|
elog(ERROR, "attempt to apply a mapping to unmapped relation %u",
|
|
|
|
relationId);
|
|
|
|
if (map->num_mappings >= MAX_MAPPINGS)
|
|
|
|
elog(ERROR, "ran out of space in relation map");
|
|
|
|
map->mappings[map->num_mappings].mapoid = relationId;
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
map->mappings[map->num_mappings].mapfilenumber = fileNumber;
|
2010-02-07 21:48:13 +01:00
|
|
|
map->num_mappings++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* merge_map_updates
|
|
|
|
*
|
|
|
|
* Merge all the updates in the given pending-update map into the target map.
|
|
|
|
* This is just a bulk form of apply_map_update.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
merge_map_updates(RelMapFile *map, const RelMapFile *updates, bool add_okay)
|
|
|
|
{
|
|
|
|
int32 i;
|
|
|
|
|
|
|
|
for (i = 0; i < updates->num_mappings; i++)
|
|
|
|
{
|
|
|
|
apply_map_update(map,
|
|
|
|
updates->mappings[i].mapoid,
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
updates->mappings[i].mapfilenumber,
|
2010-02-07 21:48:13 +01:00
|
|
|
add_okay);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RelationMapRemoveMapping
|
|
|
|
*
|
|
|
|
* Remove a relation's entry in the map. This is only allowed for "active"
|
|
|
|
* (but not committed) local mappings. We need it so we can back out the
|
|
|
|
* entry for the transient target file when doing VACUUM FULL/CLUSTER on
|
|
|
|
* a mapped relation.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RelationMapRemoveMapping(Oid relationId)
|
|
|
|
{
|
|
|
|
RelMapFile *map = &active_local_updates;
|
|
|
|
int32 i;
|
|
|
|
|
|
|
|
for (i = 0; i < map->num_mappings; i++)
|
|
|
|
{
|
|
|
|
if (relationId == map->mappings[i].mapoid)
|
|
|
|
{
|
|
|
|
/* Found it, collapse it out */
|
|
|
|
map->mappings[i] = map->mappings[map->num_mappings - 1];
|
|
|
|
map->num_mappings--;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
elog(ERROR, "could not find temporary mapping for relation %u",
|
|
|
|
relationId);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RelationMapInvalidate
|
|
|
|
*
|
|
|
|
* This routine is invoked for SI cache flush messages. We must re-read
|
|
|
|
* the indicated map file. However, we might receive a SI message in a
|
|
|
|
* process that hasn't yet, and might never, load the mapping files;
|
|
|
|
* for example the autovacuum launcher, which *must not* try to read
|
|
|
|
* a local map since it is attached to no particular database.
|
|
|
|
* So, re-read only if the map is valid now.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RelationMapInvalidate(bool shared)
|
|
|
|
{
|
|
|
|
if (shared)
|
|
|
|
{
|
|
|
|
if (shared_map.magic == RELMAPPER_FILEMAGIC)
|
2021-06-24 09:45:23 +02:00
|
|
|
load_relmap_file(true, false);
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
if (local_map.magic == RELMAPPER_FILEMAGIC)
|
2021-06-24 09:45:23 +02:00
|
|
|
load_relmap_file(false, false);
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RelationMapInvalidateAll
|
|
|
|
*
|
|
|
|
* Reload all map files. This is used to recover from SI message buffer
|
|
|
|
* overflow: we can't be sure if we missed an inval message.
|
|
|
|
* Again, reload only currently-valid maps.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RelationMapInvalidateAll(void)
|
|
|
|
{
|
|
|
|
if (shared_map.magic == RELMAPPER_FILEMAGIC)
|
2021-06-24 09:45:23 +02:00
|
|
|
load_relmap_file(true, false);
|
2010-02-07 21:48:13 +01:00
|
|
|
if (local_map.magic == RELMAPPER_FILEMAGIC)
|
2021-06-24 09:45:23 +02:00
|
|
|
load_relmap_file(false, false);
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* AtCCI_RelationMap
|
|
|
|
*
|
|
|
|
* Activate any "pending" relation map updates at CommandCounterIncrement time.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
AtCCI_RelationMap(void)
|
|
|
|
{
|
|
|
|
if (pending_shared_updates.num_mappings != 0)
|
|
|
|
{
|
|
|
|
merge_map_updates(&active_shared_updates,
|
|
|
|
&pending_shared_updates,
|
|
|
|
true);
|
|
|
|
pending_shared_updates.num_mappings = 0;
|
|
|
|
}
|
|
|
|
if (pending_local_updates.num_mappings != 0)
|
|
|
|
{
|
|
|
|
merge_map_updates(&active_local_updates,
|
|
|
|
&pending_local_updates,
|
|
|
|
true);
|
|
|
|
pending_local_updates.num_mappings = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* AtEOXact_RelationMap
|
|
|
|
*
|
|
|
|
* Handle relation mapping at main-transaction commit or abort.
|
|
|
|
*
|
|
|
|
* During commit, this must be called as late as possible before the actual
|
|
|
|
* transaction commit, so as to minimize the window where the transaction
|
|
|
|
* could still roll back after committing map changes. Although nothing
|
|
|
|
* critically bad happens in such a case, we still would prefer that it
|
|
|
|
* not happen, since we'd possibly be losing useful updates to the relations'
|
|
|
|
* pg_class row(s).
|
|
|
|
*
|
|
|
|
* During abort, we just have to throw away any pending map changes.
|
|
|
|
* Normal post-abort cleanup will take care of fixing relcache entries.
|
Handle parallel index builds on mapped relations.
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes. This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode. When this happened, the final index structure
was typically not consistent with the owning table's structure. The
final index structure could contain entries formed from both heap
relfilenodes. Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.
On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3). On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.
Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 22:01:34 +02:00
|
|
|
* Parallel worker commit/abort is handled by resetting active mappings
|
|
|
|
* that may have been received from the leader process. (There should be
|
|
|
|
* no pending updates in parallel workers.)
|
2010-02-07 21:48:13 +01:00
|
|
|
*/
|
|
|
|
void
|
Handle parallel index builds on mapped relations.
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes. This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode. When this happened, the final index structure
was typically not consistent with the owning table's structure. The
final index structure could contain entries formed from both heap
relfilenodes. Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.
On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3). On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.
Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 22:01:34 +02:00
|
|
|
AtEOXact_RelationMap(bool isCommit, bool isParallelWorker)
|
2010-02-07 21:48:13 +01:00
|
|
|
{
|
Handle parallel index builds on mapped relations.
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes. This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode. When this happened, the final index structure
was typically not consistent with the owning table's structure. The
final index structure could contain entries formed from both heap
relfilenodes. Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.
On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3). On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.
Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 22:01:34 +02:00
|
|
|
if (isCommit && !isParallelWorker)
|
2010-02-07 21:48:13 +01:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We should not get here with any "pending" updates. (We could
|
|
|
|
* logically choose to treat such as committed, but in the current
|
|
|
|
* code this should never happen.)
|
|
|
|
*/
|
|
|
|
Assert(pending_shared_updates.num_mappings == 0);
|
|
|
|
Assert(pending_local_updates.num_mappings == 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Write any active updates to the actual map files, then reset them.
|
|
|
|
*/
|
|
|
|
if (active_shared_updates.num_mappings != 0)
|
|
|
|
{
|
|
|
|
perform_relmap_update(true, &active_shared_updates);
|
|
|
|
active_shared_updates.num_mappings = 0;
|
|
|
|
}
|
|
|
|
if (active_local_updates.num_mappings != 0)
|
|
|
|
{
|
|
|
|
perform_relmap_update(false, &active_local_updates);
|
|
|
|
active_local_updates.num_mappings = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
Handle parallel index builds on mapped relations.
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes. This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode. When this happened, the final index structure
was typically not consistent with the owning table's structure. The
final index structure could contain entries formed from both heap
relfilenodes. Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.
On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3). On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.
Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 22:01:34 +02:00
|
|
|
/* Abort or parallel worker --- drop all local and pending updates */
|
|
|
|
Assert(!isParallelWorker || pending_shared_updates.num_mappings == 0);
|
|
|
|
Assert(!isParallelWorker || pending_local_updates.num_mappings == 0);
|
|
|
|
|
2010-02-07 21:48:13 +01:00
|
|
|
active_shared_updates.num_mappings = 0;
|
|
|
|
active_local_updates.num_mappings = 0;
|
|
|
|
pending_shared_updates.num_mappings = 0;
|
|
|
|
pending_local_updates.num_mappings = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* AtPrepare_RelationMap
|
|
|
|
*
|
|
|
|
* Handle relation mapping at PREPARE.
|
|
|
|
*
|
|
|
|
* Currently, we don't support preparing any transaction that changes the map.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
AtPrepare_RelationMap(void)
|
|
|
|
{
|
|
|
|
if (active_shared_updates.num_mappings != 0 ||
|
|
|
|
active_local_updates.num_mappings != 0 ||
|
|
|
|
pending_shared_updates.num_mappings != 0 ||
|
|
|
|
pending_local_updates.num_mappings != 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
|
|
|
|
errmsg("cannot PREPARE a transaction that modified relation mapping")));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* CheckPointRelationMap
|
|
|
|
*
|
|
|
|
* This is called during a checkpoint. It must ensure that any relation map
|
|
|
|
* updates that were WAL-logged before the start of the checkpoint are
|
|
|
|
* securely flushed to disk and will not need to be replayed later. This
|
|
|
|
* seems unlikely to be a performance-critical issue, so we use a simple
|
|
|
|
* method: we just take and release the RelationMappingLock. This ensures
|
|
|
|
* that any already-logged map update is complete, because write_relmap_file
|
|
|
|
* will fsync the map file before the lock is released.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
CheckPointRelationMap(void)
|
|
|
|
{
|
|
|
|
LWLockAcquire(RelationMappingLock, LW_SHARED);
|
|
|
|
LWLockRelease(RelationMappingLock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RelationMapFinishBootstrap
|
|
|
|
*
|
|
|
|
* Write out the initial relation mapping files at the completion of
|
|
|
|
* bootstrap. All the mapped files should have been made known to us
|
|
|
|
* via RelationMapUpdateMap calls.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RelationMapFinishBootstrap(void)
|
|
|
|
{
|
|
|
|
Assert(IsBootstrapProcessingMode());
|
|
|
|
|
|
|
|
/* Shouldn't be anything "pending" ... */
|
|
|
|
Assert(active_shared_updates.num_mappings == 0);
|
|
|
|
Assert(active_local_updates.num_mappings == 0);
|
|
|
|
Assert(pending_shared_updates.num_mappings == 0);
|
|
|
|
Assert(pending_local_updates.num_mappings == 0);
|
|
|
|
|
|
|
|
/* Write the files; no WAL or sinval needed */
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
write_relmap_file(&shared_map, false, false, false,
|
|
|
|
InvalidOid, GLOBALTABLESPACE_OID, "global");
|
|
|
|
write_relmap_file(&local_map, false, false, false,
|
2010-02-07 21:48:13 +01:00
|
|
|
MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RelationMapInitialize
|
|
|
|
*
|
|
|
|
* This initializes the mapper module at process startup. We can't access the
|
|
|
|
* database yet, so just make sure the maps are empty.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RelationMapInitialize(void)
|
|
|
|
{
|
|
|
|
/* The static variables should initialize to zeroes, but let's be sure */
|
|
|
|
shared_map.magic = 0; /* mark it not loaded */
|
|
|
|
local_map.magic = 0;
|
|
|
|
shared_map.num_mappings = 0;
|
|
|
|
local_map.num_mappings = 0;
|
|
|
|
active_shared_updates.num_mappings = 0;
|
|
|
|
active_local_updates.num_mappings = 0;
|
|
|
|
pending_shared_updates.num_mappings = 0;
|
|
|
|
pending_local_updates.num_mappings = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RelationMapInitializePhase2
|
|
|
|
*
|
|
|
|
* This is called to prepare for access to pg_database during startup.
|
|
|
|
* We should be able to read the shared map file now.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RelationMapInitializePhase2(void)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* In bootstrap mode, the map file isn't there yet, so do nothing.
|
|
|
|
*/
|
|
|
|
if (IsBootstrapProcessingMode())
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load the shared map file, die on error.
|
|
|
|
*/
|
2021-06-24 09:45:23 +02:00
|
|
|
load_relmap_file(true, false);
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RelationMapInitializePhase3
|
|
|
|
*
|
|
|
|
* This is called as soon as we have determined MyDatabaseId and set up
|
|
|
|
* DatabasePath. At this point we should be able to read the local map file.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RelationMapInitializePhase3(void)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* In bootstrap mode, the map file isn't there yet, so do nothing.
|
|
|
|
*/
|
|
|
|
if (IsBootstrapProcessingMode())
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load the local map file, die on error.
|
|
|
|
*/
|
2021-06-24 09:45:23 +02:00
|
|
|
load_relmap_file(false, false);
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
|
Handle parallel index builds on mapped relations.
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes. This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode. When this happened, the final index structure
was typically not consistent with the owning table's structure. The
final index structure could contain entries formed from both heap
relfilenodes. Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.
On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3). On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.
Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 22:01:34 +02:00
|
|
|
/*
|
|
|
|
* EstimateRelationMapSpace
|
|
|
|
*
|
|
|
|
* Estimate space needed to pass active shared and local relmaps to parallel
|
|
|
|
* workers.
|
|
|
|
*/
|
|
|
|
Size
|
|
|
|
EstimateRelationMapSpace(void)
|
|
|
|
{
|
|
|
|
return sizeof(SerializedActiveRelMaps);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* SerializeRelationMap
|
|
|
|
*
|
|
|
|
* Serialize active shared and local relmap state for parallel workers.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
SerializeRelationMap(Size maxSize, char *startAddress)
|
|
|
|
{
|
|
|
|
SerializedActiveRelMaps *relmaps;
|
|
|
|
|
|
|
|
Assert(maxSize >= EstimateRelationMapSpace());
|
|
|
|
|
|
|
|
relmaps = (SerializedActiveRelMaps *) startAddress;
|
|
|
|
relmaps->active_shared_updates = active_shared_updates;
|
|
|
|
relmaps->active_local_updates = active_local_updates;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RestoreRelationMap
|
|
|
|
*
|
|
|
|
* Restore active shared and local relmap state within a parallel worker.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RestoreRelationMap(char *startAddress)
|
|
|
|
{
|
|
|
|
SerializedActiveRelMaps *relmaps;
|
|
|
|
|
|
|
|
if (active_shared_updates.num_mappings != 0 ||
|
|
|
|
active_local_updates.num_mappings != 0 ||
|
|
|
|
pending_shared_updates.num_mappings != 0 ||
|
|
|
|
pending_local_updates.num_mappings != 0)
|
|
|
|
elog(ERROR, "parallel worker has existing mappings");
|
|
|
|
|
|
|
|
relmaps = (SerializedActiveRelMaps *) startAddress;
|
|
|
|
active_shared_updates = relmaps->active_shared_updates;
|
|
|
|
active_local_updates = relmaps->active_local_updates;
|
|
|
|
}
|
|
|
|
|
2010-02-07 21:48:13 +01:00
|
|
|
/*
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
* load_relmap_file -- load the shared or local map file
|
2010-02-07 21:48:13 +01:00
|
|
|
*
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
* Because these files are essential for access to core system catalogs,
|
|
|
|
* failure to load either of them is a fatal error.
|
2010-02-07 21:48:13 +01:00
|
|
|
*
|
|
|
|
* Note that the local case requires DatabasePath to be set up.
|
|
|
|
*/
|
|
|
|
static void
|
2021-06-24 09:45:23 +02:00
|
|
|
load_relmap_file(bool shared, bool lock_held)
|
2010-02-07 21:48:13 +01:00
|
|
|
{
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
if (shared)
|
|
|
|
read_relmap_file(&shared_map, "global", lock_held, FATAL);
|
|
|
|
else
|
|
|
|
read_relmap_file(&local_map, DatabasePath, lock_held, FATAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* read_relmap_file -- load data from any relation mapper file
|
|
|
|
*
|
|
|
|
* dbpath must be the relevant database path, or "global" for shared relations.
|
|
|
|
*
|
|
|
|
* RelationMappingLock will be acquired released unless lock_held = true.
|
|
|
|
*
|
|
|
|
* Errors will be reported at the indicated elevel, which should be at least
|
|
|
|
* ERROR.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
|
|
|
|
{
|
2010-02-07 21:48:13 +01:00
|
|
|
char mapfilename[MAXPGPATH];
|
2015-04-14 16:03:42 +02:00
|
|
|
pg_crc32c crc;
|
2010-02-07 21:48:13 +01:00
|
|
|
int fd;
|
Rework error messages around file handling
Some error messages related to file handling are using the code path
context to define their state. For example, 2PC-related errors are
referring to "two-phase status files", or "relation mapping file" is
used for catalog-to-filenode mapping, however those prove to be
difficult to translate, and are not more helpful than just referring to
the path of the file being worked on. So simplify all those error
messages by just referring to files with their path used. In some
cases, like the manipulation of WAL segments, the context is actually
helpful so those are kept.
Calls to the system function read() have also been rather inconsistent
with their error handling sometimes not reporting the number of bytes
read, and some other code paths trying to use an errno which has not
been set. The in-core functions are using a more consistent pattern
with this patch, which checks for both errno if set or if an
inconsistent read is happening.
So as to care about pluralization when reading an unexpected number of
byte(s), "could not read: read %d of %zu" is used as error message, with
%d field being the output result of read() and %zu the expected size.
This simplifies the work of translators with less variations of the same
message.
Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20180520000522.GB1603@paquier.xyz
2018-07-18 01:01:23 +02:00
|
|
|
int r;
|
2010-02-07 21:48:13 +01:00
|
|
|
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
Assert(elevel >= ERROR);
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
/*
|
2021-06-24 09:45:23 +02:00
|
|
|
* Grab the lock to prevent the file from being updated while we read it,
|
|
|
|
* unless the caller is already holding the lock. If the file is updated
|
|
|
|
* shortly after we look, the sinval signaling mechanism will make us
|
|
|
|
* re-read it before we are able to access any relation that's affected by
|
|
|
|
* the change.
|
2010-02-07 21:48:13 +01:00
|
|
|
*/
|
2021-06-24 09:45:23 +02:00
|
|
|
if (!lock_held)
|
|
|
|
LWLockAcquire(RelationMappingLock, LW_SHARED);
|
|
|
|
|
2022-07-27 17:12:15 +02:00
|
|
|
/*
|
|
|
|
* Open the target file.
|
|
|
|
*
|
2023-05-19 23:24:48 +02:00
|
|
|
* Because Windows isn't happy about the idea of renaming over a file that
|
|
|
|
* someone has open, we only open this file after acquiring the lock, and
|
|
|
|
* for the same reason, we close it before releasing the lock. That way,
|
|
|
|
* by the time write_relmap_file() acquires an exclusive lock, no one else
|
|
|
|
* will have it open.
|
2022-07-27 17:12:15 +02:00
|
|
|
*/
|
|
|
|
snprintf(mapfilename, sizeof(mapfilename), "%s/%s", dbpath,
|
|
|
|
RELMAPPER_FILENAME);
|
|
|
|
fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
|
|
|
|
if (fd < 0)
|
|
|
|
ereport(elevel,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not open file \"%s\": %m",
|
|
|
|
mapfilename)));
|
|
|
|
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
/* Now read the data. */
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
|
Rework error messages around file handling
Some error messages related to file handling are using the code path
context to define their state. For example, 2PC-related errors are
referring to "two-phase status files", or "relation mapping file" is
used for catalog-to-filenode mapping, however those prove to be
difficult to translate, and are not more helpful than just referring to
the path of the file being worked on. So simplify all those error
messages by just referring to files with their path used. In some
cases, like the manipulation of WAL segments, the context is actually
helpful so those are kept.
Calls to the system function read() have also been rather inconsistent
with their error handling sometimes not reporting the number of bytes
read, and some other code paths trying to use an errno which has not
been set. The in-core functions are using a more consistent pattern
with this patch, which checks for both errno if set or if an
inconsistent read is happening.
So as to care about pluralization when reading an unexpected number of
byte(s), "could not read: read %d of %zu" is used as error message, with
%d field being the output result of read() and %zu the expected size.
This simplifies the work of translators with less variations of the same
message.
Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20180520000522.GB1603@paquier.xyz
2018-07-18 01:01:23 +02:00
|
|
|
r = read(fd, map, sizeof(RelMapFile));
|
|
|
|
if (r != sizeof(RelMapFile))
|
|
|
|
{
|
|
|
|
if (r < 0)
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
ereport(elevel,
|
Rework error messages around file handling
Some error messages related to file handling are using the code path
context to define their state. For example, 2PC-related errors are
referring to "two-phase status files", or "relation mapping file" is
used for catalog-to-filenode mapping, however those prove to be
difficult to translate, and are not more helpful than just referring to
the path of the file being worked on. So simplify all those error
messages by just referring to files with their path used. In some
cases, like the manipulation of WAL segments, the context is actually
helpful so those are kept.
Calls to the system function read() have also been rather inconsistent
with their error handling sometimes not reporting the number of bytes
read, and some other code paths trying to use an errno which has not
been set. The in-core functions are using a more consistent pattern
with this patch, which checks for both errno if set or if an
inconsistent read is happening.
So as to care about pluralization when reading an unexpected number of
byte(s), "could not read: read %d of %zu" is used as error message, with
%d field being the output result of read() and %zu the expected size.
This simplifies the work of translators with less variations of the same
message.
Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20180520000522.GB1603@paquier.xyz
2018-07-18 01:01:23 +02:00
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not read file \"%s\": %m", mapfilename)));
|
|
|
|
else
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
ereport(elevel,
|
2018-07-23 02:37:36 +02:00
|
|
|
(errcode(ERRCODE_DATA_CORRUPTED),
|
|
|
|
errmsg("could not read file \"%s\": read %d of %zu",
|
Rework error messages around file handling
Some error messages related to file handling are using the code path
context to define their state. For example, 2PC-related errors are
referring to "two-phase status files", or "relation mapping file" is
used for catalog-to-filenode mapping, however those prove to be
difficult to translate, and are not more helpful than just referring to
the path of the file being worked on. So simplify all those error
messages by just referring to files with their path used. In some
cases, like the manipulation of WAL segments, the context is actually
helpful so those are kept.
Calls to the system function read() have also been rather inconsistent
with their error handling sometimes not reporting the number of bytes
read, and some other code paths trying to use an errno which has not
been set. The in-core functions are using a more consistent pattern
with this patch, which checks for both errno if set or if an
inconsistent read is happening.
So as to care about pluralization when reading an unexpected number of
byte(s), "could not read: read %d of %zu" is used as error message, with
%d field being the output result of read() and %zu the expected size.
This simplifies the work of translators with less variations of the same
message.
Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20180520000522.GB1603@paquier.xyz
2018-07-18 01:01:23 +02:00
|
|
|
mapfilename, r, sizeof(RelMapFile))));
|
|
|
|
}
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_end();
|
2010-02-07 21:48:13 +01:00
|
|
|
|
2019-07-06 23:18:46 +02:00
|
|
|
if (CloseTransientFile(fd) != 0)
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
ereport(elevel,
|
Tighten use of OpenTransientFile and CloseTransientFile
This fixes two sets of issues related to the use of transient files in
the backend:
1) OpenTransientFile() has been used in some code paths with read-write
flags while read-only is sufficient, so switch those calls to be
read-only where necessary. These have been reported by Joe Conway.
2) When opening transient files, it is up to the caller to close the
file descriptors opened. In error code paths, CloseTransientFile() gets
called to clean up things before issuing an error. However in normal
exit paths, a lot of callers of CloseTransientFile() never actually
reported errors, which could leave a file descriptor open without
knowing about it. This is an issue I complained about a couple of
times, but never had the courage to write and submit a patch, so here we
go.
Note that one frontend code path is impacted by this commit so as an
error is issued when fetching control file data, making backend and
frontend to be treated consistently.
Reported-by: Joe Conway, Michael Paquier
Author: Michael Paquier
Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway
Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz
Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com
2019-03-09 00:50:55 +01:00
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not close file \"%s\": %m",
|
|
|
|
mapfilename)));
|
2010-02-07 21:48:13 +01:00
|
|
|
|
2022-07-27 17:12:15 +02:00
|
|
|
if (!lock_held)
|
|
|
|
LWLockRelease(RelationMappingLock);
|
|
|
|
|
2010-02-07 21:48:13 +01:00
|
|
|
/* check for correct magic number, etc */
|
|
|
|
if (map->magic != RELMAPPER_FILEMAGIC ||
|
|
|
|
map->num_mappings < 0 ||
|
|
|
|
map->num_mappings > MAX_MAPPINGS)
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
ereport(elevel,
|
2010-02-07 21:48:13 +01:00
|
|
|
(errmsg("relation mapping file \"%s\" contains invalid data",
|
|
|
|
mapfilename)));
|
|
|
|
|
|
|
|
/* verify the CRC */
|
Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.
Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.
The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 10:35:15 +01:00
|
|
|
INIT_CRC32C(crc);
|
|
|
|
COMP_CRC32C(crc, (char *) map, offsetof(RelMapFile, crc));
|
|
|
|
FIN_CRC32C(crc);
|
2010-02-07 21:48:13 +01:00
|
|
|
|
Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.
Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.
The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 10:35:15 +01:00
|
|
|
if (!EQ_CRC32C(crc, map->crc))
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
ereport(elevel,
|
2010-02-07 21:48:13 +01:00
|
|
|
(errmsg("relation mapping file \"%s\" contains incorrect checksum",
|
|
|
|
mapfilename)));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Write out a new shared or local map file with the given contents.
|
|
|
|
*
|
|
|
|
* The magic number and CRC are automatically updated in *newmap. On
|
|
|
|
* success, we copy the data to the appropriate permanent static variable.
|
|
|
|
*
|
2017-08-16 06:22:32 +02:00
|
|
|
* If write_wal is true then an appropriate WAL message is emitted.
|
2010-02-07 21:48:13 +01:00
|
|
|
* (It will be false for bootstrap and WAL replay cases.)
|
|
|
|
*
|
2017-08-16 06:22:32 +02:00
|
|
|
* If send_sinval is true then a SI invalidation message is sent.
|
2010-02-07 21:48:13 +01:00
|
|
|
* (This should be true except in bootstrap case.)
|
|
|
|
*
|
2017-08-16 06:22:32 +02:00
|
|
|
* If preserve_files is true then the storage manager is warned not to
|
2010-02-07 21:48:13 +01:00
|
|
|
* delete the files listed in the map.
|
|
|
|
*
|
|
|
|
* Because this may be called during WAL replay when MyDatabaseId,
|
|
|
|
* DatabasePath, etc aren't valid, we require the caller to pass in suitable
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
* values. Pass dbpath as "global" for the shared map.
|
|
|
|
*
|
|
|
|
* The caller is also responsible for being sure no concurrent map update
|
|
|
|
* could be happening.
|
2010-02-07 21:48:13 +01:00
|
|
|
*/
|
|
|
|
static void
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
|
|
|
|
bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
|
2010-02-07 21:48:13 +01:00
|
|
|
{
|
|
|
|
int fd;
|
|
|
|
char mapfilename[MAXPGPATH];
|
2022-07-26 20:56:25 +02:00
|
|
|
char maptempfilename[MAXPGPATH];
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Fill in the overhead fields and update CRC.
|
|
|
|
*/
|
|
|
|
newmap->magic = RELMAPPER_FILEMAGIC;
|
|
|
|
if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
|
|
|
|
elog(ERROR, "attempt to write bogus relation mapping");
|
|
|
|
|
Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.
Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.
The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 10:35:15 +01:00
|
|
|
INIT_CRC32C(newmap->crc);
|
|
|
|
COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
|
|
|
|
FIN_CRC32C(newmap->crc);
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
/*
|
2022-07-26 20:56:25 +02:00
|
|
|
* Construct filenames -- a temporary file that we'll create to write the
|
|
|
|
* data initially, and then the permanent name to which we will rename it.
|
2010-02-07 21:48:13 +01:00
|
|
|
*/
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
|
|
|
|
dbpath, RELMAPPER_FILENAME);
|
2022-07-26 20:56:25 +02:00
|
|
|
snprintf(maptempfilename, sizeof(maptempfilename), "%s/%s",
|
|
|
|
dbpath, RELMAPPER_TEMP_FILENAME);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Open a temporary file. If a file already exists with this name, it must
|
|
|
|
* be left over from a previous crash, so we can overwrite it. Concurrent
|
|
|
|
* calls to this function are not allowed.
|
|
|
|
*/
|
|
|
|
fd = OpenTransientFile(maptempfilename,
|
|
|
|
O_WRONLY | O_CREAT | O_TRUNC | PG_BINARY);
|
2010-02-07 21:48:13 +01:00
|
|
|
if (fd < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
Rework error messages around file handling
Some error messages related to file handling are using the code path
context to define their state. For example, 2PC-related errors are
referring to "two-phase status files", or "relation mapping file" is
used for catalog-to-filenode mapping, however those prove to be
difficult to translate, and are not more helpful than just referring to
the path of the file being worked on. So simplify all those error
messages by just referring to files with their path used. In some
cases, like the manipulation of WAL segments, the context is actually
helpful so those are kept.
Calls to the system function read() have also been rather inconsistent
with their error handling sometimes not reporting the number of bytes
read, and some other code paths trying to use an errno which has not
been set. The in-core functions are using a more consistent pattern
with this patch, which checks for both errno if set or if an
inconsistent read is happening.
So as to care about pluralization when reading an unexpected number of
byte(s), "could not read: read %d of %zu" is used as error message, with
%d field being the output result of read() and %zu the expected size.
This simplifies the work of translators with less variations of the same
message.
Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20180520000522.GB1603@paquier.xyz
2018-07-18 01:01:23 +02:00
|
|
|
errmsg("could not open file \"%s\": %m",
|
2022-07-26 20:56:25 +02:00
|
|
|
maptempfilename)));
|
|
|
|
|
|
|
|
/* Write new data to the file. */
|
|
|
|
pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_WRITE);
|
|
|
|
if (write(fd, newmap, sizeof(RelMapFile)) != sizeof(RelMapFile))
|
|
|
|
{
|
|
|
|
/* if write didn't set errno, assume problem is no disk space */
|
|
|
|
if (errno == 0)
|
|
|
|
errno = ENOSPC;
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not write file \"%s\": %m",
|
|
|
|
maptempfilename)));
|
|
|
|
}
|
|
|
|
pgstat_report_wait_end();
|
|
|
|
|
|
|
|
/* And close the file. */
|
|
|
|
if (CloseTransientFile(fd) != 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not close file \"%s\": %m",
|
|
|
|
maptempfilename)));
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
if (write_wal)
|
|
|
|
{
|
|
|
|
xl_relmap_update xlrec;
|
|
|
|
XLogRecPtr lsn;
|
|
|
|
|
|
|
|
/* now errors are fatal ... */
|
|
|
|
START_CRIT_SECTION();
|
|
|
|
|
|
|
|
xlrec.dbid = dbid;
|
|
|
|
xlrec.tsid = tsid;
|
|
|
|
xlrec.nbytes = sizeof(RelMapFile);
|
|
|
|
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
XLogBeginInsert();
|
|
|
|
XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
|
|
|
|
XLogRegisterData((char *) newmap, sizeof(RelMapFile));
|
2010-02-07 21:48:13 +01:00
|
|
|
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE);
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
/* As always, WAL must hit the disk before the data update does */
|
|
|
|
XLogFlush(lsn);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2022-07-26 20:56:25 +02:00
|
|
|
* durable_rename() does all the hard work of making sure that we rename
|
|
|
|
* the temporary file into place in a crash-safe manner.
|
|
|
|
*
|
|
|
|
* NB: Although we instruct durable_rename() to use ERROR, we will often
|
|
|
|
* be in a critical section at this point; if so, ERROR will become PANIC.
|
2010-02-07 21:48:13 +01:00
|
|
|
*/
|
2022-07-26 20:56:25 +02:00
|
|
|
pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_REPLACE);
|
|
|
|
durable_rename(maptempfilename, mapfilename, ERROR);
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_end();
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Now that the file is safely on disk, send sinval message to let other
|
|
|
|
* backends know to re-read it. We must do this inside the critical
|
|
|
|
* section: if for some reason we fail to send the message, we have to
|
|
|
|
* force a database-wide PANIC. Otherwise other backends might continue
|
|
|
|
* execution with stale mapping information, which would be catastrophic
|
|
|
|
* as soon as others began to use the now-committed data.
|
|
|
|
*/
|
|
|
|
if (send_sinval)
|
|
|
|
CacheInvalidateRelmap(dbid);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure that the files listed in the map are not deleted if the outer
|
|
|
|
* transaction aborts. This had better be within the critical section
|
|
|
|
* too: it's not likely to fail, but if it did, we'd arrive at transaction
|
|
|
|
* abort with the files still vulnerable. PANICing will leave things in a
|
|
|
|
* good state on-disk.
|
|
|
|
*
|
|
|
|
* Note: we're cheating a little bit here by assuming that mapped files
|
|
|
|
* are either in pg_global or the database's default tablespace.
|
|
|
|
*/
|
|
|
|
if (preserve_files)
|
|
|
|
{
|
|
|
|
int32 i;
|
|
|
|
|
|
|
|
for (i = 0; i < newmap->num_mappings; i++)
|
|
|
|
{
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
RelFileLocator rlocator;
|
2010-02-07 21:48:13 +01:00
|
|
|
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
rlocator.spcOid = tsid;
|
|
|
|
rlocator.dbOid = dbid;
|
|
|
|
rlocator.relNumber = newmap->mappings[i].mapfilenumber;
|
|
|
|
RelationPreserveStorage(rlocator, false);
|
2010-02-07 21:48:13 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Critical section done */
|
|
|
|
if (write_wal)
|
|
|
|
END_CRIT_SECTION();
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Merge the specified updates into the appropriate "real" map,
|
|
|
|
* and write out the changes. This function must be used for committing
|
|
|
|
* updates during normal multiuser operation.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
perform_relmap_update(bool shared, const RelMapFile *updates)
|
|
|
|
{
|
|
|
|
RelMapFile newmap;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Anyone updating a relation's mapping info should take exclusive lock on
|
|
|
|
* that rel and hold it until commit. This ensures that there will not be
|
|
|
|
* concurrent updates on the same mapping value; but there could easily be
|
|
|
|
* concurrent updates on different values in the same file. We cover that
|
|
|
|
* by acquiring the RelationMappingLock, re-reading the target file to
|
|
|
|
* ensure it's up to date, applying the updates, and writing the data
|
|
|
|
* before releasing RelationMappingLock.
|
|
|
|
*
|
|
|
|
* There is only one RelationMappingLock. In principle we could try to
|
|
|
|
* have one per mapping file, but it seems unlikely to be worth the
|
|
|
|
* trouble.
|
|
|
|
*/
|
|
|
|
LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
|
|
|
|
|
|
|
|
/* Be certain we see any other updates just made */
|
2021-06-24 09:45:23 +02:00
|
|
|
load_relmap_file(shared, true);
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
/* Prepare updated data in a local variable */
|
|
|
|
if (shared)
|
|
|
|
memcpy(&newmap, &shared_map, sizeof(RelMapFile));
|
|
|
|
else
|
|
|
|
memcpy(&newmap, &local_map, sizeof(RelMapFile));
|
|
|
|
|
2012-03-21 19:09:39 +01:00
|
|
|
/*
|
|
|
|
* Apply the updates to newmap. No new mappings should appear, unless
|
|
|
|
* somebody is adding indexes to system catalogs.
|
|
|
|
*/
|
|
|
|
merge_map_updates(&newmap, updates, allowSystemTableMods);
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
/* Write out the updated map and do other necessary tasks */
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
write_relmap_file(&newmap, true, true, true,
|
2010-02-07 21:48:13 +01:00
|
|
|
(shared ? InvalidOid : MyDatabaseId),
|
|
|
|
(shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
(shared ? "global" : DatabasePath));
|
|
|
|
|
|
|
|
/*
|
2022-04-11 10:49:41 +02:00
|
|
|
* We successfully wrote the updated file, so it's now safe to rely on the
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
* new values in this process, too.
|
|
|
|
*/
|
|
|
|
if (shared)
|
|
|
|
memcpy(&shared_map, &newmap, sizeof(RelMapFile));
|
|
|
|
else
|
|
|
|
memcpy(&local_map, &newmap, sizeof(RelMapFile));
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
/* Now we can release the lock */
|
|
|
|
LWLockRelease(RelationMappingLock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RELMAP resource manager's routines
|
|
|
|
*/
|
|
|
|
void
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
relmap_redo(XLogReaderState *record)
|
2010-02-07 21:48:13 +01:00
|
|
|
{
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
/* Backup blocks are not used in relmap records */
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
Assert(!XLogRecHasAnyBlockRefs(record));
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
if (info == XLOG_RELMAP_UPDATE)
|
|
|
|
{
|
|
|
|
xl_relmap_update *xlrec = (xl_relmap_update *) XLogRecGetData(record);
|
|
|
|
RelMapFile newmap;
|
|
|
|
char *dbpath;
|
|
|
|
|
|
|
|
if (xlrec->nbytes != sizeof(RelMapFile))
|
|
|
|
elog(PANIC, "relmap_redo: wrong size %u in relmap update record",
|
|
|
|
xlrec->nbytes);
|
|
|
|
memcpy(&newmap, xlrec->data, sizeof(newmap));
|
|
|
|
|
|
|
|
/* We need to construct the pathname for this database */
|
|
|
|
dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Write out the new map and send sinval, but of course don't write a
|
|
|
|
* new WAL entry. There's no surrounding transaction to tell to
|
|
|
|
* preserve files, either.
|
|
|
|
*
|
|
|
|
* There shouldn't be anyone else updating relmaps during WAL replay,
|
2021-06-24 10:19:03 +02:00
|
|
|
* but grab the lock to interlock against load_relmap_file().
|
Add new block-by-block strategy for CREATE DATABASE.
Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
Dilip Kumar, with a fairly large number of cosmetic changes by me.
Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor,
Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian,
Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro
Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.
Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
2022-03-29 17:31:43 +02:00
|
|
|
*
|
|
|
|
* Note that we use the same WAL record for updating the relmap of an
|
|
|
|
* existing database as we do for creating a new database. In the
|
|
|
|
* latter case, taking the relmap log and sending sinval messages is
|
|
|
|
* unnecessary, but harmless. If we wanted to avoid it, we could add a
|
2022-04-11 10:49:41 +02:00
|
|
|
* flag to the WAL record to indicate which operation is being
|
Add new block-by-block strategy for CREATE DATABASE.
Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
Dilip Kumar, with a fairly large number of cosmetic changes by me.
Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor,
Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian,
Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro
Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.
Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
2022-03-29 17:31:43 +02:00
|
|
|
* performed.
|
2010-02-07 21:48:13 +01:00
|
|
|
*/
|
2021-06-24 10:19:03 +02:00
|
|
|
LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
|
Refactor code for reading and writing relation map files.
Restructure things so that the functions which update the global
variables shared_map and local_map are separate from the functions
which just read and write relation map files without touching any
global variables.
In the new structure of things, write_relmap_file() writes a relmap
file but no longer performs global variable updates. A symmetric
function read_relmap_file() that just reads a file without changing
any global variables is added, and load_relmap_file(), which does
change the global variables, uses it as a subroutine.
Because write_relmap_file() no longer updates shared_map and
local_map, that logic is moved to perform_relmap_update(). However,
no similar logic is added to relmap_redo() even though it also calls
write_relmap_file(). That's because recovery must not rely on the
contents of the relation map, and therefore there is no need to
initialize it. In fact, doing so seems like a mistake, because we
might then manage to rely on the in-memory map where we shouldn't.
Patch by me, based on earlier work by Dilip Kumar. Reviewed by
Ashutosh Sharma.
Discussion: http://postgr.es/m/CA+TgmobQLgrt4AXsc0ru7aFFkzv=9fS-Q_yO69=k9WY67RCctg@mail.gmail.com
2022-03-17 18:21:07 +01:00
|
|
|
write_relmap_file(&newmap, false, true, false,
|
2010-02-07 21:48:13 +01:00
|
|
|
xlrec->dbid, xlrec->tsid, dbpath);
|
2021-06-24 10:19:03 +02:00
|
|
|
LWLockRelease(RelationMappingLock);
|
2010-02-07 21:48:13 +01:00
|
|
|
|
|
|
|
pfree(dbpath);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
elog(PANIC, "relmap_redo: unknown op code %u", info);
|
|
|
|
}
|