1996-07-09 08:22:35 +02:00
/*-------------------------------------------------------------------------
 *
1999-02-14 00:22:53 +01:00
 * md.c
1996-07-09 08:22:35 +02:00
 *	  This code manages relations that reside on magnetic disk.
 *
2013-03-30 19:23:45 +01:00
 * Or at least, that was what the Berkeley folk had in mind when they named
 * this file. In reality, what this code provides is an interface from
 * the smgr API to Unix-like filesystem APIs, so it will work with any type
 * of device for which the operating system provides filesystem support.
 * It doesn't matter whether the bits are on spinning rust or some other
 * storage technology.
 *
2023-01-02 21:00:37 +01:00
 * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
2000-01-26 06:58:53 +01:00
 * Portions Copyright (c) 1994, Regents of the University of California
1996-07-09 08:22:35 +02:00
 *
 *
 * IDENTIFICATION
2010-09-20 22:08:53 +02:00
 *	  src/backend/storage/smgr/md.c
1996-07-09 08:22:35 +02:00
 *
 *-------------------------------------------------------------------------
 */
2000-11-08 23:10:03 +01:00
#include "postgres.h"

1996-11-08 07:02:30 +01:00
#include <unistd.h>
1999-07-16 05:14:30 +02:00
#include <fcntl.h>
1996-07-09 08:22:35 +02:00
#include <sys/file.h>

2011-08-26 22:52:16 +02:00
#include "access/xlog.h"
2019-11-12 04:00:16 +01:00
#include "access/xlogutils.h"
2019-07-17 02:14:08 +02:00
#include "commands/tablespace.h"
2019-11-12 04:00:16 +01:00
#include "miscadmin.h"
#include "pg_trace.h"
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
#include "pgstat.h"
2004-05-31 05:48:10 +02:00
#include "postmaster/bgwriter.h"
2007-01-03 19:11:01 +01:00
#include "storage/bufmgr.h"
2019-11-12 04:00:16 +01:00
#include "storage/fd.h"
2019-04-04 10:56:03 +02:00
#include "storage/md.h"
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
#include "storage/relfilelocator.h"
1999-07-16 05:14:30 +02:00
#include "storage/smgr.h"
2019-04-04 10:56:03 +02:00
#include "storage/sync.h"
2004-05-31 05:48:10 +02:00
#include "utils/hsearch.h"
2000-07-17 05:05:41 +02:00
#include "utils/memutils.h"

1996-07-09 08:22:35 +02:00
/*
2023-05-19 10:52:04 +02:00
 *	The magnetic disk storage manager keeps track of open file
 *	descriptors in its own descriptor pool. This is done to make it
 *	easier to support relations that are larger than the operating
 *	system's file size limit (often 2GBytes). In order to do that,
 *	we break relations up into "segment" files that are each shorter than
 *	the OS file size limit. The segment size is set by the RELSEG_SIZE
 *	configuration constant in pg_config.h.
2006-11-20 02:07:56 +01:00
 *
2023-05-19 10:52:04 +02:00
 *	On disk, a relation must consist of consecutively numbered segment
 *	files in the pattern
 *		-- Zero or more full segments of exactly RELSEG_SIZE blocks each
 *		-- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
 *		-- Optionally, any number of inactive segments of size 0 blocks.
 *	The full and partial segments are collectively the "active" segments.
 *	Inactive segments are those that once contained data but are currently
 *	not needed because of an mdtruncate() operation. The reason for leaving
 *	them present at size zero, rather than unlinking them, is that other
 *	backends and/or the checkpointer might be holding open file references to
 *	such segments. If the relation expands again after mdtruncate(), such
 *	that a deactivated segment becomes active again, it is important that
 *	such file references still be valid --- else data might get written
 *	out to an unlinked old copy of a segment file that will eventually
 *	disappear.
1999-09-02 04:57:50 +02:00
 *
2023-05-19 10:52:04 +02:00
 *	File descriptors are stored in the per-fork md_seg_fds arrays inside
 *	SMgrRelation. The length of these arrays is stored in md_num_open_segs.
 *	Note that a fork's md_num_open_segs having a specific value does not
 *	necessarily mean the relation doesn't have additional segments; we may
 *	just not have opened the next segment yet. (We could not have "all
 *	segments are in the array" as an invariant anyway, since another backend
 *	could extend the relation while we aren't looking.) We do not have
 *	entries for inactive segments, however; as soon as we find a partial
 *	segment, we assume that any subsequent segments are inactive.
2004-05-31 05:48:10 +02:00
 *
2023-05-19 10:52:04 +02:00
 *	The entire MdfdVec array is palloc'd in the MdCxt memory context.
1996-07-09 08:22:35 +02:00
 */
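The segment layout described in the comment above implies a simple mapping from a relation block number to a segment file and a byte offset inside that file. The following is a minimal sketch of that arithmetic, not the code md.c itself uses. `DEMO_RELSEG_SIZE` and `DEMO_BLCKSZ` are stand-ins chosen for illustration (131072 blocks of 8192 bytes, i.e. 1 GB segments, assumed here to resemble a default build), and `demo_locate_block` and `DemoBlockLocation` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins; the real values come from pg_config.h. */
#define DEMO_RELSEG_SIZE 131072u	/* blocks per segment (assumed) */
#define DEMO_BLCKSZ      8192u		/* bytes per block (assumed) */

typedef struct DemoBlockLocation
{
	uint32_t	segno;			/* which segment file, counting from 0 */
	uint64_t	offset;			/* byte offset of the block in that file */
} DemoBlockLocation;

/* Map a relation-wide block number to (segment, offset within segment). */
static DemoBlockLocation
demo_locate_block(uint32_t blkno)
{
	DemoBlockLocation loc;

	loc.segno = blkno / DEMO_RELSEG_SIZE;
	loc.offset = (uint64_t) (blkno % DEMO_RELSEG_SIZE) * DEMO_BLCKSZ;

	/* a block always starts below the assumed per-segment size limit */
	assert(loc.offset < (uint64_t) DEMO_RELSEG_SIZE * DEMO_BLCKSZ);
	return loc;
}
```

With these assumed sizes, block 131071 is the last block of segment 0 and block 131072 is the first block of segment 1, at offset 0.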

typedef struct _MdfdVec
{
2004-02-10 02:55:27 +01:00
	File		mdfd_vfd;		/* fd number in fd.c's pool */
2004-05-31 05:48:10 +02:00
	BlockNumber mdfd_segno;		/* segment number, from 0 */
1996-07-09 08:22:35 +02:00
} MdfdVec;

2014-06-30 09:13:48 +02:00
static MemoryContext MdCxt;		/* context for all MdfdVec objects */
1996-07-09 08:22:35 +02:00

2000-10-16 16:52:28 +02:00

2019-04-04 10:56:03 +02:00
/* Populate a file tag describing an md.c segment file. */
2022-07-06 17:39:09 +02:00
#define INIT_MD_FILETAG(a,xx_rlocator,xx_forknum,xx_segno) \
2019-04-04 10:56:03 +02:00
( \
	memset(&(a), 0, sizeof(FileTag)), \
	(a).handler = SYNC_HANDLER_MD, \
2022-07-06 17:39:09 +02:00
	(a).rlocator = (xx_rlocator), \
2019-04-04 10:56:03 +02:00
	(a).forknum = (xx_forknum), \
	(a).segno = (xx_segno) \
)
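The macro above clears the whole FileTag with memset() before assigning its fields. One plausible reason, assumed here rather than stated in this file, is that a tag may later be compared or hashed as raw bytes, in which case any compiler-inserted padding needs a known value. The sketch below illustrates the effect with a hypothetical `DemoTag` struct and hypothetical helpers `demo_init_tag` and `demo_tags_equal`. It relies on the compiler leaving padding bytes alone when individual members are assigned, which is common behavior but not something the C standard promises.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag; padding is likely after the 2-byte field. */
typedef struct DemoTag
{
	int16_t		handler;
	uint32_t	relnumber;
	uint32_t	segno;
} DemoTag;

/* Zero everything first, then assign, so padding bytes start out as 0. */
static void
demo_init_tag(DemoTag *tag, int16_t handler, uint32_t relnumber, uint32_t segno)
{
	assert(tag != NULL);
	memset(tag, 0, sizeof(DemoTag));
	tag->handler = handler;
	tag->relnumber = relnumber;
	tag->segno = segno;
}

/* Bytewise equality, as a table keyed on the raw struct would see it. */
static int
demo_tags_equal(const DemoTag *a, const DemoTag *b)
{
	return memcmp(a, b, sizeof(DemoTag)) == 0;
}
```

Without the memset() step, two tags with identical fields could still differ in their padding bytes and fail a bytewise comparison.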
2007-04-12 19:10:55 +02:00

2004-05-31 05:48:10 +02:00

2016-05-04 10:54:20 +02:00
/*** behavior for mdopen & _mdfd_getseg ***/
/* ereport if segment not present */
#define EXTENSION_FAIL				(1 << 0)
/* return NULL if segment not present */
#define EXTENSION_RETURN_NULL		(1 << 1)
/* create new segments as needed */
#define EXTENSION_CREATE			(1 << 2)
/* create new segments if needed during recovery */
#define EXTENSION_CREATE_RECOVERY	(1 << 3)
/*
 * Allow opening segments which are preceded by segments smaller than
2019-01-10 01:36:25 +01:00
 * RELSEG_SIZE, e.g. inactive segments (see above). Note that this breaks
2016-05-04 10:54:20 +02:00
 * mdnblocks() and related functionality henceforth - which currently is ok,
 * because this is only required in the checkpointer which never uses
 * mdnblocks().
 */
#define EXTENSION_DONT_CHECK_SIZE	(1 << 4)
2022-05-07 06:19:42 +02:00
/* don't try to open a segment, if not already open */
#define EXTENSION_DONT_OPEN			(1 << 5)
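Each behavior value above is a distinct bit, so several can be OR-ed into one `int` and tested independently. The following sketch shows that pattern for the "segment is missing" decision. The `DEMO_` constants copy four of the definitions above purely for illustration, and `demo_missing_segment_action` is a hypothetical helper, not the logic of `_mdfd_getseg` itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative copies of four of the behavior bits defined above. */
#define DEMO_EXTENSION_FAIL				(1 << 0)
#define DEMO_EXTENSION_RETURN_NULL		(1 << 1)
#define DEMO_EXTENSION_CREATE			(1 << 2)
#define DEMO_EXTENSION_CREATE_RECOVERY	(1 << 3)
#define DEMO_EXTENSION_KNOWN \
	(DEMO_EXTENSION_FAIL | DEMO_EXTENSION_RETURN_NULL | \
	 DEMO_EXTENSION_CREATE | DEMO_EXTENSION_CREATE_RECOVERY)

typedef enum DemoAction
{
	DEMO_ACT_ERROR,				/* report an error */
	DEMO_ACT_NULL,				/* hand back NULL to the caller */
	DEMO_ACT_CREATE				/* create the missing segment */
} DemoAction;

/* Decide what to do about a missing segment, given OR-ed behavior bits. */
static DemoAction
demo_missing_segment_action(int behavior, bool in_recovery)
{
	/* this sketch only understands the four bits copied above */
	assert((behavior & ~DEMO_EXTENSION_KNOWN) == 0);

	if ((behavior & DEMO_EXTENSION_CREATE) ||
		(in_recovery && (behavior & DEMO_EXTENSION_CREATE_RECOVERY)))
		return DEMO_ACT_CREATE;
	if (behavior & DEMO_EXTENSION_RETURN_NULL)
		return DEMO_ACT_NULL;
	return DEMO_ACT_ERROR;
}
```

A caller in this sketch could pass `DEMO_EXTENSION_FAIL | DEMO_EXTENSION_CREATE_RECOVERY` to get an error in normal operation but segment creation during recovery.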
2016-05-04 10:54:20 +02:00

2007-01-03 19:11:01 +01:00

2004-05-31 05:48:10 +02:00
/* local routines */
2022-09-20 04:18:36 +02:00
static void mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum,
2012-07-20 01:28:22 +02:00
						 bool isRedo);
2019-07-17 02:14:08 +02:00
static MdfdVec *mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior);
2008-08-11 13:05:11 +02:00
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
								   MdfdVec *seg);
2022-07-06 17:39:09 +02:00
static void register_unlink_segment(RelFileLocatorBackend rlocator, ForkNumber forknum,
2019-04-04 10:56:03 +02:00
									BlockNumber segno);
2022-07-06 17:39:09 +02:00
static void register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
2019-04-04 10:56:03 +02:00
									BlockNumber segno);
2016-09-09 02:02:43 +02:00
static void _fdvec_resize(SMgrRelation reln,
						  ForkNumber forknum,
						  int nseg);
2009-08-05 20:01:54 +02:00
static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
						   BlockNumber segno);
2022-09-20 04:18:36 +02:00
static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forknum,
2008-08-11 13:05:11 +02:00
							  BlockNumber segno, int oflags);
2022-09-20 04:18:36 +02:00
static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
2016-05-04 10:54:20 +02:00
							 BlockNumber blkno, bool skipFsync, int behavior);
2008-08-11 13:05:11 +02:00
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
							  MdfdVec *seg);
1996-07-09 08:22:35 +02:00

2023-04-08 01:04:49 +02:00
static inline int
_mdfd_open_flags(void)
{
	int			flags = O_RDWR | PG_BINARY;

	if (io_direct_flags & IO_DIRECT_DATA)
		flags |= PG_O_DIRECT;

	return flags;
}
2004-02-10 02:55:27 +01:00

1996-07-09 08:22:35 +02:00
/*
2023-05-19 10:52:04 +02:00
 *	mdinit() -- Initialize private state for magnetic disk storage manager.
1996-07-09 08:22:35 +02:00
 */
2007-01-03 19:11:01 +01:00
void
2001-06-28 01:31:40 +02:00
mdinit(void)
1996-07-09 08:22:35 +02:00
{
2000-06-28 05:33:33 +02:00
	MdCxt = AllocSetContextCreate(TopMemoryContext,
								  "MdSmgr",
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
								  ALLOCSET_DEFAULT_SIZES);
2009-06-25 23:36:00 +02:00
}

2008-08-11 13:05:11 +02:00
/*
2023-05-19 10:52:04 +02:00
 *	mdexists() -- Does the physical file exist?
2008-08-11 13:05:11 +02:00
 *
 * Note: this will return true for lingering files, with pending deletions
 */
bool
2022-09-20 04:18:36 +02:00
mdexists(SMgrRelation reln, ForkNumber forknum)
2008-08-11 13:05:11 +02:00
{
	/*
	 * Close it first, to ensure that we notice if the fork has been unlinked
2022-04-07 09:28:40 +02:00
	 * since we opened it. As an optimization, we can skip that in recovery,
	 * which already closes relations when dropping them.
2008-08-11 13:05:11 +02:00
	 */
2022-04-07 09:28:40 +02:00
	if (!InRecovery)
2022-09-20 04:18:36 +02:00
		mdclose(reln, forknum);
2008-08-11 13:05:11 +02:00

2022-09-20 04:18:36 +02:00
	return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
2008-08-11 13:05:11 +02:00
}

2004-02-10 02:55:27 +01:00
/*
2023-05-19 10:52:04 +02:00
 *	mdcreate() -- Create a new relation on magnetic disk.
2004-02-10 02:55:27 +01:00
 *
 * If isRedo is true, it's okay for the relation to exist already.
 */
2007-01-03 19:11:01 +01:00
void
2022-09-20 04:18:36 +02:00
mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
1996-07-09 08:22:35 +02:00
{
2016-09-09 02:02:43 +02:00
	MdfdVec    *mdfd;
2000-11-08 23:10:03 +01:00
	char	   *path;
2004-02-10 02:55:27 +01:00
	File		fd;
1997-09-07 07:04:48 +02:00

2022-09-20 04:18:36 +02:00
	if (isRedo && reln->md_num_open_segs[forknum] > 0)
2007-01-03 19:11:01 +01:00
		return;					/* created and opened already... */
2004-02-11 23:55:26 +01:00

2022-09-20 04:18:36 +02:00
	Assert(reln->md_num_open_segs[forknum] == 0);
2000-06-20 01:37:08 +02:00

2019-07-17 02:14:08 +02:00
	/*
	 * We may be using the target table space for the first time in this
	 * database, so create a per-database subdirectory if needed.
	 *
	 * XXX this is a fairly ugly violation of module layering, but this seems
	 * to be the best place to put the check. Maybe TablespaceCreateDbspace
	 * should be here and not in commands/tablespace.c? But that would imply
	 * importing a lot of stuff that smgr.c oughtn't know, either.
	 */
2022-07-06 17:39:09 +02:00
	TablespaceCreateDbspace(reln->smgr_rlocator.locator.spcOid,
							reln->smgr_rlocator.locator.dbOid,
2019-07-17 02:14:08 +02:00
							isRedo);

2022-09-20 04:18:36 +02:00
	path = relpath(reln->smgr_rlocator, forknum);
1997-09-07 07:04:48 +02:00

2023-04-08 01:04:49 +02:00
	fd = PathNameOpenFile(path, _mdfd_open_flags() | O_CREAT | O_EXCL);
1997-09-07 07:04:48 +02:00

mdcreate():
fd = FileNameOpenFile(path, O_RDWR|O_CREAT|O_EXCL, 0600);
/*
* If the file already exists and is empty, we pretend that the
* create succeeded. During bootstrap processing, we skip that check,
* because pg_time, pg_variable, and pg_log get created before their
* .bki file entries are processed.
*
> * As the result of this pretence it was possible to have in
> * pg_class > 1 records with the same relname. Actually, it
> * should be fixed in upper levels, too, but... - vadim 05/06/97
> */
1997-05-06 04:03:20 +02:00
	if (fd < 0)
	{
2000-06-20 01:37:08 +02:00
		int			save_errno = errno;

2019-01-28 03:21:02 +01:00
		if (isRedo)
2023-04-08 01:04:49 +02:00
			fd = PathNameOpenFile(path, _mdfd_open_flags());
1997-05-06 04:03:20 +02:00
		if (fd < 0)
2000-06-20 01:37:08 +02:00
		{
2007-01-03 19:11:01 +01:00
			/* be sure to report the error reported by create, not open */
2000-06-20 01:37:08 +02:00
			errno = save_errno;
2007-01-03 19:11:01 +01:00
			ereport(ERROR,
					(errcode_for_file_access(),
2009-08-05 20:01:54 +02:00
					 errmsg("could not create file \"%s\": %m", path)));
2000-06-20 01:37:08 +02:00
		}
1996-07-09 08:22:35 +02:00
	}
2000-11-08 23:10:03 +01:00

	pfree(path);
1997-09-07 07:04:48 +02:00

2022-09-20 04:18:36 +02:00
	_fdvec_resize(reln, forknum, 1);
	mdfd = &reln->md_seg_fds[forknum][0];
2016-09-09 02:02:43 +02:00
	mdfd->mdfd_vfd = fd;
	mdfd->mdfd_segno = 0;
1996-07-09 08:22:35 +02:00
}
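The error handling in mdcreate() follows a recognizable pattern: attempt an exclusive create, and only when redo tolerates an existing file fall back to a plain open, while preserving the errno of the first attempt for the error report. Below is a minimal sketch of that pattern using plain POSIX open() rather than PostgreSQL's virtual file descriptor layer. `demo_create_or_open` is a hypothetical helper, and the 0600 mode is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns an open descriptor, or -1 with errno set to the error from the
 * create attempt (not from the fallback open).
 */
static int
demo_create_or_open(const char *path, bool is_redo)
{
	int			fd;

	assert(path != NULL);
	fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

	if (fd < 0)
	{
		int			save_errno = errno;

		/* during redo the file may legitimately exist already */
		if (is_redo)
			fd = open(path, O_RDWR);
		/* report the create failure, not whatever the fallback hit */
		if (fd < 0)
			errno = save_errno;
	}
	return fd;
}
```

In this sketch, calling the helper on an existing file with `is_redo` false fails with `EEXIST`, while the same call with `is_redo` true returns a usable descriptor.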

/*
2023-05-19 10:52:04 +02:00
 *	mdunlink() -- Unlink a relation.
2004-02-10 02:55:27 +01:00
 *
2022-07-06 17:39:09 +02:00
|
|
|
* Note that we're passed a RelFileLocatorBackend --- by the time this is called,
|
2004-02-10 02:55:27 +01:00
|
|
|
* there won't be an SMgrRelation hashtable entry anymore.
|
|
|
|
*
|
2022-09-20 04:18:36 +02:00
|
|
|
* forknum can be a fork number to delete a specific fork, or InvalidForkNumber
|
2012-07-19 19:07:33 +02:00
|
|
|
* to delete all forks.
|
|
|
|
*
|
Improve coding around the fsync request queue.
In all branches back to 8.3, this patch fixes a questionable assumption in
CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are
no uninitialized pad bytes in the request queue structs. This would only
cause trouble if (a) there were such pad bytes, which could happen in 8.4
and up if the compiler makes enum ForkNumber narrower than 32 bits, but
otherwise would require not-currently-planned changes in the widths of
other typedefs; and (b) the kernel has not uniformly initialized the
contents of shared memory to zeroes. Still, it seems a tad risky, and we
can easily remove any risk by pre-zeroing the request array for ourselves.
In addition to that, we need to establish a coding rule that struct
RelFileNode can't contain any padding bytes, since such structs are copied
into the request array verbatim. (There are other places that are assuming
this anyway, it turns out.)
In 9.1 and up, the risk was a bit larger because we were also effectively
assuming that struct RelFileNodeBackend contained no pad bytes, and with
fields of different types in there, that would be much easier to break.
However, there is no good reason to ever transmit fsync or delete requests
for temp files to the bgwriter/checkpointer, so we can revert the request
structs to plain RelFileNode, getting rid of the padding risk and saving
some marginal number of bytes and cycles in fsync queue manipulation while
we are at it. The savings might be more than marginal during deletion of
a temp relation, because the old code transmitted an entirely useless but
nonetheless expensive-to-process ForgetRelationFsync request to the
background process, and also had the background process perform the file
deletion even though that can safely be done immediately.
In addition, make some cleanup of nearby comments and small improvements to
the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.
2012-07-17 22:55:39 +02:00
|
|
|
* For regular relations, we don't unlink the first segment file of the rel,
|
|
|
|
* but just truncate it to zero length, and record a request to unlink it after
|
2007-11-15 21:36:40 +01:00
|
|
|
* the next checkpoint. Additional segments can be unlinked immediately,
|
2022-07-06 17:39:09 +02:00
|
|
|
* however. Leaving the empty file in place prevents that relfilenumber
|
|
|
|
* from being reused. The scenario this protects us from is:
|
2007-11-15 21:36:40 +01:00
|
|
|
* 1. We delete a relation (and commit, and actually remove its file).
|
2022-07-06 17:39:09 +02:00
|
|
|
* 2. We create a new relation, which by chance gets the same relfilenumber as
|
2007-11-15 21:36:40 +01:00
|
|
|
* the just-deleted one (OIDs must've wrapped around for that to happen).
|
|
|
|
* 3. We crash before another checkpoint occurs.
|
|
|
|
* During replay, we would delete the file and then recreate it, which is fine
|
|
|
|
* if the contents of the file were repopulated by subsequent WAL entries.
|
|
|
|
* But if we didn't WAL-log insertions, but instead relied on fsyncing the
|
Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.
To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.
Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node. Make relcache.c retain entries for certain
dropped relations until end of transaction.
Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.
Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas. Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem. Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.
Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-04-04 21:25:34 +02:00
|
|
|
* file after populating it (as we do at wal_level=minimal), the contents of
|
|
|
|
* the file would be lost forever. By leaving the empty file until after the
|
2022-07-06 17:39:09 +02:00
|
|
|
* next checkpoint, we prevent reassignment of the relfilenumber until it's
|
|
|
|
* safe, because relfilenumber assignment skips over any existing file.
|
2007-11-15 21:36:40 +01:00
|
|
|
*
|
2022-11-09 20:15:38 +01:00
|
|
|
* Additional segments, if any, are truncated and then unlinked. The reason
|
|
|
|
* for truncating is that other backends may still hold open FDs for these at
|
|
|
|
* the smgr level, so that the kernel can't remove the file yet. We want to
|
|
|
|
* reclaim the disk space right away despite that.
|
|
|
|
*
|
2012-07-17 22:55:39 +02:00
|
|
|
* We do not need to go through this dance for temp relations, though, because
|
|
|
|
* we never make WAL entries for temp rels, and so a temp rel poses no threat
|
2022-07-06 17:39:09 +02:00
|
|
|
* to the health of a regular rel that has taken over its relfilenumber.
|
2012-07-17 22:55:39 +02:00
|
|
|
* The fact that temp rels and regular rels have different file naming
|
2022-11-09 20:15:38 +01:00
|
|
|
* patterns provides additional safety. Other backends shouldn't have open
|
|
|
|
* FDs for them, either.
|
|
|
|
*
|
|
|
|
* We also don't do it while performing a binary upgrade. There is no reuse
|
|
|
|
* hazard in that case, since after a crash or even a simple ERROR, the
|
|
|
|
* upgrade fails and the whole cluster must be recreated from scratch.
|
|
|
|
* Furthermore, it is important to remove the files from disk immediately,
|
|
|
|
* because we may be about to reuse the same relfilenumber.
|
2012-07-17 22:55:39 +02:00
|
|
|
*
|
2011-12-20 21:00:36 +01:00
|
|
|
* All the above applies only to the relation's main fork; other forks can
|
|
|
|
* just be removed immediately, since they are not needed to prevent the
|
2022-07-06 17:39:09 +02:00
|
|
|
* relfilenumber from being recycled. Also, we do not carefully
|
2011-12-20 21:00:36 +01:00
|
|
|
* track whether other forks have been created or not, but just attempt to
|
|
|
|
* unlink them unconditionally; so we should never complain about ENOENT.
|
|
|
|
*
|
|
|
|
* If isRedo is true, it's unsurprising for the relation to be already gone.
|
2007-11-15 21:36:40 +01:00
|
|
|
* Also, we should remove the file immediately instead of queuing a request
|
|
|
|
* for later, since during redo there's no possibility of creating a
|
|
|
|
* conflicting relation.
|
|
|
|
*
|
2022-11-09 20:15:38 +01:00
|
|
|
* Note: we currently just never warn about ENOENT at all. We could warn in
|
|
|
|
* the main-fork, non-isRedo case, but it doesn't seem worth the trouble.
|
|
|
|
*
|
2007-11-15 21:36:40 +01:00
|
|
|
* Note: any failure should be reported as WARNING not ERROR, because
|
2007-01-03 19:11:01 +01:00
|
|
|
* we are usually not in a transaction anymore when this is called.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2007-01-03 19:11:01 +01:00
|
|
|
void
|
2022-09-20 04:18:36 +02:00
|
|
|
mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2012-07-19 19:07:33 +02:00
|
|
|
/* Unlink every fork if forknum is InvalidForkNumber, else just that fork */
|
2022-09-20 04:18:36 +02:00
|
|
|
if (forknum == InvalidForkNumber)
|
2012-07-19 19:07:33 +02:00
|
|
|
{
|
2022-09-20 04:18:36 +02:00
|
|
|
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
|
|
|
|
mdunlinkfork(rlocator, forknum, isRedo);
|
2012-07-19 19:07:33 +02:00
|
|
|
}
|
|
|
|
else
|
2022-09-20 04:18:36 +02:00
|
|
|
mdunlinkfork(rlocator, forknum, isRedo);
|
2012-07-19 19:07:33 +02:00
|
|
|
}
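
The handling of additional segments described in the mdunlink() comment (truncate each one to release its space, then unlink it, and stop at the first missing segment) can be sketched outside the server with plain POSIX calls. This is only an illustration, not the md.c implementation: the name sketch_unlink_extra_segments, the fixed-size path buffer, and the use of stderr instead of ereport() are assumptions made for the example, and the fsync-request bookkeeping is left out entirely.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for truncate() under strict C modes */
#endif

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of md.c): remove "path.1", "path.2", ...
 * Each segment is truncated first so its disk space is released even if
 * another process still holds it open, then unlinked.  The loop ends at
 * the first segment that does not exist, or after any unlink() failure.
 * Returns the number of segment files removed.
 */
static unsigned
sketch_unlink_extra_segments(const char *path)
{
	char		segpath[1024];
	unsigned	segno;
	unsigned	removed = 0;

	for (segno = 1;; segno++)
	{
		int			n = snprintf(segpath, sizeof(segpath), "%s.%u", path, segno);

		if (n < 0 || (size_t) n >= sizeof(segpath))
			break;				/* path too long for this sketch */

		/* A missing file ends the loop; other truncate errors do not. */
		if (truncate(segpath, 0) < 0 && errno == ENOENT)
			break;

		if (unlink(segpath) < 0)
		{
			if (errno != ENOENT)
				fprintf(stderr, "WARNING: could not remove file \"%s\"\n",
						segpath);
			break;				/* stop after any unlink error */
		}
		removed++;
	}
	return removed;
}
```

Calling sketch_unlink_extra_segments("base") with base.1, base.2 and base.4 present removes the first two and leaves base.4 alone, because the gap at base.3 ends the loop.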
|
|
|
|
|
2020-12-01 01:21:03 +01:00
|
|
|
/*
|
|
|
|
* Truncate a file to release disk space.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
do_truncate(const char *path)
|
|
|
|
{
|
|
|
|
int save_errno;
|
|
|
|
int ret;
|
|
|
|
|
2020-12-01 03:34:57 +01:00
|
|
|
ret = pg_truncate(path, 0);
|
2020-12-01 01:21:03 +01:00
|
|
|
|
|
|
|
/* Log a warning here to avoid repetition in callers. */
|
|
|
|
if (ret < 0 && errno != ENOENT)
|
|
|
|
{
|
|
|
|
save_errno = errno;
|
|
|
|
ereport(WARNING,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not truncate file \"%s\": %m", path)));
|
|
|
|
errno = save_errno;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
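
The errno handling in do_truncate() is easy to miss: the warning is reported inside the function, yet callers still test errno for ENOENT afterwards, so errno is saved before reporting and restored after. A minimal standalone sketch of that pattern follows. It assumes POSIX truncate() in place of pg_truncate() and fprintf() in place of ereport(), and the name sketch_truncate is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for truncate() under strict C modes */
#endif

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for do_truncate(): truncate the file to zero
 * length, warn about any failure other than ENOENT, and leave errno as
 * truncate() set it so the caller can still distinguish a missing file.
 */
static int
sketch_truncate(const char *path)
{
	int			ret = truncate(path, 0);

	if (ret < 0 && errno != ENOENT)
	{
		int			save_errno = errno;

		/* Reporting may clobber errno, so restore it afterwards. */
		fprintf(stderr, "WARNING: could not truncate file \"%s\": %s\n",
				path, strerror(save_errno));
		errno = save_errno;
	}

	return ret;
}
```

A caller can then write `if (sketch_truncate(path) >= 0 || errno != ENOENT)` to skip follow-up work only when the file was already gone, which mirrors the checks in mdunlinkfork().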
|
|
|
|
|
2012-07-19 19:07:33 +02:00
|
|
|
static void
|
2022-09-20 04:18:36 +02:00
|
|
|
mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
|
2012-07-19 19:07:33 +02:00
|
|
|
{
|
|
|
|
char *path;
|
|
|
|
int ret;
|
2022-11-09 20:15:38 +01:00
|
|
|
int save_errno;
|
2012-07-19 19:07:33 +02:00
|
|
|
|
2022-09-20 04:18:36 +02:00
|
|
|
path = relpath(rlocator, forknum);
|
1999-09-02 04:57:50 +02:00
|
|
|
|
2007-11-15 21:36:40 +01:00
|
|
|
/*
|
2022-11-09 20:15:38 +01:00
|
|
|
* Truncate and then unlink the first segment, or just register a request
|
|
|
|
* to unlink it later, as described in the comments for mdunlink().
|
2007-11-15 21:36:40 +01:00
|
|
|
*/
|
2022-11-09 20:15:38 +01:00
|
|
|
if (isRedo || IsBinaryUpgrade || forknum != MAIN_FORKNUM ||
|
|
|
|
RelFileLocatorBackendIsTemp(rlocator))
|
2009-08-05 20:01:54 +02:00
|
|
|
{
|
2022-07-06 17:39:09 +02:00
|
|
|
if (!RelFileLocatorBackendIsTemp(rlocator))
|
2020-12-01 01:21:03 +01:00
|
|
|
{
|
|
|
|
/* Prevent other backends' fds from holding on to the disk space */
|
|
|
|
ret = do_truncate(path);
|
|
|
|
|
|
|
|
/* Forget any pending sync requests for the first segment */
|
2022-11-07 17:36:45 +01:00
|
|
|
save_errno = errno;
|
2022-09-20 04:18:36 +02:00
|
|
|
register_forget_request(rlocator, forknum, 0 /* first seg */ );
|
2022-11-07 17:36:45 +01:00
|
|
|
errno = save_errno;
|
2020-12-01 01:21:03 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
ret = 0;
|
2019-04-04 10:56:03 +02:00
|
|
|
|
2020-12-01 01:21:03 +01:00
|
|
|
/* Next unlink the file, unless it was already found to be missing */
|
2022-11-09 20:15:38 +01:00
|
|
|
if (ret >= 0 || errno != ENOENT)
|
2020-12-01 01:21:03 +01:00
|
|
|
{
|
|
|
|
ret = unlink(path);
|
|
|
|
if (ret < 0 && errno != ENOENT)
|
2022-11-09 20:15:38 +01:00
|
|
|
{
|
|
|
|
save_errno = errno;
|
2020-12-01 01:21:03 +01:00
|
|
|
ereport(WARNING,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not remove file \"%s\": %m", path)));
|
2022-11-09 20:15:38 +01:00
|
|
|
errno = save_errno;
|
|
|
|
}
|
2020-12-01 01:21:03 +01:00
|
|
|
}
|
2009-08-05 20:01:54 +02:00
|
|
|
}
|
2007-11-15 21:36:40 +01:00
|
|
|
else
|
2007-11-15 22:49:47 +01:00
|
|
|
{
|
2020-12-01 01:21:03 +01:00
|
|
|
/* Prevent other backends' fds from holding on to the disk space */
|
|
|
|
ret = do_truncate(path);
|
2011-12-20 21:00:36 +01:00
|
|
|
|
2022-11-09 20:15:38 +01:00
|
|
|
/* Register request to unlink first segment later */
|
|
|
|
save_errno = errno;
|
|
|
|
register_unlink_segment(rlocator, forknum, 0 /* first seg */ );
|
|
|
|
errno = save_errno;
|
2000-11-08 23:10:03 +01:00
|
|
|
}
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2008-05-02 03:08:27 +02:00
|
|
|
/*
|
2022-11-09 20:15:38 +01:00
|
|
|
* Delete any additional segments.
|
|
|
|
*
|
|
|
|
* Note that because we loop until getting ENOENT, we will correctly
|
|
|
|
* remove all inactive segments as well as active ones. Ideally we'd
|
|
|
|
* continue the loop until getting exactly that errno, but that risks an
|
|
|
|
* infinite loop if the problem is directory-wide (for instance, if we
|
|
|
|
* suddenly can't read the data directory itself). We compromise by
|
|
|
|
* continuing after a non-ENOENT truncate error, but stopping after any
|
|
|
|
* unlink error. If there is indeed a directory-wide problem, additional
|
|
|
|
* unlink attempts wouldn't work anyway.
|
2008-05-02 03:08:27 +02:00
|
|
|
*/
|
2022-11-09 20:15:38 +01:00
|
|
|
if (ret >= 0 || errno != ENOENT)
|
1997-09-07 07:04:48 +02:00
|
|
|
{
|
2000-11-08 23:10:03 +01:00
|
|
|
char *segpath = (char *) palloc(strlen(path) + 12);
|
2022-11-09 20:15:38 +01:00
|
|
|
BlockNumber segno;
|
2000-04-12 19:17:23 +02:00
|
|
|
|
2022-11-09 20:15:38 +01:00
|
|
|
for (segno = 1;; segno++)
|
2000-11-08 23:10:03 +01:00
|
|
|
{
|
2022-11-09 20:15:38 +01:00
|
|
|
sprintf(segpath, "%s.%u", path, segno);
|
2020-12-01 01:21:03 +01:00
|
|
|
|
2022-07-06 17:39:09 +02:00
|
|
|
if (!RelFileLocatorBackendIsTemp(rlocator))
|
2020-12-01 01:21:03 +01:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Prevent other backends' fds from holding on to the disk
|
2022-11-09 20:15:38 +01:00
|
|
|
* space. We're done if we see ENOENT, though.
|
2020-12-01 01:21:03 +01:00
|
|
|
*/
|
|
|
|
if (do_truncate(segpath) < 0 && errno == ENOENT)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Forget any pending sync requests for this segment before we
|
|
|
|
* try to unlink.
|
|
|
|
*/
|
2022-09-20 04:18:36 +02:00
|
|
|
register_forget_request(rlocator, forknum, segno);
|
2020-12-01 01:21:03 +01:00
|
|
|
}
|
2019-04-04 10:56:03 +02:00
|
|
|
|
2000-11-08 23:10:03 +01:00
|
|
|
if (unlink(segpath) < 0)
|
|
|
|
{
|
|
|
|
/* ENOENT is expected after the last segment... */
|
|
|
|
if (errno != ENOENT)
|
2007-01-03 19:11:01 +01:00
|
|
|
ereport(WARNING,
|
|
|
|
(errcode_for_file_access(),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not remove file \"%s\": %m", segpath)));
|
2000-11-08 23:10:03 +01:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
pfree(segpath);
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2000-11-08 23:10:03 +01:00
|
|
|
pfree(path);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
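The segment loop above builds each name as `path.segno` in a buffer of `strlen(path) + 12` bytes. A minimal sketch of why 12 extra bytes appear to suffice, assuming a 32-bit `unsigned int`: one byte for the dot, at most ten decimal digits, and the terminating NUL. The helper name `sketch_segment_path` is hypothetical, and it uses `malloc` and `snprintf` where the real code uses `palloc` and `sprintf`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of md.c: build "path.segno" for a
 * non-first segment.  Returns a malloc'd string that the caller frees,
 * or NULL if allocation fails.
 */
static char *
sketch_segment_path(const char *path, unsigned int segno)
{
	/* '.' (1) + up to 10 digits of a 32-bit value + '\0' (1) = 12 bytes */
	size_t		len = strlen(path) + 12;
	char	   *segpath = malloc(len);

	if (segpath == NULL)
		return NULL;

	/* snprintf bounds the write, so an undersized buffer would truncate */
	snprintf(segpath, len, "%s.%u", path, segno);
	return segpath;
}
```

With the largest 32-bit segment number the result fills the buffer exactly, so the margin is tight but sufficient under the stated assumption.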
|
|
|
|
|
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdextend() -- Add a block to the specified relation.
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
2023-05-19 10:52:04 +02:00
|
|
|
* The semantics are nearly the same as mdwrite(): write at the
|
|
|
|
* specified position. However, this is to be used for the case of
|
|
|
|
* extending a relation (i.e., blocknum is at or beyond the current
|
|
|
|
* EOF). Note that we assume writing a block beyond current EOF
|
|
|
|
* causes intervening file space to become filled with zeroes.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2007-01-03 19:11:01 +01:00
|
|
|
void
|
2008-08-11 13:05:11 +02:00
|
|
|
mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
|
2023-02-27 07:45:44 +01:00
|
|
|
const void *buffer, bool skipFsync)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2008-03-10 21:06:27 +01:00
|
|
|
off_t seekpos;
|
2001-05-10 22:38:49 +02:00
|
|
|
int nbytes;
|
1996-07-09 08:22:35 +02:00
|
|
|
MdfdVec *v;
|
|
|
|
|
2023-04-08 00:38:09 +02:00
|
|
|
/* If this build supports direct I/O, the buffer must be I/O aligned. */
|
|
|
|
if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
|
|
|
|
Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
|
|
|
|
|
2007-01-03 19:11:01 +01:00
|
|
|
/* This assert is too expensive to have on normally ... */
|
|
|
|
#ifdef CHECK_WRITE_VS_EXTEND
|
2008-08-11 13:05:11 +02:00
|
|
|
Assert(blocknum >= mdnblocks(reln, forknum));
|
2007-01-03 19:11:01 +01:00
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
|
|
|
|
* more --- we mustn't create a block whose number actually is
|
2021-09-09 17:45:48 +02:00
|
|
|
* InvalidBlockNumber. (Note that this failure should be unreachable
|
|
|
|
* because of upstream checks in bufmgr.c.)
|
2007-01-03 19:11:01 +01:00
|
|
|
*/
|
|
|
|
if (blocknum == InvalidBlockNumber)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("cannot extend file \"%s\" beyond %u blocks",
|
2022-07-06 17:39:09 +02:00
|
|
|
relpath(reln->smgr_rlocator, forknum),
|
2007-01-03 19:11:01 +01:00
|
|
|
InvalidBlockNumber)));
|
|
|
|
|
2010-08-13 22:10:54 +02:00
|
|
|
v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2008-03-10 21:06:27 +01:00
|
|
|
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
|
2009-06-11 16:49:15 +02:00
|
|
|
|
2008-03-10 21:06:27 +01:00
|
|
|
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2018-11-06 21:51:50 +01:00
|
|
|
if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
|
1999-10-06 08:38:04 +02:00
|
|
|
{
|
2007-01-03 19:11:01 +01:00
|
|
|
if (nbytes < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not extend file \"%s\": %m",
|
|
|
|
FilePathName(v->mdfd_vfd)),
|
2007-01-03 19:11:01 +01:00
|
|
|
errhint("Check free disk space.")));
|
|
|
|
/* short write: complain appropriately */
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_DISK_FULL),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not extend file \"%s\": wrote only %d of %d bytes at block %u",
|
|
|
|
FilePathName(v->mdfd_vfd),
|
2007-01-03 19:11:01 +01:00
|
|
|
nbytes, BLCKSZ, blocknum),
|
|
|
|
errhint("Check free disk space.")));
|
1999-10-06 08:38:04 +02:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2010-08-13 22:10:54 +02:00
|
|
|
if (!skipFsync && !SmgrIsTemp(reln))
|
2008-08-11 13:05:11 +02:00
|
|
|
register_dirty_segment(reln, forknum, v);
|
2004-05-31 05:48:10 +02:00
|
|
|
|
2008-08-11 13:05:11 +02:00
|
|
|
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
|
1996-07-09 08:22:35 +02:00
|
|
|
}
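The `seekpos` arithmetic in mdextend() can be sketched in isolation. The example below assumes the default build values of 8192-byte blocks and 131072 blocks per segment; the `SKETCH_` names and the `sketch_locate_block` helper are hypothetical. The segment number computed by division is an assumption about what `_mdfd_getseg()` does, since that function is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults; real builds take these from configure. */
#define SKETCH_BLCKSZ      8192u	/* bytes per block */
#define SKETCH_RELSEG_SIZE 131072u	/* blocks per segment file */

typedef struct SketchBlockLocation
{
	uint32_t	segno;		/* 0 => "path", n => "path.n" */
	int64_t		seekpos;	/* byte offset of the block in that segment */
} SketchBlockLocation;

/* Hypothetical helper: map a relation-wide block number to a file position. */
static SketchBlockLocation
sketch_locate_block(uint32_t blocknum)
{
	SketchBlockLocation loc;

	loc.segno = blocknum / SKETCH_RELSEG_SIZE;

	/* Widen before multiplying so the product cannot overflow 32 bits. */
	loc.seekpos = (int64_t) SKETCH_BLCKSZ * (blocknum % SKETCH_RELSEG_SIZE);

	/* Same invariant as the Assert in mdextend(). */
	assert(loc.seekpos < (int64_t) SKETCH_BLCKSZ * SKETCH_RELSEG_SIZE);
	return loc;
}
```

The cast before the multiplication mirrors the `(off_t) BLCKSZ * ...` form used above: without it, the offset of the last block in a segment would not fit in 32 bits on builds with larger segments.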
|
|
|
|
|
2023-04-05 19:06:39 +02:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdzeroextend() -- Add new zeroed out blocks to the specified relation.
|
2023-04-05 19:06:39 +02:00
|
|
|
*
|
2023-05-19 10:52:04 +02:00
|
|
|
* Similar to mdextend(), except the relation can be extended by multiple
|
|
|
|
* blocks at once and the added blocks will be filled with zeroes.
|
2023-04-05 19:06:39 +02:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
mdzeroextend(SMgrRelation reln, ForkNumber forknum,
|
|
|
|
BlockNumber blocknum, int nblocks, bool skipFsync)
|
|
|
|
{
|
|
|
|
MdfdVec *v;
|
|
|
|
BlockNumber curblocknum = blocknum;
|
|
|
|
int remblocks = nblocks;
|
|
|
|
|
|
|
|
Assert(nblocks > 0);
|
|
|
|
|
|
|
|
/* This assert is too expensive to have on normally ... */
|
|
|
|
#ifdef CHECK_WRITE_VS_EXTEND
|
|
|
|
Assert(blocknum >= mdnblocks(reln, forknum));
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
|
|
|
|
* more --- we mustn't create a block whose number actually is
|
|
|
|
* InvalidBlockNumber or larger.
|
|
|
|
*/
|
|
|
|
if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
|
|
|
|
errmsg("cannot extend file \"%s\" beyond %u blocks",
|
|
|
|
relpath(reln->smgr_rlocator, forknum),
|
|
|
|
InvalidBlockNumber)));
|
|
|
|
|
|
|
|
while (remblocks > 0)
|
|
|
|
{
|
2023-05-19 23:24:48 +02:00
|
|
|
BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
|
2023-04-05 19:06:39 +02:00
|
|
|
off_t seekpos = (off_t) BLCKSZ * segstartblock;
|
|
|
|
int numblocks;
|
|
|
|
|
|
|
|
if (segstartblock + remblocks > RELSEG_SIZE)
|
|
|
|
numblocks = RELSEG_SIZE - segstartblock;
|
|
|
|
else
|
|
|
|
numblocks = remblocks;
|
|
|
|
|
|
|
|
v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
|
|
|
|
|
|
|
|
Assert(segstartblock < RELSEG_SIZE);
|
|
|
|
Assert(segstartblock + numblocks <= RELSEG_SIZE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If available and useful, use posix_fallocate() (via FileFallocate())
|
|
|
|
* to extend the relation. That's often more efficient than using
|
|
|
|
* write(), as it commonly won't cause the kernel to allocate page
|
|
|
|
* cache space for the extended pages.
|
|
|
|
*
|
|
|
|
* However, we don't use FileFallocate() for small extensions, as it
|
|
|
|
* defeats delayed allocation on some filesystems. Not clear where
|
|
|
|
* that decision should be made though? For now just use a cutoff of
|
|
|
|
* 8; anything between 4 and 8 worked OK in some local testing.
|
|
|
|
*/
|
|
|
|
if (numblocks > 8)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = FileFallocate(v->mdfd_vfd,
|
|
|
|
seekpos, (off_t) BLCKSZ * numblocks,
|
|
|
|
WAIT_EVENT_DATA_FILE_EXTEND);
|
|
|
|
if (ret != 0)
|
|
|
|
{
|
|
|
|
ereport(ERROR,
|
|
|
|
errcode_for_file_access(),
|
|
|
|
errmsg("could not extend file \"%s\" with FileFallocate(): %m",
|
|
|
|
FilePathName(v->mdfd_vfd)),
|
|
|
|
errhint("Check free disk space."));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Even if we don't want to use fallocate, we can still extend a
|
|
|
|
* bit more efficiently than writing each 8kB block individually.
|
2023-05-19 23:24:48 +02:00
|
|
|
* pg_pwrite_zeros() (via FileZero()) uses pg_pwritev_with_retry()
|
|
|
|
* to avoid multiple writes or needing a zeroed buffer for the
|
|
|
|
* whole length of the extension.
|
2023-04-05 19:06:39 +02:00
|
|
|
*/
|
|
|
|
ret = FileZero(v->mdfd_vfd,
|
|
|
|
seekpos, (off_t) BLCKSZ * numblocks,
|
|
|
|
WAIT_EVENT_DATA_FILE_EXTEND);
|
|
|
|
if (ret < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
errcode_for_file_access(),
|
|
|
|
errmsg("could not extend file \"%s\": %m",
|
|
|
|
FilePathName(v->mdfd_vfd)),
|
|
|
|
errhint("Check free disk space."));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!skipFsync && !SmgrIsTemp(reln))
|
|
|
|
register_dirty_segment(reln, forknum, v);
|
|
|
|
|
|
|
|
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
|
|
|
|
|
|
|
|
remblocks -= numblocks;
|
|
|
|
curblocknum += numblocks;
|
|
|
|
}
|
|
|
|
}
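The loop in mdzeroextend() splits one request into per-segment chunks and picks a strategy for each. A minimal sketch of just that planning step, with no I/O, assuming the default of 131072 blocks per segment; the `Sketch` types and `sketch_plan_zeroextend` are hypothetical names, and the cutoff of 8 is taken from the `numblocks > 8` test above.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_RELSEG_SIZE      131072u	/* blocks per segment (assumed default) */
#define SKETCH_FALLOCATE_CUTOFF 8		/* from the "numblocks > 8" test */

typedef struct SketchChunk
{
	uint32_t	startblock;		/* first block of the chunk, relation-wide */
	int			numblocks;		/* blocks in the chunk, all in one segment */
	int			use_fallocate;	/* nonzero if the chunk exceeds the cutoff */
} SketchChunk;

/*
 * Split [blocknum, blocknum + nblocks) into per-segment chunks.  Fills at
 * most maxchunks entries and returns how many were filled; planning stops
 * early if the array is too small.
 */
static int
sketch_plan_zeroextend(uint32_t blocknum, int nblocks,
					   SketchChunk *chunks, int maxchunks)
{
	uint32_t	curblocknum = blocknum;
	int			remblocks = nblocks;
	int			nchunks = 0;

	assert(nblocks > 0);

	while (remblocks > 0 && nchunks < maxchunks)
	{
		uint32_t	segstartblock = curblocknum % SKETCH_RELSEG_SIZE;
		int			numblocks;

		/* Never let one chunk run past the end of its segment. */
		if (segstartblock + (uint32_t) remblocks > SKETCH_RELSEG_SIZE)
			numblocks = (int) (SKETCH_RELSEG_SIZE - segstartblock);
		else
			numblocks = remblocks;

		chunks[nchunks].startblock = curblocknum;
		chunks[nchunks].numblocks = numblocks;
		chunks[nchunks].use_fallocate = (numblocks > SKETCH_FALLOCATE_CUTOFF);
		nchunks++;

		remblocks -= numblocks;
		curblocknum += (uint32_t) numblocks;
	}

	return nchunks;
}
```

For example, extending by 12 blocks starting two blocks before a segment boundary yields a 2-block chunk that would be zero-written, followed by a 10-block chunk in the next segment that would go through the fallocate path.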
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdopenfork() -- Open one fork of the specified relation.
|
2004-02-10 02:55:27 +01:00
|
|
|
*
|
|
|
|
* Note we only open the first segment, when there are multiple segments.
|
2007-01-03 19:11:01 +01:00
|
|
|
*
|
|
|
|
* If the first segment is not present, either ereport or return NULL according
|
|
|
|
* to "behavior". We treat EXTENSION_CREATE the same as EXTENSION_FAIL;
|
|
|
|
* EXTENSION_CREATE means it's OK to extend an existing relation, not to
|
|
|
|
* invent one out of whole cloth.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2004-02-10 02:55:27 +01:00
|
|
|
static MdfdVec *
|
2019-07-17 02:14:08 +02:00
|
|
|
mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2004-05-31 05:48:10 +02:00
|
|
|
MdfdVec *mdfd;
|
1996-07-09 08:22:35 +02:00
|
|
|
char *path;
|
2004-02-10 02:55:27 +01:00
|
|
|
File fd;
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2004-02-10 02:55:27 +01:00
|
|
|
/* No work if already open */
|
2016-09-09 02:02:43 +02:00
|
|
|
if (reln->md_num_open_segs[forknum] > 0)
|
|
|
|
return &reln->md_seg_fds[forknum][0];
|
2000-11-08 23:10:03 +01:00
|
|
|
|
2022-07-06 17:39:09 +02:00
|
|
|
path = relpath(reln->smgr_rlocator, forknum);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2023-04-08 01:04:49 +02:00
|
|
|
fd = PathNameOpenFile(path, _mdfd_open_flags());
|
2000-11-08 23:10:03 +01:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
if (fd < 0)
|
1999-09-06 01:24:53 +02:00
|
|
|
{
|
2019-01-28 03:21:02 +01:00
|
|
|
if ((behavior & EXTENSION_RETURN_NULL) &&
|
|
|
|
FILE_POSSIBLY_DELETED(errno))
|
1999-09-06 01:24:53 +02:00
|
|
|
{
|
2019-01-28 03:21:02 +01:00
|
|
|
pfree(path);
|
|
|
|
return NULL;
|
1999-09-06 01:24:53 +02:00
|
|
|
}
|
2019-01-28 03:21:02 +01:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not open file \"%s\": %m", path)));
|
1999-09-06 01:24:53 +02:00
|
|
|
}
|
2000-11-08 23:10:03 +01:00
|
|
|
|
|
|
|
pfree(path);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
_fdvec_resize(reln, forknum, 1);
|
|
|
|
mdfd = &reln->md_seg_fds[forknum][0];
|
2004-05-31 05:48:10 +02:00
|
|
|
mdfd->mdfd_vfd = fd;
|
|
|
|
mdfd->mdfd_segno = 0;
|
2016-09-09 02:02:43 +02:00
|
|
|
|
2008-08-11 13:05:11 +02:00
|
|
|
Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2004-05-31 05:48:10 +02:00
|
|
|
return mdfd;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
2019-07-17 02:14:08 +02:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdopen() -- Initialize newly-opened relation.
|
2019-07-17 02:14:08 +02:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
mdopen(SMgrRelation reln)
|
|
|
|
{
|
|
|
|
/* mark it not open */
|
|
|
|
for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
|
|
|
|
reln->md_num_open_segs[forknum] = 0;
|
|
|
|
}
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdclose() -- Close the specified relation, if it isn't closed already.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2007-01-03 19:11:01 +01:00
|
|
|
void
|
2008-08-11 13:05:11 +02:00
|
|
|
mdclose(SMgrRelation reln, ForkNumber forknum)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2016-09-09 02:02:43 +02:00
|
|
|
int nopensegs = reln->md_num_open_segs[forknum];
|
2000-04-09 06:43:20 +02:00
|
|
|
|
2004-02-10 02:55:27 +01:00
|
|
|
/* No work if already closed */
|
2016-09-09 02:02:43 +02:00
|
|
|
if (nopensegs == 0)
|
2007-01-03 19:11:01 +01:00
|
|
|
return;
|
2000-04-09 06:43:20 +02:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
/* close segments starting from the end */
|
|
|
|
while (nopensegs > 0)
|
1997-05-22 19:08:35 +02:00
|
|
|
{
|
2016-09-09 02:02:43 +02:00
|
|
|
MdfdVec *v = &reln->md_seg_fds[forknum][nopensegs - 1];
|
1999-09-02 04:57:50 +02:00
|
|
|
|
2020-01-11 03:31:22 +01:00
|
|
|
FileClose(v->mdfd_vfd);
|
|
|
|
_fdvec_resize(reln, forknum, nopensegs - 1);
|
2016-09-09 02:02:43 +02:00
|
|
|
nopensegs--;
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
2009-01-12 06:10:45 +01:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdprefetch() -- Initiate asynchronous read of the specified block of a relation.
|
2009-01-12 06:10:45 +01:00
|
|
|
*/
|
2020-04-08 03:36:45 +02:00
|
|
|
bool
|
2009-01-12 06:10:45 +01:00
|
|
|
mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
|
|
|
|
{
|
|
|
|
#ifdef USE_PREFETCH
|
|
|
|
off_t seekpos;
|
|
|
|
MdfdVec *v;
|
|
|
|
|
2023-04-08 01:04:49 +02:00
|
|
|
Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
|
|
|
|
|
2020-04-08 03:36:45 +02:00
|
|
|
v = _mdfd_getseg(reln, forknum, blocknum, false,
|
|
|
|
InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
|
|
|
|
if (v == NULL)
|
|
|
|
return false;
|
2009-01-12 06:10:45 +01:00
|
|
|
|
|
|
|
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
|
2009-06-11 16:49:15 +02:00
|
|
|
|
2009-01-12 06:10:45 +01:00
|
|
|
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
|
|
|
|
|
2017-03-18 12:43:01 +01:00
|
|
|
(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
|
2009-01-12 06:10:45 +01:00
|
|
|
#endif /* USE_PREFETCH */
|
2020-04-08 03:36:45 +02:00
|
|
|
|
|
|
|
return true;
|
2009-01-12 06:10:45 +01:00
|
|
|
}
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdread() -- Read the specified block from a relation.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2007-01-03 19:11:01 +01:00
|
|
|
void
|
2008-08-11 13:05:11 +02:00
|
|
|
mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
|
2023-02-27 07:45:44 +01:00
|
|
|
void *buffer)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2008-03-10 21:06:27 +01:00
|
|
|
off_t seekpos;
|
1996-07-09 08:22:35 +02:00
|
|
|
int nbytes;
|
|
|
|
MdfdVec *v;
|
|
|
|
|
2023-04-08 00:38:09 +02:00
|
|
|
/* If this build supports direct I/O, the buffer must be I/O aligned. */
|
|
|
|
if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
|
|
|
|
Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
|
|
|
|
|
2009-03-12 00:19:25 +01:00
|
|
|
TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
|
2022-07-06 17:39:09 +02:00
|
|
|
reln->smgr_rlocator.locator.spcOid,
|
|
|
|
reln->smgr_rlocator.locator.dbOid,
|
|
|
|
reln->smgr_rlocator.locator.relNumber,
|
|
|
|
reln->smgr_rlocator.backend);
|
2008-12-17 02:39:04 +01:00
|
|
|
|
2016-05-04 10:54:20 +02:00
|
|
|
v = _mdfd_getseg(reln, forknum, blocknum, false,
|
|
|
|
EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2008-03-10 21:06:27 +01:00
|
|
|
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
|
2009-06-11 16:49:15 +02:00
|
|
|
|
2008-03-10 21:06:27 +01:00
|
|
|
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2018-11-06 21:51:50 +01:00
|
|
|
nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
|
2008-12-17 02:39:04 +01:00
|
|
|
|
2009-03-12 00:19:25 +01:00
|
|
|
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
|
2022-07-06 17:39:09 +02:00
|
|
|
reln->smgr_rlocator.locator.spcOid,
|
|
|
|
reln->smgr_rlocator.locator.dbOid,
|
|
|
|
reln->smgr_rlocator.locator.relNumber,
|
|
|
|
reln->smgr_rlocator.backend,
|
2009-03-12 00:19:25 +01:00
|
|
|
nbytes,
|
|
|
|
BLCKSZ);
|
2008-12-17 02:39:04 +01:00
|
|
|
|
|
|
|
if (nbytes != BLCKSZ)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2007-01-03 19:11:01 +01:00
|
|
|
if (nbytes < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not read block %u in file \"%s\": %m",
|
|
|
|
blocknum, FilePathName(v->mdfd_vfd))));
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2001-05-10 22:38:49 +02:00
|
|
|
/*
|
2007-01-03 19:11:01 +01:00
|
|
|
* Short read: we are at or past EOF, or we read a partial block at
|
|
|
|
* EOF. Normally this is an error; upper levels should never try to
|
|
|
|
* read a nonexistent block. However, if zero_damaged_pages is ON or
|
|
|
|
* we are InRecovery, we should instead return zeroes without
|
|
|
|
* complaining. This allows, for example, the case of trying to
|
|
|
|
* update a block that was later truncated away.
|
2001-05-10 22:38:49 +02:00
|
|
|
*/
|
2007-01-03 19:11:01 +01:00
|
|
|
if (zero_damaged_pages || InRecovery)
|
1999-10-06 08:38:04 +02:00
|
|
|
MemSet(buffer, 0, BLCKSZ);
|
1996-07-09 08:22:35 +02:00
|
|
|
else
|
2007-01-03 19:11:01 +01:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_DATA_CORRUPTED),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
|
|
|
|
blocknum, FilePathName(v->mdfd_vfd),
|
2007-01-03 19:11:01 +01:00
|
|
|
nbytes, BLCKSZ)));
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
}
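The three outcomes of mdread() after the read call can be sketched as a pure decision function. This is an illustration only: `sketch_finish_read` and the `SketchReadResult` enum are hypothetical, the block size is the assumed default of 8192 bytes, and the single `zero_on_short_read` flag stands in for the `zero_damaged_pages || InRecovery` condition. Where the real code raises an ERROR, the sketch returns a status code.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192		/* bytes per block (assumed default) */

typedef enum SketchReadResult
{
	SKETCH_READ_OK,			/* a full block was read */
	SKETCH_READ_IO_ERROR,	/* the read call itself failed */
	SKETCH_READ_ZEROED,		/* short read, buffer replaced with zeroes */
	SKETCH_READ_SHORT		/* short read, reported as an error */
} SketchReadResult;

/*
 * Classify the byte count returned by a block read.  On a tolerated short
 * read the whole buffer is zeroed, not just the missing tail.
 */
static SketchReadResult
sketch_finish_read(int nbytes, void *buffer, int zero_on_short_read)
{
	if (nbytes == SKETCH_BLCKSZ)
		return SKETCH_READ_OK;

	if (nbytes < 0)
		return SKETCH_READ_IO_ERROR;

	if (zero_on_short_read)
	{
		memset(buffer, 0, SKETCH_BLCKSZ);
		return SKETCH_READ_ZEROED;
	}

	return SKETCH_READ_SHORT;
}
```

Note the ordering: an outright I/O failure is never converted into a zeroed page, only a read that came back shorter than one block.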
|
|
|
|
|
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdwrite() -- Write the supplied block at the appropriate location.
|
2007-01-03 19:11:01 +01:00
|
|
|
*
|
2023-05-19 10:52:04 +02:00
|
|
|
* This is to be used only for updating already-existing blocks of a
|
|
|
|
* relation (i.e., those before the current EOF). To extend a relation,
|
|
|
|
* use mdextend().
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2007-01-03 19:11:01 +01:00
|
|
|
void
|
2008-08-11 13:05:11 +02:00
|
|
|
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
|
2023-02-27 07:45:44 +01:00
|
|
|
const void *buffer, bool skipFsync)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2008-03-10 21:06:27 +01:00
|
|
|
off_t seekpos;
|
2007-01-03 19:11:01 +01:00
|
|
|
int nbytes;
|
1996-07-09 08:22:35 +02:00
|
|
|
MdfdVec *v;
|
|
|
|
|
2023-04-08 00:38:09 +02:00
|
|
|
/* If this build supports direct I/O, the buffer must be I/O aligned. */
|
|
|
|
if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
|
|
|
|
Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
|
|
|
|
|
2007-01-03 19:11:01 +01:00
|
|
|
/* This assert is too expensive to have on normally ... */
|
|
|
|
#ifdef CHECK_WRITE_VS_EXTEND
|
2008-08-11 13:05:11 +02:00
|
|
|
Assert(blocknum < mdnblocks(reln, forknum));
|
2007-01-03 19:11:01 +01:00
|
|
|
#endif
|
|
|
|
|
2009-03-12 00:19:25 +01:00
|
|
|
TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
reln->smgr_rlocator.locator.spcOid,
|
|
|
|
reln->smgr_rlocator.locator.dbOid,
|
|
|
|
reln->smgr_rlocator.locator.relNumber,
|
|
|
|
reln->smgr_rlocator.backend);
|
2008-12-17 02:39:04 +01:00
|
|
|
|
2016-05-04 10:54:20 +02:00
|
|
|
v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
|
|
|
|
EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2008-03-10 21:06:27 +01:00
|
|
|
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
|
2009-06-11 16:49:15 +02:00
|
|
|
|
2008-03-10 21:06:27 +01:00
|
|
|
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2018-11-06 21:51:50 +01:00
|
|
|
nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
|
2008-12-17 02:39:04 +01:00
|
|
|
|
2009-03-12 00:19:25 +01:00
|
|
|
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
|
2022-07-06 17:39:09 +02:00
|
|
|
reln->smgr_rlocator.locator.spcOid,
|
|
|
|
reln->smgr_rlocator.locator.dbOid,
|
|
|
|
reln->smgr_rlocator.locator.relNumber,
|
|
|
|
reln->smgr_rlocator.backend,
|
2009-03-12 00:19:25 +01:00
|
|
|
nbytes,
|
|
|
|
BLCKSZ);
|
2008-12-17 02:39:04 +01:00
|
|
|
|
|
|
|
if (nbytes != BLCKSZ)
|
2004-05-31 22:31:33 +02:00
|
|
|
{
|
2007-01-03 19:11:01 +01:00
|
|
|
if (nbytes < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not write block %u in file \"%s\": %m",
|
|
|
|
blocknum, FilePathName(v->mdfd_vfd))));
|
2007-01-03 19:11:01 +01:00
|
|
|
/* short write: complain appropriately */
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_DISK_FULL),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
|
2007-01-03 19:11:01 +01:00
|
|
|
blocknum,
|
2009-08-05 20:01:54 +02:00
|
|
|
FilePathName(v->mdfd_vfd),
|
2007-01-03 19:11:01 +01:00
|
|
|
nbytes, BLCKSZ),
|
|
|
|
errhint("Check free disk space.")));
|
2004-05-31 22:31:33 +02:00
|
|
|
}
|
2004-05-31 05:48:10 +02:00
|
|
|
|
2010-08-13 22:10:54 +02:00
|
|
|
if (!skipFsync && !SmgrIsTemp(reln))
|
2008-08-11 13:05:11 +02:00
|
|
|
register_dirty_segment(reln, forknum, v);
|
2000-04-09 06:43:20 +02:00
|
|
|
}
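The address arithmetic used by mdwrite() can be sketched in isolation. This is only an illustration, assuming a block size of 8192 bytes and 131072 blocks per segment file; the `sketch_` names are hypothetical and are not part of md.c.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>

/* Assumed values for illustration; real builds take these from configure. */
#define SKETCH_BLCKSZ      8192u
#define SKETCH_RELSEG_SIZE 131072u	/* blocks per segment file */

typedef uint32_t sketch_blockno;	/* hypothetical stand-in for BlockNumber */

/* Index of the segment file that holds the given block. */
static inline sketch_blockno
sketch_segment_of(sketch_blockno blocknum)
{
	return blocknum / SKETCH_RELSEG_SIZE;
}

/* Byte offset of the given block inside its segment file. */
static inline off_t
sketch_seekpos_of(sketch_blockno blocknum)
{
	off_t		seekpos = (off_t) SKETCH_BLCKSZ * (blocknum % SKETCH_RELSEG_SIZE);

	/* Same sanity check as in mdwrite(): the offset stays inside one segment. */
	assert(seekpos < (off_t) SKETCH_BLCKSZ * SKETCH_RELSEG_SIZE);
	return seekpos;
}
```

With these assumed sizes, block 131073 maps to segment 1 at byte offset 8192, since it is the second block of the second segment file.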
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2023-05-19 13:42:06 +02:00
|
|
|
/*
|
|
|
|
* mdwriteback() -- Tell the kernel to write pages back to storage.
|
|
|
|
*
|
|
|
|
* This accepts a range of blocks because flushing several pages at once is
|
|
|
|
* considerably more efficient than doing so individually.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
mdwriteback(SMgrRelation reln, ForkNumber forknum,
|
|
|
|
BlockNumber blocknum, BlockNumber nblocks)
|
|
|
|
{
|
|
|
|
Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Issue flush requests in as few requests as possible; have to split at
|
|
|
|
* segment boundaries though, since those are actually separate files.
|
|
|
|
*/
|
|
|
|
while (nblocks > 0)
|
|
|
|
{
|
|
|
|
BlockNumber nflush = nblocks;
|
|
|
|
off_t seekpos;
|
|
|
|
MdfdVec *v;
|
|
|
|
int segnum_start,
|
|
|
|
segnum_end;
|
|
|
|
|
|
|
|
v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ ,
|
|
|
|
EXTENSION_DONT_OPEN);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We might be flushing buffers of already removed relations; that's
|
|
|
|
* ok, just ignore that case. If the segment file wasn't open already
|
|
|
|
* (i.e., from a recent mdwrite()), then we don't want to re-open it, to
|
|
|
|
* avoid a race with PROCSIGNAL_BARRIER_SMGRRELEASE that might leave
|
|
|
|
* us with a descriptor to a file that is about to be unlinked.
|
|
|
|
*/
|
|
|
|
if (!v)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* compute the number of the segment containing the first block */
|
|
|
|
segnum_start = blocknum / RELSEG_SIZE;
|
|
|
|
|
|
|
|
/* compute how many of the requested blocks lie within the current segment */
|
|
|
|
segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE;
|
|
|
|
if (segnum_start != segnum_end)
|
|
|
|
nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE));
|
|
|
|
|
|
|
|
Assert(nflush >= 1);
|
|
|
|
Assert(nflush <= nblocks);
|
|
|
|
|
|
|
|
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
|
|
|
|
|
|
|
|
FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
|
|
|
|
|
|
|
|
nblocks -= nflush;
|
|
|
|
blocknum += nflush;
|
|
|
|
}
|
|
|
|
}
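The way mdwriteback() splits a block range at segment boundaries can be sketched without any file I/O. This is a minimal illustration, assuming 131072 blocks per segment; the `sketch_` helpers are hypothetical and only count the requests that the loop would issue.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_RELSEG_SIZE 131072u	/* assumed blocks per segment file */

typedef uint32_t sketch_blockno;	/* hypothetical stand-in for BlockNumber */

/*
 * Return how many of the nblocks blocks starting at blocknum lie in the
 * same segment file as blocknum itself.
 */
static sketch_blockno
sketch_blocks_in_first_segment(sketch_blockno blocknum, sketch_blockno nblocks)
{
	sketch_blockno nflush = nblocks;
	sketch_blockno seg_start;
	sketch_blockno seg_end;

	assert(nblocks > 0);
	seg_start = blocknum / SKETCH_RELSEG_SIZE;
	seg_end = (blocknum + nblocks - 1) / SKETCH_RELSEG_SIZE;

	/* The range crosses a boundary: stop at the end of the first segment. */
	if (seg_start != seg_end)
		nflush = SKETCH_RELSEG_SIZE - (blocknum % SKETCH_RELSEG_SIZE);

	assert(nflush >= 1 && nflush <= nblocks);
	return nflush;
}

/* Count the per-segment flush requests needed to cover the whole range. */
static int
sketch_count_flush_requests(sketch_blockno blocknum, sketch_blockno nblocks)
{
	int			requests = 0;

	while (nblocks > 0)
	{
		sketch_blockno nflush = sketch_blocks_in_first_segment(blocknum, nblocks);

		requests++;
		nblocks -= nflush;
		blocknum += nflush;
	}
	return requests;
}
```

For example, a range of 5 blocks starting at block 131070 is split into a request for 2 blocks at the end of segment 0 and a request for 3 blocks at the start of segment 1.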
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdnblocks() -- Get the number of blocks stored in a relation.
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
2023-05-19 10:52:04 +02:00
|
|
|
* Important side effect: all active segments of the relation are opened
|
|
|
|
* and added to the md_seg_fds array. If this routine has not been
|
|
|
|
* called, then only segments up to the last one actually touched
|
|
|
|
* are present in the array.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2001-06-28 01:31:40 +02:00
|
|
|
BlockNumber
|
2008-08-11 13:05:11 +02:00
|
|
|
mdnblocks(SMgrRelation reln, ForkNumber forknum)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2020-09-04 04:57:35 +02:00
|
|
|
MdfdVec *v;
|
2001-06-28 01:31:40 +02:00
|
|
|
BlockNumber nblocks;
|
2020-09-04 04:57:35 +02:00
|
|
|
BlockNumber segno;
|
|
|
|
|
|
|
|
mdopenfork(reln, forknum, EXTENSION_FAIL);
|
2003-01-07 02:19:12 +01:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
/* mdopenfork has opened the first segment */
|
|
|
|
Assert(reln->md_num_open_segs[forknum] > 0);
|
|
|
|
|
2003-01-07 02:19:12 +01:00
|
|
|
/*
|
2016-09-09 02:02:43 +02:00
|
|
|
* Start from the last open segment, to avoid redundant seeks. We have
|
|
|
|
* previously verified that these segments are exactly RELSEG_SIZE long,
|
|
|
|
* and it's useless to recheck that each time.
|
2006-11-20 02:07:56 +01:00
|
|
|
*
|
|
|
|
* NOTE: this assumption could only be wrong if another backend has
|
2003-01-07 02:19:12 +01:00
|
|
|
* truncated the relation. We rely on higher code levels to handle that
|
2006-11-20 02:07:56 +01:00
|
|
|
* scenario by closing and re-opening the md fd, which is handled via
|
2011-11-01 18:14:47 +01:00
|
|
|
* relcache flush. (Since the checkpointer doesn't participate in
|
2016-09-09 02:02:43 +02:00
|
|
|
* relcache flush, it could have segment entries for inactive segments;
|
|
|
|
* that's OK because the checkpointer never needs to compute relation
|
|
|
|
* size.)
|
2003-01-07 02:19:12 +01:00
|
|
|
*/
|
2016-09-09 02:02:43 +02:00
|
|
|
segno = reln->md_num_open_segs[forknum] - 1;
|
|
|
|
v = &reln->md_seg_fds[forknum][segno];
|
2003-01-07 02:19:12 +01:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
for (;;)
|
|
|
|
{
|
2008-08-11 13:05:11 +02:00
|
|
|
nblocks = _mdnblocks(reln, forknum, v);
|
2001-06-28 01:31:40 +02:00
|
|
|
if (nblocks > ((BlockNumber) RELSEG_SIZE))
|
2003-07-25 00:04:15 +02:00
|
|
|
elog(FATAL, "segment too big");
|
2001-06-28 01:31:40 +02:00
|
|
|
if (nblocks < ((BlockNumber) RELSEG_SIZE))
|
|
|
|
return (segno * ((BlockNumber) RELSEG_SIZE)) + nblocks;
|
2001-10-25 07:50:21 +02:00
|
|
|
|
2001-05-10 22:38:49 +02:00
|
|
|
/*
|
|
|
|
* If segment is exactly RELSEG_SIZE, advance to next one.
|
|
|
|
*/
|
|
|
|
segno++;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
/*
|
2019-01-10 01:36:25 +01:00
|
|
|
* We used to pass O_CREAT here, but that has the disadvantage that it
|
|
|
|
* might create a segment which has vanished through some operating
|
2016-09-09 02:02:43 +02:00
|
|
|
* system misadventure. In such a case, creating the segment here
|
|
|
|
* undermines _mdfd_getseg's attempts to notice and report an error
|
|
|
|
* upon access to a missing segment.
|
|
|
|
*/
|
|
|
|
v = _mdfd_openseg(reln, forknum, segno, 0);
|
|
|
|
if (v == NULL)
|
|
|
|
return segno * ((BlockNumber) RELSEG_SIZE);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
}
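The counting rule in mdnblocks() can be sketched over an in-memory list of segment lengths. This is a simplified illustration, assuming 131072 blocks per segment: it walks from segment 0 rather than from the last open segment, and the array stands in for the files that _mdfd_openseg() would find. All `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RELSEG_SIZE 131072u	/* assumed blocks per segment file */

typedef uint32_t sketch_blockno;	/* hypothetical stand-in for BlockNumber */

/*
 * seg_blocks[i] is the length in blocks of segment file i; nsegs is the
 * number of segment files that exist.  The relation ends at the first
 * segment shorter than SKETCH_RELSEG_SIZE, or where the next file is missing.
 */
static sketch_blockno
sketch_total_blocks(const sketch_blockno *seg_blocks, size_t nsegs)
{
	size_t		segno = 0;

	assert(nsegs > 0);
	for (;;)
	{
		sketch_blockno nblocks = seg_blocks[segno];

		/* md.c reports "segment too big" here; the sketch just asserts. */
		assert(nblocks <= SKETCH_RELSEG_SIZE);

		if (nblocks < SKETCH_RELSEG_SIZE)
			return (sketch_blockno) segno * SKETCH_RELSEG_SIZE + nblocks;

		/* Segment is exactly full, so advance to the next one. */
		segno++;
		if (segno == nsegs)
			return (sketch_blockno) segno * SKETCH_RELSEG_SIZE;
	}
}
```

Note that a zero-length segment ends the count even if later files exist, which is why inactive segments left behind by truncation do not affect the reported size.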
|
|
|
|
|
1996-11-27 08:24:02 +01:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdtruncate() -- Truncate relation to specified number of blocks.
|
1996-11-27 08:24:02 +01:00
|
|
|
*/
|
2007-01-03 19:11:01 +01:00
|
|
|
void
|
2010-08-13 22:10:54 +02:00
|
|
|
mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
|
1996-11-27 08:24:02 +01:00
|
|
|
{
|
2001-06-28 01:31:40 +02:00
|
|
|
BlockNumber curnblk;
|
|
|
|
BlockNumber priorblocks;
|
2016-09-09 02:02:43 +02:00
|
|
|
int curopensegs;
|
1996-11-27 08:24:02 +01:00
|
|
|
|
1999-09-02 04:57:50 +02:00
|
|
|
/*
|
2006-11-20 02:07:56 +01:00
|
|
|
* NOTE: mdnblocks makes sure we have opened all active segments, so that
|
|
|
|
* the truncation loop will get them all!
|
1999-09-02 04:57:50 +02:00
|
|
|
*/
|
2008-08-11 13:05:11 +02:00
|
|
|
curnblk = mdnblocks(reln, forknum);
|
2001-06-28 01:31:40 +02:00
|
|
|
if (nblocks > curnblk)
|
2007-01-03 19:11:01 +01:00
|
|
|
{
|
|
|
|
/* Bogus request ... but no complaint if InRecovery */
|
|
|
|
if (InRecovery)
|
|
|
|
return;
|
|
|
|
ereport(ERROR,
|
2009-08-05 20:01:54 +02:00
|
|
|
(errmsg("could not truncate file \"%s\" to %u blocks: it's only %u blocks now",
|
2022-07-06 17:39:09 +02:00
|
|
|
relpath(reln->smgr_rlocator, forknum),
|
2007-01-03 19:11:01 +01:00
|
|
|
nblocks, curnblk)));
|
|
|
|
}
|
1999-09-02 04:57:50 +02:00
|
|
|
if (nblocks == curnblk)
|
2007-01-03 19:11:01 +01:00
|
|
|
return; /* no work */
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
/*
|
|
|
|
* Truncate segments, starting at the last one. Starting at the end makes
|
|
|
|
* managing the memory for the fd array easier, should there be errors.
|
|
|
|
*/
|
|
|
|
curopensegs = reln->md_num_open_segs[forknum];
|
|
|
|
while (curopensegs > 0)
|
1999-06-18 18:47:23 +02:00
|
|
|
{
|
2016-09-09 02:02:43 +02:00
|
|
|
MdfdVec *v;
|
|
|
|
|
|
|
|
priorblocks = (curopensegs - 1) * RELSEG_SIZE;
|
|
|
|
|
|
|
|
v = &reln->md_seg_fds[forknum][curopensegs - 1];
|
1999-09-02 04:57:50 +02:00
|
|
|
|
|
|
|
if (priorblocks > nblocks)
|
1999-06-18 18:47:23 +02:00
|
|
|
{
|
1999-09-02 04:57:50 +02:00
|
|
|
/*
|
2016-09-09 02:02:43 +02:00
|
|
|
* This segment is no longer active. We truncate the file, but do
|
|
|
|
* not delete it, for reasons explained in the header comments.
|
1999-09-02 04:57:50 +02:00
|
|
|
*/
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
if (FileTruncate(v->mdfd_vfd, 0, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
|
2007-01-03 19:11:01 +01:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not truncate file \"%s\": %m",
|
|
|
|
FilePathName(v->mdfd_vfd))));
|
|
|
|
|
2010-08-13 22:10:54 +02:00
|
|
|
if (!SmgrIsTemp(reln))
|
2008-08-11 13:05:11 +02:00
|
|
|
register_dirty_segment(reln, forknum, v);
|
2016-09-09 02:02:43 +02:00
|
|
|
|
|
|
|
/* we never drop the 1st segment */
|
|
|
|
Assert(v != &reln->md_seg_fds[forknum][0]);
|
|
|
|
|
|
|
|
FileClose(v->mdfd_vfd);
|
|
|
|
_fdvec_resize(reln, forknum, curopensegs - 1);
|
1999-06-18 18:47:23 +02:00
|
|
|
}
|
2001-06-28 01:31:40 +02:00
|
|
|
else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
|
1999-09-02 04:57:50 +02:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* This is the last segment we want to keep. Truncate the file to
|
2016-09-09 02:02:43 +02:00
|
|
|
* the right length. NOTE: if nblocks is exactly a multiple K of
|
|
|
|
* RELSEG_SIZE, we will truncate the K+1st segment to 0 length but
|
|
|
|
* keep it. This adheres to the invariant given in the header
|
|
|
|
* comments.
|
1999-09-02 04:57:50 +02:00
|
|
|
*/
|
2001-06-28 01:31:40 +02:00
|
|
|
BlockNumber lastsegblocks = nblocks - priorblocks;
|
2000-04-12 19:17:23 +02:00
|
|
|
|
2017-03-18 12:43:01 +01:00
|
|
|
if (FileTruncate(v->mdfd_vfd, (off_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
|
2007-01-03 19:11:01 +01:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not truncate file \"%s\" to %u blocks: %m",
|
|
|
|
FilePathName(v->mdfd_vfd),
|
2007-01-03 19:11:01 +01:00
|
|
|
nblocks)));
|
2010-08-13 22:10:54 +02:00
|
|
|
if (!SmgrIsTemp(reln))
|
2008-08-11 13:05:11 +02:00
|
|
|
register_dirty_segment(reln, forknum, v);
|
1999-09-02 04:57:50 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
2016-09-09 02:02:43 +02:00
|
|
|
* We still need this segment, so nothing to do for this and any
|
|
|
|
* earlier segment.
|
1999-09-02 04:57:50 +02:00
|
|
|
*/
|
2016-09-09 02:02:43 +02:00
|
|
|
break;
|
1999-09-02 04:57:50 +02:00
|
|
|
}
|
2016-09-09 02:02:43 +02:00
|
|
|
curopensegs--;
|
1999-06-18 18:47:23 +02:00
|
|
|
}
|
2000-06-28 05:33:33 +02:00
|
|
|
}
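The three-way decision that mdtruncate() makes for each open segment can be sketched as a pure function. This is an illustration only, assuming 131072 blocks per segment; the enum and the `sketch_` names are hypothetical, and the sketch ignores the arithmetic overflow that could occur for segment numbers near the top of the 32-bit range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RELSEG_SIZE 131072u	/* assumed blocks per segment file */

typedef uint32_t sketch_blockno;	/* hypothetical stand-in for BlockNumber */

typedef enum
{
	SKETCH_KEEP,				/* segment is still fully needed */
	SKETCH_TRUNCATE_PARTIAL,	/* last segment to keep: shorten it */
	SKETCH_TRUNCATE_EMPTY		/* inactive segment: truncate to zero */
} sketch_action;

/*
 * Decide what happens to segment segno when the relation is truncated to
 * nblocks blocks.  *newlen receives the segment's new length in blocks.
 */
static sketch_action
sketch_truncate_action(sketch_blockno segno, sketch_blockno nblocks,
					   sketch_blockno *newlen)
{
	sketch_blockno priorblocks = segno * SKETCH_RELSEG_SIZE;

	assert(newlen != NULL);
	if (priorblocks > nblocks)
	{
		*newlen = 0;
		return SKETCH_TRUNCATE_EMPTY;
	}
	else if (priorblocks + SKETCH_RELSEG_SIZE > nblocks)
	{
		*newlen = nblocks - priorblocks;
		return SKETCH_TRUNCATE_PARTIAL;
	}
	*newlen = SKETCH_RELSEG_SIZE;
	return SKETCH_KEEP;
}
```

When nblocks is exactly one full segment, segment 1 falls into the partial case with a new length of zero, so it is emptied but kept, which matches the NOTE in the comment inside mdtruncate().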
|
1996-11-27 08:24:02 +01:00
|
|
|
|
2004-06-02 19:28:18 +02:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* mdimmedsync() -- Immediately sync a relation to stable storage.
|
2005-06-20 20:37:02 +02:00
|
|
|
*
|
|
|
|
* Note that only writes already issued are synced; this routine knows
|
Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.
To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.
Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node. Make relcache.c retain entries for certain
dropped relations until end of transaction.
Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.
Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas. Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem. Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.
Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-04-04 21:25:34 +02:00
|
|
|
* nothing of dirty buffers that may exist inside the buffer manager. We
|
|
|
|
* sync active and inactive segments; smgrDoPendingSyncs() relies on this.
|
|
|
|
* Consider a relation skipping WAL. Suppose a checkpoint syncs blocks of
|
|
|
|
* some segment, then mdtruncate() renders that segment inactive. If we
|
|
|
|
* crash before the next checkpoint syncs the newly-inactive segment, that
|
|
|
|
* segment may survive recovery, reintroducing unwanted data into the table.
|
2004-06-02 19:28:18 +02:00
|
|
|
*/
|
2007-01-03 19:11:01 +01:00
|
|
|
void
|
2008-08-11 13:05:11 +02:00
|
|
|
mdimmedsync(SMgrRelation reln, ForkNumber forknum)
|
2004-06-02 19:28:18 +02:00
|
|
|
{
|
2016-09-09 02:02:43 +02:00
|
|
|
int segno;
|
2020-04-04 21:25:34 +02:00
|
|
|
int min_inactive_seg;
|
2004-06-02 19:28:18 +02:00
|
|
|
|
|
|
|
/*
|
2006-11-20 02:07:56 +01:00
|
|
|
* NOTE: mdnblocks makes sure we have opened all active segments, so that
|
2004-06-02 19:28:18 +02:00
|
|
|
* the fsync loop will get them all!
|
|
|
|
*/
|
2011-04-11 21:28:45 +02:00
|
|
|
mdnblocks(reln, forknum);
|
2004-06-02 19:28:18 +02:00
|
|
|
|
2020-04-04 21:25:34 +02:00
|
|
|
min_inactive_seg = segno = reln->md_num_open_segs[forknum];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Temporarily open inactive segments, then close them after sync. There
|
|
|
|
* may be some inactive segments left opened after fsync() error, but that
|
|
|
|
* is harmless. We don't clean them up, to avoid the risk of
|
|
|
|
* further trouble. The next mdclose() will soon close them.
|
|
|
|
*/
|
|
|
|
while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
|
|
|
|
segno++;
|
2004-06-02 19:28:18 +02:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
while (segno > 0)
|
2004-06-02 19:28:18 +02:00
|
|
|
{
|
2016-09-09 02:02:43 +02:00
|
|
|
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
|
|
|
|
|
2023-02-10 07:22:26 +01:00
|
|
|
/*
|
|
|
|
* fsyncs done through mdimmedsync() should be tracked in a separate
|
|
|
|
* IOContext than those done through mdsyncfiletag() to differentiate
|
|
|
|
* between unavoidable client backend fsyncs (e.g. those done during
|
|
|
|
* index build) and those which ideally would have been done by the
|
|
|
|
* checkpointer. Since other IO operations bypassing the buffer
|
|
|
|
* manager could also be tracked in such an IOContext, wait until
|
|
|
|
* those are tracked as well before tracking immediate fsyncs.
|
|
|
|
*/
|
2017-03-18 12:43:01 +01:00
|
|
|
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
|
PANIC on fsync() failure.
On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
write-back failure. In that case the only remaining copy of the
data is in the WAL. A subsequent fsync() could appear to succeed,
but not have flushed the data. That means that a future checkpoint
could apparently complete successfully but have lost data.
Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure. Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.
Provide a GUC data_sync_retry to control this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses. If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that does not throw away buffers on failure); if the error is
permanent, later checkpoints will continue to fail. The GUC defaults
to off, meaning that we panic.
Back-patch to all supported releases.
There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error. A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated to back-patch.
Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
2018-11-19 01:31:10 +01:00
|
|
|
ereport(data_sync_elevel(ERROR),
|
2007-01-03 19:11:01 +01:00
|
|
|
(errcode_for_file_access(),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not fsync file \"%s\": %m",
|
|
|
|
FilePathName(v->mdfd_vfd))));
|
Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.
To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.
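The commit-time choice described above might look something like the sketch below. It assumes the threshold is expressed in kilobytes and that storage at or above the threshold is fsync'd while smaller storage is copied to WAL; the block size, type names, and function name are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed block size in bytes, for illustration only. */
#define SKETCH_BLCKSZ 8192

typedef enum
{
    SKETCH_WRITE_WAL,           /* copy the contents to WAL */
    SKETCH_FSYNC                /* fsync the file instead */
} SketchCommitAction;

/* Pick an action from the relation size and a threshold given in kB. */
SketchCommitAction
sketch_choose_commit_action(uint64_t nblocks, int wal_skip_threshold_kb)
{
    uint64_t    size_kb = nblocks * SKETCH_BLCKSZ / 1024;

    assert(wal_skip_threshold_kb >= 0);
    return size_kb >= (uint64_t) wal_skip_threshold_kb
        ? SKETCH_FSYNC
        : SKETCH_WRITE_WAL;
}
```

Under these assumptions, raising the threshold moves more small relations onto the WAL-copy path, which is the knob the message suggests adjusting.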
Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node. Make relcache.c retain entries for certain
dropped relations until end of transaction.
Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.
Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas. Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem. Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.
Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-04-04 21:25:34 +02:00
|
|
|
|
|
|
|
/* Close inactive segments immediately */
|
|
|
|
if (segno > min_inactive_seg)
|
|
|
|
{
|
|
|
|
FileClose(v->mdfd_vfd);
|
|
|
|
_fdvec_resize(reln, forknum, segno - 1);
|
|
|
|
}
|
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
segno--;
|
2004-06-02 19:28:18 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2004-05-31 05:48:10 +02:00
|
|
|
* register_dirty_segment() -- Mark a relation segment as needing fsync
|
|
|
|
*
|
|
|
|
* If there is a local pending-ops table, just make an entry in it for
|
2019-04-04 10:56:03 +02:00
|
|
|
* ProcessSyncRequests to process later. Otherwise, try to pass off the
|
|
|
|
* fsync request to the checkpointer process. If that fails, just do the
|
|
|
|
* fsync locally before returning (we hope this will not happen often
|
|
|
|
* enough to be a performance problem).
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2007-01-03 19:11:01 +01:00
|
|
|
static void
|
2008-08-11 13:05:11 +02:00
|
|
|
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2019-04-04 10:56:03 +02:00
|
|
|
FileTag tag;
|
|
|
|
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from their
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
INIT_MD_FILETAG(tag, reln->smgr_rlocator.locator, forknum, seg->mdfd_segno);
|
2019-04-04 10:56:03 +02:00
|
|
|
|
Improve coding around the fsync request queue.
In all branches back to 8.3, this patch fixes a questionable assumption in
CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are
no uninitialized pad bytes in the request queue structs. This would only
cause trouble if (a) there were such pad bytes, which could happen in 8.4
and up if the compiler makes enum ForkNumber narrower than 32 bits, but
otherwise would require not-currently-planned changes in the widths of
other typedefs; and (b) the kernel has not uniformly initialized the
contents of shared memory to zeroes. Still, it seems a tad risky, and we
can easily remove any risk by pre-zeroing the request array for ourselves.
In addition to that, we need to establish a coding rule that struct
RelFileNode can't contain any padding bytes, since such structs are copied
into the request array verbatim. (There are other places that are assuming
this anyway, it turns out.)
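The pre-zeroing idea can be illustrated with a small sketch. The struct layout and `sketch_` names are hypothetical; the point is only that zeroing the whole struct before filling its fields makes a bytewise comparison well-behaved on common compilers even when padding is present.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct SketchRequest
{
    unsigned int  relnumber;
    unsigned char forknum;      /* narrow field: padding may follow it */
    unsigned int  segno;
} SketchRequest;

/* Zero every byte first, padding included, then fill in the fields. */
void
sketch_init_request(SketchRequest *req, unsigned int relnumber,
                    unsigned char forknum, unsigned int segno)
{
    assert(req != NULL);
    memset(req, 0, sizeof(*req));
    req->relnumber = relnumber;
    req->forknum = forknum;
    req->segno = segno;
}

/* Bytewise comparison, as code that copies structs verbatim would do. */
bool
sketch_requests_equal(const SketchRequest *a, const SketchRequest *b)
{
    return memcmp(a, b, sizeof(SketchRequest)) == 0;
}
```

Without the `memset`, two requests with identical fields could compare unequal if their padding bytes happened to differ.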
In 9.1 and up, the risk was a bit larger because we were also effectively
assuming that struct RelFileNodeBackend contained no pad bytes, and with
fields of different types in there, that would be much easier to break.
However, there is no good reason to ever transmit fsync or delete requests
for temp files to the bgwriter/checkpointer, so we can revert the request
structs to plain RelFileNode, getting rid of the padding risk and saving
some marginal number of bytes and cycles in fsync queue manipulation while
we are at it. The savings might be more than marginal during deletion of
a temp relation, because the old code transmitted an entirely useless but
nonetheless expensive-to-process ForgetRelationFsync request to the
background process, and also had the background process perform the file
deletion even though that can safely be done immediately.
In addition, make some cleanup of nearby comments and small improvements to
the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.
2012-07-17 22:55:39 +02:00
|
|
|
/* Temp relations should never be fsync'd */
|
|
|
|
Assert(!SmgrIsTemp(reln));
|
|
|
|
|
2019-04-04 10:56:03 +02:00
|
|
|
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
|
2004-05-31 05:48:10 +02:00
|
|
|
{
|
2023-04-08 01:05:26 +02:00
|
|
|
instr_time io_start;
|
|
|
|
|
|
|
|
ereport(DEBUG1,
|
|
|
|
(errmsg_internal("could not forward fsync request because request queue is full")));
|
|
|
|
|
|
|
|
io_start = pgstat_prepare_io_time();
|
|
|
|
|
|
|
|
if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
|
|
|
|
ereport(data_sync_elevel(ERROR),
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not fsync file \"%s\": %m",
|
|
|
|
FilePathName(seg->mdfd_vfd))));
|
|
|
|
|
2023-02-10 07:22:26 +01:00
|
|
|
/*
|
|
|
|
* We have no way of knowing if the current IOContext is
|
|
|
|
* IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
|
|
|
|
* point, so count the fsync as being in the IOCONTEXT_NORMAL
|
|
|
|
* IOContext. This is probably okay, because the number of backend
|
|
|
|
* fsyncs doesn't say anything about the efficacy of the
|
|
|
|
* BufferAccessStrategy. And counting both fsyncs done in
|
|
|
|
* IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
|
|
|
|
* IOCONTEXT_NORMAL is likely clearer when investigating the number of
|
|
|
|
* backend fsyncs.
|
|
|
|
*/
|
2023-04-08 01:05:26 +02:00
|
|
|
pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
|
|
|
|
IOOP_FSYNC, io_start, 1);
|
2007-01-03 19:11:01 +01:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
2007-11-15 21:36:40 +01:00
|
|
|
/*
|
2019-06-08 04:46:38 +02:00
|
|
|
* register_unlink_segment() -- Schedule a file to be deleted after next checkpoint
|
2007-11-15 21:36:40 +01:00
|
|
|
*/
|
|
|
|
static void
|
2022-07-06 17:39:09 +02:00
|
|
|
register_unlink_segment(RelFileLocatorBackend rlocator, ForkNumber forknum,
|
2019-04-04 10:56:03 +02:00
|
|
|
BlockNumber segno)
|
2007-11-15 21:36:40 +01:00
|
|
|
{
|
2019-04-04 10:56:03 +02:00
|
|
|
FileTag tag;
|
|
|
|
|
2022-07-06 17:39:09 +02:00
|
|
|
INIT_MD_FILETAG(tag, rlocator.locator, forknum, segno);
|
2019-04-04 10:56:03 +02:00
|
|
|
|
2012-07-17 22:55:39 +02:00
|
|
|
/* Should never be used with temp relations */
|
2022-07-06 17:39:09 +02:00
|
|
|
Assert(!RelFileLocatorBackendIsTemp(rlocator));
|
2012-07-17 22:55:39 +02:00
|
|
|
|
2019-04-04 10:56:03 +02:00
|
|
|
RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, true /* retryOnError */ );
|
2007-11-15 21:36:40 +01:00
|
|
|
}
|
|
|
|
|
2000-10-28 18:21:00 +02:00
|
|
|
/*
|
2019-04-04 10:56:03 +02:00
|
|
|
* register_forget_request() -- forget any fsyncs for a relation fork's segment
|
2000-10-28 18:21:00 +02:00
|
|
|
*/
|
2019-04-04 10:56:03 +02:00
|
|
|
static void
|
2022-07-06 17:39:09 +02:00
|
|
|
register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
|
2019-04-04 10:56:03 +02:00
|
|
|
BlockNumber segno)
|
2000-10-28 18:21:00 +02:00
|
|
|
{
|
2019-04-04 10:56:03 +02:00
|
|
|
FileTag tag;
|
2007-01-17 17:25:01 +01:00
|
|
|
|
2022-07-06 17:39:09 +02:00
|
|
|
INIT_MD_FILETAG(tag, rlocator.locator, forknum, segno);
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2019-04-04 10:56:03 +02:00
|
|
|
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
|
2007-01-17 17:25:01 +01:00
|
|
|
}
|
2004-05-31 05:48:10 +02:00
|
|
|
|
2007-01-17 17:25:01 +01:00
|
|
|
/*
|
2019-04-25 16:43:48 +02:00
|
|
|
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
|
2007-01-17 17:25:01 +01:00
|
|
|
*/
|
|
|
|
void
|
2019-04-04 10:56:03 +02:00
|
|
|
ForgetDatabaseSyncRequests(Oid dbid)
|
2007-01-17 17:25:01 +01:00
|
|
|
{
|
2019-04-04 10:56:03 +02:00
|
|
|
FileTag tag;
|
2022-07-06 17:39:09 +02:00
|
|
|
RelFileLocator rlocator;
|
2007-01-17 17:25:01 +01:00
|
|
|
|
2022-07-06 17:39:09 +02:00
|
|
|
rlocator.dbOid = dbid;
|
|
|
|
rlocator.spcOid = 0;
|
|
|
|
rlocator.relNumber = 0;
|
2007-01-17 17:25:01 +01:00
|
|
|
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
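The terminology split described in this message can be sketched in a few lines of C. This is a standalone illustration rather than the PostgreSQL headers: the `Oid` typedef is a stand-in, and `locator_equal()` is a hypothetical helper that exists only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's Oid (an unsigned 32-bit integer). */
typedef uint32_t Oid;

/* (1) The single number that names a relation's files. */
typedef Oid RelFileNumber;

/* (2) Everything needed to locate a relation's files on disk. */
typedef struct RelFileLocator
{
    Oid           spcOid;     /* tablespace */
    Oid           dbOid;      /* database */
    RelFileNumber relNumber;  /* relation file number */
} RelFileLocator;

/* Hypothetical helper: locators match only if all three parts match. */
static bool
locator_equal(RelFileLocator a, RelFileLocator b)
{
    return a.spcOid == b.spcOid &&
           a.dbOid == b.dbOid &&
           a.relNumber == b.relNumber;
}
```

Two relations in different databases may share a file number, which is why the number alone (1) is not enough to find the files and the locator (2) is needed.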
2022-07-06 17:39:09 +02:00
|
|
|
INIT_MD_FILETAG(tag, rlocator, InvalidForkNumber, InvalidBlockNumber);
|
2019-04-04 10:56:03 +02:00
|
|
|
|
|
|
|
RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ );
|
2000-10-28 18:21:00 +02:00
|
|
|
}
|
|
|
|
|
Improve the performance of relation deletes during recovery.
When multiple relations are deleted in the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which scans the whole of shared_buffers only once. OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which scanned shared_buffers
multiple times. This could greatly increase the WAL replay
time, especially when shared_buffers was huge.
To alleviate this situation, this commit changes recovery so that
it also calls smgrdounlinkall() only once to delete multiple
relation files.
This is just a fix for an oversight in commit 279628a0a7, not a new feature.
So, per discussion on pgsql-hackers, we decided to backpatch this
to all supported versions.
Author: Fujii Masao
Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
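The saving described in this message can be illustrated with a toy model. The sketch below is not PostgreSQL code: `ToyBuffer`, `drop_one_by_one()` and `drop_batched()` are hypothetical names, and the buffer pool is a plain array. It only counts how many full passes over the pool each strategy makes.

```c
#include <stddef.h>

/* Toy buffer: records which relation owns it; 0 means free. */
typedef struct ToyBuffer
{
    unsigned rel;
} ToyBuffer;

/* One full pool scan per relation. Returns the number of scans made. */
static int
drop_one_by_one(ToyBuffer *pool, size_t npool,
                const unsigned *rels, size_t nrels)
{
    int scans = 0;

    for (size_t r = 0; r < nrels; r++)
    {
        for (size_t b = 0; b < npool; b++)
            if (pool[b].rel == rels[r])
                pool[b].rel = 0;
        scans++;
    }
    return scans;
}

/* A single pool scan that checks each buffer against the whole list. */
static int
drop_batched(ToyBuffer *pool, size_t npool,
             const unsigned *rels, size_t nrels)
{
    for (size_t b = 0; b < npool; b++)
    {
        for (size_t r = 0; r < nrels; r++)
        {
            if (pool[b].rel == rels[r])
            {
                pool[b].rel = 0;
                break;
            }
        }
    }
    return 1;
}
```

Both functions leave the pool in the same state. The batched form does more comparisons per buffer but walks the pool once, which is the trade that matters when the pool is much larger than the list of relations.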
2018-07-04 19:21:15 +02:00
|
|
|
/*
|
|
|
|
* DropRelationFiles -- drop files of all given relations
|
|
|
|
*/
|
|
|
|
void
|
2022-07-06 17:39:09 +02:00
|
|
|
DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
|
2018-07-04 19:21:15 +02:00
|
|
|
{
|
|
|
|
SMgrRelation *srels;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
srels = palloc(sizeof(SMgrRelation) * ndelrels);
|
|
|
|
for (i = 0; i < ndelrels; i++)
|
|
|
|
{
|
|
|
|
SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
|
|
|
|
|
|
|
|
if (isRedo)
|
|
|
|
{
|
|
|
|
ForkNumber fork;
|
|
|
|
|
|
|
|
for (fork = 0; fork <= MAX_FORKNUM; fork++)
|
|
|
|
XLogDropRelation(delrels[i], fork);
|
|
|
|
}
|
|
|
|
srels[i] = srel;
|
|
|
|
}
|
|
|
|
|
|
|
|
smgrdounlinkall(srels, ndelrels, isRedo);
|
|
|
|
|
2019-03-27 02:39:39 +01:00
|
|
|
for (i = 0; i < ndelrels; i++)
|
2018-07-04 19:21:15 +02:00
|
|
|
smgrclose(srels[i]);
|
|
|
|
pfree(srels);
|
|
|
|
}
|
|
|
|
|
2007-01-17 17:25:01 +01:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* _fdvec_resize() -- Resize the fork's open segments array
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2016-09-09 02:02:43 +02:00
|
|
|
static void
|
|
|
|
_fdvec_resize(SMgrRelation reln,
|
|
|
|
ForkNumber forknum,
|
|
|
|
int nseg)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2016-09-09 02:02:43 +02:00
|
|
|
if (nseg == 0)
|
|
|
|
{
|
|
|
|
if (reln->md_num_open_segs[forknum] > 0)
|
|
|
|
{
|
|
|
|
pfree(reln->md_seg_fds[forknum]);
|
|
|
|
reln->md_seg_fds[forknum] = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else if (reln->md_num_open_segs[forknum] == 0)
|
|
|
|
{
|
|
|
|
reln->md_seg_fds[forknum] =
|
|
|
|
MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
2020-01-11 03:31:22 +01:00
|
|
|
* It doesn't seem worthwhile complicating the code to amortize
|
|
|
|
* repalloc() calls. Those are far faster than PathNameOpenFile() or
|
|
|
|
* FileClose(), and the memory context internally will sometimes avoid
|
|
|
|
* doing an actual reallocation.
|
2016-09-09 02:02:43 +02:00
|
|
|
*/
|
|
|
|
reln->md_seg_fds[forknum] =
|
|
|
|
repalloc(reln->md_seg_fds[forknum],
|
|
|
|
sizeof(MdfdVec) * nseg);
|
|
|
|
}
|
|
|
|
|
|
|
|
reln->md_num_open_segs[forknum] = nseg;
|
1997-05-22 19:08:35 +02:00
|
|
|
}
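The three cases handled by _fdvec_resize() (shrink to nothing, first allocation, grow or shrink an existing array) can be sketched outside PostgreSQL as follows. This is an illustration only: it uses the C library allocator instead of memory contexts, and `SegVec` and `segvec_resize()` are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical stand-in for one fork's array of open segments. */
typedef struct SegVec
{
    int *fds;    /* one entry per open segment */
    int  nopen;  /* number of entries in fds */
} SegVec;

/* Resize the array to nseg entries. Returns 0 on success, -1 on OOM. */
static int
segvec_resize(SegVec *v, int nseg)
{
    if (nseg == 0)
    {
        /* Nothing left open: release the array if there is one. */
        if (v->nopen > 0)
        {
            free(v->fds);
            v->fds = NULL;
        }
    }
    else if (v->nopen == 0)
    {
        /* First allocation for this fork. */
        v->fds = malloc(sizeof(int) * (size_t) nseg);
        if (v->fds == NULL)
            return -1;
    }
    else
    {
        /* Existing array: let the allocator grow or shrink it. */
        int *tmp = realloc(v->fds, sizeof(int) * (size_t) nseg);

        if (tmp == NULL)
            return -1;
        v->fds = tmp;
    }

    v->nopen = nseg;
    return 0;
}
```

Unlike palloc-style allocation, malloc() and realloc() report failure through a NULL return, so the sketch returns an error code where the original relies on the allocator raising an error.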
|
|
|
|
|
|
|
|
/*
|
2009-08-05 20:01:54 +02:00
|
|
|
* Return the filename for the specified segment of the relation. The
|
|
|
|
* returned string is palloc'd.
|
1997-05-22 19:08:35 +02:00
|
|
|
*/
|
2009-08-05 20:01:54 +02:00
|
|
|
static char *
|
|
|
|
_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2009-08-05 20:01:54 +02:00
|
|
|
char *path,
|
|
|
|
*fullpath;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2022-07-06 17:39:09 +02:00
|
|
|
path = relpath(reln->smgr_rlocator, forknum);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
if (segno > 0)
|
|
|
|
{
|
2014-01-07 03:30:26 +01:00
|
|
|
fullpath = psprintf("%s.%u", path, segno);
|
2000-04-09 06:43:20 +02:00
|
|
|
pfree(path);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
fullpath = path;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2009-08-05 20:01:54 +02:00
|
|
|
return fullpath;
|
|
|
|
}
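The naming rule applied by _mdfd_segpath() is that segment 0 uses the relation's base path unchanged, while later segments append ".N". A minimal sketch of that rule, assuming a caller-supplied buffer instead of palloc'd memory (`seg_path()` and the example path are hypothetical):

```c
#include <stdio.h>

/*
 * Write the path of segment segno into buf. Returns the value of
 * snprintf(), i.e. the length the full path needs.
 */
static int
seg_path(char *buf, size_t buflen, const char *base, unsigned segno)
{
    if (segno > 0)
        return snprintf(buf, buflen, "%s.%u", base, segno);

    /* The first segment has no numeric suffix. */
    return snprintf(buf, buflen, "%s", base);
}
```

For a base path such as "base/5/16384", segment 0 maps to "base/5/16384" and segment 2 maps to "base/5/16384.2".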
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Open the specified segment of the relation,
|
|
|
|
* and make a MdfdVec object for it. Returns NULL on failure.
|
|
|
|
*/
|
|
|
|
static MdfdVec *
|
|
|
|
_mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
|
|
|
|
int oflags)
|
|
|
|
{
|
|
|
|
MdfdVec *v;
|
2020-05-21 17:31:16 +02:00
|
|
|
File fd;
|
2009-08-05 20:01:54 +02:00
|
|
|
char *fullpath;
|
|
|
|
|
|
|
|
fullpath = _mdfd_segpath(reln, forknum, segno);
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/* open the file */
|
2023-04-08 01:04:49 +02:00
|
|
|
fd = PathNameOpenFile(fullpath, _mdfd_open_flags() | oflags);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2000-04-09 06:43:20 +02:00
|
|
|
pfree(fullpath);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
if (fd < 0)
|
2004-01-07 19:56:30 +01:00
|
|
|
return NULL;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2020-01-26 21:05:27 +01:00
|
|
|
/*
|
|
|
|
* Segments are always opened in order from lowest to highest, so we must
|
|
|
|
* be adding a new one at the end.
|
|
|
|
*/
|
|
|
|
Assert(segno == reln->md_num_open_segs[forknum]);
|
|
|
|
|
|
|
|
_fdvec_resize(reln, forknum, segno + 1);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/* fill the entry */
|
2016-09-09 02:02:43 +02:00
|
|
|
v = &reln->md_seg_fds[forknum][segno];
|
1996-07-09 08:22:35 +02:00
|
|
|
v->mdfd_vfd = fd;
|
2004-05-31 05:48:10 +02:00
|
|
|
v->mdfd_segno = segno;
|
2016-09-09 02:02:43 +02:00
|
|
|
|
2008-08-11 13:05:11 +02:00
|
|
|
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
|
1996-07-09 08:22:35 +02:00
|
|
|
|
|
|
|
/* all done */
|
|
|
|
return v;
|
|
|
|
}
|
1999-09-02 04:57:50 +02:00
|
|
|
|
2004-01-06 19:07:32 +01:00
|
|
|
/*
|
2023-05-19 10:52:04 +02:00
|
|
|
* _mdfd_getseg() -- Find the segment of the relation holding the
|
|
|
|
* specified block.
|
2007-01-03 19:11:01 +01:00
|
|
|
*
|
|
|
|
* If the segment doesn't exist, we ereport, return NULL, or create the
|
2010-08-13 22:10:54 +02:00
|
|
|
* segment, according to "behavior". Note: skipFsync is only used in the
|
|
|
|
* EXTENSION_CREATE case.
|
2004-01-06 19:07:32 +01:00
|
|
|
*/
|
1999-09-02 04:57:50 +02:00
|
|
|
static MdfdVec *
|
2008-08-11 13:05:11 +02:00
|
|
|
_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
|
2016-05-04 10:54:20 +02:00
|
|
|
bool skipFsync, int behavior)
|
1999-09-02 04:57:50 +02:00
|
|
|
{
|
2016-09-09 02:02:43 +02:00
|
|
|
MdfdVec *v;
|
2007-01-03 19:11:01 +01:00
|
|
|
BlockNumber targetseg;
|
2004-05-31 05:48:10 +02:00
|
|
|
BlockNumber nextsegno;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2016-06-03 16:13:36 +02:00
|
|
|
/* some way to handle non-existent segments needs to be specified */
|
2016-05-04 10:54:20 +02:00
|
|
|
Assert(behavior &
|
2022-05-07 06:19:42 +02:00
|
|
|
(EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL |
|
|
|
|
EXTENSION_DONT_OPEN));
|
2016-05-04 10:54:20 +02:00
|
|
|
|
2007-01-03 19:11:01 +01:00
|
|
|
targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
|
2016-09-09 02:02:43 +02:00
|
|
|
|
|
|
|
/* if an existing and opened segment, we're done */
|
|
|
|
if (targetseg < reln->md_num_open_segs[forknum])
|
2004-05-31 05:48:10 +02:00
|
|
|
{
|
2016-09-09 02:02:43 +02:00
|
|
|
v = &reln->md_seg_fds[forknum][targetseg];
|
|
|
|
return v;
|
|
|
|
}
|
2007-01-03 19:11:01 +01:00
|
|
|
|
2022-05-07 06:19:42 +02:00
|
|
|
/* The caller only wants the segment if we already had it open. */
|
|
|
|
if (behavior & EXTENSION_DONT_OPEN)
|
|
|
|
return NULL;
|
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
/*
|
|
|
|
* The target segment is not yet open. Iterate over all the segments
|
|
|
|
* between the last opened and the target segment. This way missing
|
|
|
|
* segments either raise an error, or get created (according to
|
|
|
|
* 'behavior'). Start with either the last opened, or the first segment if
|
|
|
|
* none was opened before.
|
|
|
|
*/
|
|
|
|
if (reln->md_num_open_segs[forknum] > 0)
|
|
|
|
v = &reln->md_seg_fds[forknum][reln->md_num_open_segs[forknum] - 1];
|
|
|
|
else
|
|
|
|
{
|
2019-07-17 02:14:08 +02:00
|
|
|
v = mdopenfork(reln, forknum, behavior);
|
2016-09-09 02:02:43 +02:00
|
|
|
if (!v)
|
|
|
|
return NULL; /* if behavior & EXTENSION_RETURN_NULL */
|
|
|
|
}
|
|
|
|
|
|
|
|
for (nextsegno = reln->md_num_open_segs[forknum];
|
|
|
|
nextsegno <= targetseg; nextsegno++)
|
|
|
|
{
|
|
|
|
BlockNumber nblocks = _mdnblocks(reln, forknum, v);
|
|
|
|
int flags = 0;
|
Don't open formally non-existent segments in _mdfd_getseg().
Before this commit, _mdfd_getseg(), in contrast to mdnblocks(), did not
verify whether all segments leading up to the to-be-opened one were
RELSEG_SIZE sized. That is e.g. not the case after truncating a
relation, because later segments just get truncated to zero length, not
removed.
Once a "non-existent" segment has been opened in a session, mdnblocks()
will return wrong results, causing errors like "could not read block %u
in file" when accessing blocks. Closing the session, or the later
arrival of relevant invalidation messages, would "fix" the problem.
That, so far, was mostly harmless, because most segment accesses are
only done after an mdnblocks() call. But since 428b1d6b29ca we try to
open segments that might have been deleted, to trigger kernel writeback
from a backend's queue of recent writes.
To fix, check segment sizes in _mdfd_getseg() when opening previously
unopened segments. In practice this shouldn't imply a lot of additional
lseek() calls, because mdnblocks() will most of the time already have
opened all relevant segments.
This commit also fixes a second problem, namely that _mdfd_getseg(
EXTENSION_RETURN_NULL) extends files during recovery, which is not
desirable for the mdwriteback() case. Add EXTENSION_REALLY_RETURN_NULL,
which does not behave that way, and use it.
Reported-By: Thom Brown
Author: Andres Freund, Abhijit Menon-Sen
Reviewed-By: Robert Haas, Fabien Coelho
Discussion: CAA-aLv6Dp_ZsV-44QA-2zgkqWKQq=GedBX2dRSrWpxqovXK=Pg@mail.gmail.com
Fixes: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
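The rule this message relies on is that a segment formally exists only if every segment before it is exactly RELSEG_SIZE blocks long. The following standalone sketch shows that rule; it is not the md.c implementation. `formally_existing_segments()` is a hypothetical function, and the RELSEG_SIZE value is the assumed default of 131072 blocks (1 GB of 8 kB blocks).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Assumed default segment size in blocks. */
#define RELSEG_SIZE 131072

/*
 * Given the block counts of the segment files found on disk, in order,
 * return how many of them formally exist. The walk stops at the first
 * segment that is not full: that one is the last real segment, and any
 * files after it are leftovers (for example, truncated to zero length).
 */
static size_t
formally_existing_segments(const BlockNumber *seg_blocks, size_t nfiles)
{
    size_t n = 0;

    while (n < nfiles)
    {
        n++;
        if (seg_blocks[n - 1] < (BlockNumber) RELSEG_SIZE)
            break;
    }
    return n;
}
```

With file sizes {RELSEG_SIZE, 10, 0}, only the first two segments count; the zero-length third file is the kind of leftover that should not be opened.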
2016-04-27 05:32:51 +02:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
Assert(nextsegno == v->mdfd_segno + 1);
|
|
|
|
|
|
|
|
if (nblocks > ((BlockNumber) RELSEG_SIZE))
|
|
|
|
elog(FATAL, "segment too big");
|
2016-04-27 05:32:51 +02:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
if ((behavior & EXTENSION_CREATE) ||
|
|
|
|
(InRecovery && (behavior & EXTENSION_CREATE_RECOVERY)))
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Normally we will create new segments only if authorized by the
|
|
|
|
* caller (i.e., we are doing mdextend()). But when doing WAL
|
|
|
|
* recovery, create segments anyway; this allows cases such as
|
|
|
|
* replaying WAL data that has a write into a high-numbered
|
|
|
|
* segment of a relation that was later deleted. We want to go
|
|
|
|
* ahead and create the segments so we can finish out the replay.
|
|
|
|
*
|
|
|
|
* We have to maintain the invariant that segments before the last
|
|
|
|
* active segment are of size RELSEG_SIZE; therefore, if
|
|
|
|
* extending, pad them out with zeroes if needed. (This only
|
|
|
|
* matters if in recovery, or if the caller is extending the
|
|
|
|
* relation discontiguously, but that can happen in hash indexes.)
|
|
|
|
*/
|
|
|
|
if (nblocks < ((BlockNumber) RELSEG_SIZE))
|
2007-01-03 19:11:01 +01:00
|
|
|
{
|
2023-04-08 00:38:09 +02:00
|
|
|
char *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
|
|
|
|
MCXT_ALLOC_ZERO);
|
2007-01-03 19:11:01 +01:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
mdextend(reln, forknum,
|
|
|
|
nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
|
|
|
|
zerobuf, skipFsync);
|
|
|
|
pfree(zerobuf);
|
2007-01-03 19:11:01 +01:00
|
|
|
}
|
2016-09-09 02:02:43 +02:00
|
|
|
flags = O_CREAT;
|
|
|
|
}
|
|
|
|
else if (!(behavior & EXTENSION_DONT_CHECK_SIZE) &&
|
|
|
|
nblocks < ((BlockNumber) RELSEG_SIZE))
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* When not extending (or explicitly including truncated
|
|
|
|
* segments), only open the next segment if the current one is
|
|
|
|
* exactly RELSEG_SIZE. If not (this branch), either return NULL
|
|
|
|
* or fail.
|
|
|
|
*/
|
|
|
|
if (behavior & EXTENSION_RETURN_NULL)
|
2007-01-03 19:11:01 +01:00
|
|
|
{
|
2016-04-27 05:32:51 +02:00
|
|
|
/*
|
2016-09-09 02:02:43 +02:00
|
|
|
* Some callers discern between reasons for _mdfd_getseg()
|
|
|
|
* returning NULL based on errno. As there's no failing
|
|
|
|
* syscall involved in this case, explicitly set errno to
|
|
|
|
* ENOENT, as that seems the closest interpretation.
|
2016-04-27 05:32:51 +02:00
|
|
|
*/
|
2016-09-09 02:02:43 +02:00
|
|
|
errno = ENOENT;
|
|
|
|
return NULL;
|
2007-01-03 19:11:01 +01:00
|
|
|
}
|
2016-04-27 05:32:51 +02:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not open file \"%s\" (target block %u): previous segment is only %u blocks",
|
|
|
|
_mdfd_segpath(reln, forknum, nextsegno),
|
|
|
|
blkno, nblocks)));
|
|
|
|
}
|
2016-04-27 05:32:51 +02:00
|
|
|
|
2016-09-09 02:02:43 +02:00
|
|
|
v = _mdfd_openseg(reln, forknum, nextsegno, flags);
|
|
|
|
|
|
|
|
if (v == NULL)
|
|
|
|
{
|
|
|
|
if ((behavior & EXTENSION_RETURN_NULL) &&
|
|
|
|
FILE_POSSIBLY_DELETED(errno))
|
|
|
|
return NULL;
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not open file \"%s\" (target block %u): %m",
|
|
|
|
_mdfd_segpath(reln, forknum, nextsegno),
|
2004-02-10 02:55:27 +01:00
|
|
|
blkno)));
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
2016-09-09 02:02:43 +02:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
return v;
|
|
|
|
}
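_mdfd_getseg() starts by mapping a block number to its segment with an integer division by RELSEG_SIZE. The sketch below shows that mapping together with the matching offset inside the segment. It is an illustration only: the helper names are hypothetical, and RELSEG_SIZE is set to the assumed default of 131072 blocks.

```c
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Assumed default: 1 GB segments of 8 kB blocks. */
#define RELSEG_SIZE 131072

/* Segment file that holds the given block. */
static BlockNumber
target_segment(BlockNumber blkno)
{
    return blkno / ((BlockNumber) RELSEG_SIZE);
}

/* Position of the block inside its segment file. */
static BlockNumber
block_in_segment(BlockNumber blkno)
{
    return blkno % ((BlockNumber) RELSEG_SIZE);
}
```

Under these assumptions, block 131071 is the last block of segment 0 and block 131072 is the first block of segment 1.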
|
|
|
|
|
2000-04-11 01:41:52 +02:00
|
|
|
/*
|
2004-02-10 02:55:27 +01:00
|
|
|
* Get number of blocks present in a single disk file
|
2000-04-09 06:43:20 +02:00
|
|
|
*/
|
1996-07-09 08:22:35 +02:00
|
|
|
static BlockNumber
|
2008-08-11 13:05:11 +02:00
|
|
|
_mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2008-03-10 21:06:27 +01:00
|
|
|
off_t len;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2018-11-06 21:51:50 +01:00
|
|
|
len = FileSize(seg->mdfd_vfd);
|
1999-10-06 08:38:04 +02:00
|
|
|
if (len < 0)
|
2007-01-03 19:11:01 +01:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2009-08-05 20:01:54 +02:00
|
|
|
errmsg("could not seek to end of file \"%s\": %m",
|
|
|
|
FilePathName(seg->mdfd_vfd))));
|
2007-01-03 19:11:01 +01:00
|
|
|
/* note that this calculation will ignore any partial block at EOF */
|
|
|
|
return (BlockNumber) (len / BLCKSZ);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
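The calculation in _mdnblocks() is an integer division of the file length by the block size, so a partial block at the end of the file is not counted. A minimal sketch, assuming the default BLCKSZ of 8192 bytes (`blocks_in_file()` is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Assumed default block size in bytes. */
#define BLCKSZ 8192

/* Whole blocks contained in a file of len bytes; any remainder is ignored. */
static BlockNumber
blocks_in_file(int64_t len)
{
    assert(len >= 0);
    return (BlockNumber) (len / BLCKSZ);
}
```

A file of 8191 bytes therefore holds zero blocks, and a file of three full blocks plus 100 trailing bytes holds three.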
|
2019-04-04 10:56:03 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Sync a file to disk, given a file tag. Write the path into an output
|
|
|
|
* buffer so the caller can use it in error messages.
|
|
|
|
*
|
|
|
|
* Return 0 on success, -1 on failure, with errno set.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
mdsyncfiletag(const FileTag *ftag, char *path)
|
|
|
|
{
|
2022-07-06 17:39:09 +02:00
	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
	File		file;
	instr_time	io_start;
	bool		need_to_close;
	int			result,
				save_errno;

	/* See if we already have the file open, or need to open it. */
	if (ftag->segno < reln->md_num_open_segs[ftag->forknum])
	{
		file = reln->md_seg_fds[ftag->forknum][ftag->segno].mdfd_vfd;
		strlcpy(path, FilePathName(file), MAXPGPATH);
		need_to_close = false;
	}
	else
	{
		char	   *p;

		p = _mdfd_segpath(reln, ftag->forknum, ftag->segno);
		strlcpy(path, p, MAXPGPATH);
		pfree(p);

		file = PathNameOpenFile(path, _mdfd_open_flags());
		if (file < 0)
			return -1;
		need_to_close = true;
	}

	io_start = pgstat_prepare_io_time();

	/* Sync the file. */
	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
	save_errno = errno;

	if (need_to_close)
		FileClose(file);

	pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
							IOOP_FSYNC, io_start, 1);

	errno = save_errno;
	return result;
}
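
The errno handling in mdsyncfiletag() can be tried outside PostgreSQL. The following is a minimal sketch, assuming plain POSIX open(), fsync() and close() in place of the PathNameOpenFile(), FileSync() and FileClose() virtual-file calls. The name sync_path_preserving_errno is hypothetical and does not exist in md.c. It shows only the pattern of remembering errno right after the sync, so that cleanup cannot overwrite the value the caller will report.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the "open, sync, close" path of a sync handler.
 * Returns 0 on success, -1 on failure, with errno describing the open or
 * fsync failure rather than anything that happened during cleanup.
 */
int
sync_path_preserving_errno(const char *path)
{
	int			fd;
	int			result;
	int			save_errno;

	assert(path != NULL);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;				/* errno was set by open() */

	result = fsync(fd);
	save_errno = errno;			/* remember it before any cleanup runs */

	close(fd);					/* may itself modify errno */

	errno = save_errno;			/* hand the fsync outcome back to the caller */
	return result;
}
```

A caller would check the return value and, on -1, format an error message from errno, for example with strerror(errno).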

/*
 * Unlink a file, given a file tag.  Write the path into an output
 * buffer so the caller can use it in error messages.
 *
 * Return 0 on success, -1 on failure, with errno set.
 */
int
mdunlinkfiletag(const FileTag *ftag, char *path)
{
	char	   *p;

	/* Compute the path. */
	p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
	strlcpy(path, p, MAXPGPATH);
	pfree(p);

	/* Try to unlink the file. */
	return unlink(path);
}
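
The "write the path into an output buffer" convention used by mdunlinkfiletag() can be sketched in isolation. This is an illustration under stated assumptions, not the md.c code: DEMO_MAXPATH stands in for MAXPGPATH, snprintf() stands in for strlcpy() (which is not available in every C library), and demo_unlink_reporting_path is a hypothetical name. The point is that the path is copied out before unlink() is attempted, so the caller has it available even when the call fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#define DEMO_MAXPATH 1024		/* assumed buffer size, standing in for MAXPGPATH */

/*
 * Copy the target path into the caller's buffer (at least DEMO_MAXPATH
 * bytes), then try to unlink it.  Returns 0 on success, -1 on failure,
 * with errno set by unlink().
 */
int
demo_unlink_reporting_path(const char *target, char *path_out)
{
	assert(target != NULL && path_out != NULL);

	/* Bounded copy; the result is always NUL-terminated. */
	snprintf(path_out, DEMO_MAXPATH, "%s", target);

	return unlink(path_out);
}
```

On failure a caller can report both pieces of information, for example "could not remove file" followed by the buffer contents and strerror(errno).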

/*
 * Check if a given candidate request matches a given tag, when processing
 * a SYNC_FILTER_REQUEST request.  This will be called for all pending
 * requests to find out whether to forget them.
 */
bool
mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
{
	/*
	 * For now we only use filter requests as a way to drop all scheduled
	 * callbacks relating to a given database, when dropping the database.
	 * We'll return true for all candidates that have the same database OID as
	 * the ftag from the SYNC_FILTER_REQUEST request, so they're forgotten.
	 */
	return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
}
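
How a filter request uses such a match function can be sketched with a toy pending-request array. Everything below is hypothetical and simplified: DemoTag is a cut-down stand-in for FileTag, and demo_forget_matching is an assumed driver loop, not the actual sync machinery. It shows only the idea that every pending request is tested against the filter tag and the ones that match on database OID are dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a file tag: database OID plus a relation number. */
typedef struct DemoTag
{
	unsigned int db_oid;
	unsigned int rel_number;
} DemoTag;

/* A candidate matches the filter when it belongs to the same database. */
bool
demo_tag_matches(const DemoTag *filter, const DemoTag *candidate)
{
	return filter->db_oid == candidate->db_oid;
}

/*
 * Forget every pending request that matches the filter, compacting the
 * array in place.  Returns the number of requests that remain.
 */
size_t
demo_forget_matching(DemoTag *pending, size_t n, const DemoTag *filter)
{
	size_t		kept = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (!demo_tag_matches(filter, &pending[i]))
			pending[kept++] = pending[i];
	}
	return kept;
}
```

With four pending requests, two of which belong to database 5, filtering on database 5 leaves the other two in their original order.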
|