1996-07-09 08:22:35 +02:00
|
|
|
/*-------------------------------------------------------------------------
|
|
|
|
*
|
1999-02-14 00:22:53 +01:00
|
|
|
* hio.c
|
1996-07-09 08:22:35 +02:00
|
|
|
* POSTGRES heap access method input/output code.
|
|
|
|
*
|
2023-01-02 21:00:37 +01:00
|
|
|
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
|
2000-01-26 06:58:53 +01:00
|
|
|
* Portions Copyright (c) 1994, Regents of the University of California
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
|
|
|
*
|
|
|
|
* IDENTIFICATION
|
2010-09-20 22:08:53 +02:00
|
|
|
* src/backend/access/heap/hio.c
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
|
|
|
*-------------------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
|
1999-07-16 01:04:24 +02:00
|
|
|
#include "postgres.h"
|
1996-10-20 10:32:11 +02:00
|
|
|
|
2008-11-06 21:51:15 +01:00
|
|
|
#include "access/heapam.h"
|
1999-07-16 07:00:38 +02:00
|
|
|
#include "access/hio.h"
|
2012-08-30 22:15:44 +02:00
|
|
|
#include "access/htup_details.h"
|
Make the visibility map crash-safe.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
2011-06-22 05:04:40 +02:00
|
|
|
#include "access/visibilitymap.h"
|
2008-06-09 00:00:48 +02:00
|
|
|
#include "storage/bufmgr.h"
|
2001-06-29 23:08:25 +02:00
|
|
|
#include "storage/freespace.h"
|
2008-05-12 02:00:54 +02:00
|
|
|
#include "storage/lmgr.h"
|
2010-02-09 22:43:30 +01:00
|
|
|
#include "storage/smgr.h"
|
2001-06-29 23:08:25 +02:00
|
|
|
|
1996-10-21 07:59:49 +02:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2000-07-03 04:54:21 +02:00
|
|
|
* RelationPutHeapTuple - place tuple at specified page
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
2003-07-21 22:29:40 +02:00
|
|
|
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
|
1998-12-15 13:47:01 +01:00
|
|
|
*
|
2001-07-14 00:52:58 +02:00
|
|
|
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
RelationPutHeapTuple(Relation relation,
|
1998-12-15 13:47:01 +01:00
|
|
|
Buffer buffer,
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
HeapTuple tuple,
|
|
|
|
bool token)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
1998-12-15 13:47:01 +01:00
|
|
|
Page pageHeader;
|
|
|
|
OffsetNumber offnum;
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* A tuple that's being inserted speculatively should already have its
|
|
|
|
* token set.
|
|
|
|
*/
|
|
|
|
Assert(!token || HeapTupleHeaderIsSpeculative(tuple->t_data));
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2020-10-22 14:44:18 +02:00
|
|
|
/*
|
|
|
|
* Do not allow tuples with invalid combinations of hint bits to be placed
|
2021-02-23 21:30:21 +01:00
|
|
|
* on a page. This combination is detected as corruption by the
|
|
|
|
* contrib/amcheck logic, so if you disable this assertion, make
|
|
|
|
* corresponding changes there.
|
2020-10-22 14:44:18 +02:00
|
|
|
*/
|
|
|
|
Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_COMMITTED) &&
|
|
|
|
(tuple->t_data->t_infomask & HEAP_XMAX_IS_MULTI)));
|
|
|
|
|
2001-07-14 00:52:58 +02:00
|
|
|
/* Add the tuple to the page */
|
2016-04-20 15:31:19 +02:00
|
|
|
pageHeader = BufferGetPage(buffer);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2001-07-14 00:52:58 +02:00
|
|
|
offnum = PageAddItem(pageHeader, (Item) tuple->t_data,
|
2007-09-20 19:56:33 +02:00
|
|
|
tuple->t_len, InvalidOffsetNumber, false, true);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2000-07-03 04:54:21 +02:00
|
|
|
if (offnum == InvalidOffsetNumber)
|
2003-07-21 22:29:40 +02:00
|
|
|
elog(PANIC, "failed to add tuple to page");
|
2000-07-03 04:54:21 +02:00
|
|
|
|
2001-07-14 00:52:58 +02:00
|
|
|
/* Update tuple->t_self to the actual position where it was stored */
|
|
|
|
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
/*
|
|
|
|
* Insert the correct position into CTID of the stored tuple, too (unless
|
|
|
|
* this is a speculative insertion, in which case the token is held in
|
|
|
|
* CTID field instead)
|
|
|
|
*/
|
|
|
|
if (!token)
|
|
|
|
{
|
|
|
|
ItemId itemId = PageGetItemId(pageHeader, offnum);
|
2018-03-01 01:25:54 +01:00
|
|
|
HeapTupleHeader item = (HeapTupleHeader) PageGetItem(pageHeader, itemId);
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
|
2018-03-01 01:25:54 +01:00
|
|
|
item->t_ctid = tuple->t_self;
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
2008-11-06 21:51:15 +01:00
|
|
|
/*
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
* Read in a buffer in mode, using bulk-insert strategy if bistate isn't NULL.
|
2008-11-06 21:51:15 +01:00
|
|
|
*/
|
|
|
|
static Buffer
|
|
|
|
ReadBufferBI(Relation relation, BlockNumber targetBlock,
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
ReadBufferMode mode, BulkInsertState bistate)
|
2008-11-06 21:51:15 +01:00
|
|
|
{
|
|
|
|
Buffer buffer;
|
|
|
|
|
|
|
|
/* If not bulk-insert, exactly like ReadBuffer */
|
|
|
|
if (!bistate)
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
return ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
|
|
|
|
mode, NULL);
|
2008-11-06 21:51:15 +01:00
|
|
|
|
|
|
|
/* If we have the desired block already pinned, re-pin and return it */
|
|
|
|
if (bistate->current_buf != InvalidBuffer)
|
|
|
|
{
|
|
|
|
if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
|
|
|
|
{
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
/*
|
|
|
|
* Currently the LOCK variants are only used for extending
|
|
|
|
* relation, which should never reach this branch.
|
|
|
|
*/
|
|
|
|
Assert(mode != RBM_ZERO_AND_LOCK &&
|
|
|
|
mode != RBM_ZERO_AND_CLEANUP_LOCK);
|
|
|
|
|
2008-11-06 21:51:15 +01:00
|
|
|
IncrBufferRefCount(bistate->current_buf);
|
|
|
|
return bistate->current_buf;
|
|
|
|
}
|
|
|
|
/* ... else drop the old buffer */
|
|
|
|
ReleaseBuffer(bistate->current_buf);
|
|
|
|
bistate->current_buf = InvalidBuffer;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Perform a read using the buffer strategy */
|
|
|
|
buffer = ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
mode, bistate->strategy);
|
2008-11-06 21:51:15 +01:00
|
|
|
|
|
|
|
/* Save the selected block as target for future inserts */
|
|
|
|
IncrBufferRefCount(buffer);
|
|
|
|
bistate->current_buf = buffer;
|
|
|
|
|
|
|
|
return buffer;
|
|
|
|
}
|
|
|
|
|
2011-06-27 19:55:55 +02:00
|
|
|
/*
|
|
|
|
* For each heap page which is all-visible, acquire a pin on the appropriate
|
|
|
|
* visibility map page, if we haven't already got one.
|
|
|
|
*
|
2023-04-06 19:27:30 +02:00
|
|
|
* To avoid complexity in the callers, either buffer1 or buffer2 may be
|
|
|
|
* InvalidBuffer if only one buffer is involved. For the same reason, block2
|
|
|
|
* may be smaller than block1.
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
*
|
|
|
|
* Returns whether buffer locks were temporarily released.
|
2011-06-27 19:55:55 +02:00
|
|
|
*/
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
static bool
|
2011-06-27 19:55:55 +02:00
|
|
|
GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
|
|
|
|
BlockNumber block1, BlockNumber block2,
|
|
|
|
Buffer *vmbuffer1, Buffer *vmbuffer2)
|
|
|
|
{
|
|
|
|
bool need_to_pin_buffer1;
|
|
|
|
bool need_to_pin_buffer2;
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
bool released_locks = false;
|
2011-06-27 19:55:55 +02:00
|
|
|
|
2023-04-06 19:27:30 +02:00
|
|
|
/*
|
|
|
|
* Swap buffers around to handle case of a single block/buffer, and to
|
|
|
|
* handle if lock ordering rules require to lock block2 first.
|
|
|
|
*/
|
|
|
|
if (!BufferIsValid(buffer1) ||
|
|
|
|
(BufferIsValid(buffer2) && block1 > block2))
|
|
|
|
{
|
|
|
|
Buffer tmpbuf = buffer1;
|
|
|
|
Buffer *tmpvmbuf = vmbuffer1;
|
|
|
|
BlockNumber tmpblock = block1;
|
|
|
|
|
|
|
|
buffer1 = buffer2;
|
|
|
|
vmbuffer1 = vmbuffer2;
|
|
|
|
block1 = block2;
|
|
|
|
|
|
|
|
buffer2 = tmpbuf;
|
|
|
|
vmbuffer2 = tmpvmbuf;
|
|
|
|
block2 = tmpblock;
|
|
|
|
}
|
|
|
|
|
2011-06-27 19:55:55 +02:00
|
|
|
Assert(BufferIsValid(buffer1));
|
2019-02-02 11:17:00 +01:00
|
|
|
Assert(buffer2 == InvalidBuffer || block1 <= block2);
|
2011-06-27 19:55:55 +02:00
|
|
|
|
|
|
|
while (1)
|
|
|
|
{
|
|
|
|
/* Figure out which pins we need but don't have. */
|
2016-04-20 15:31:19 +02:00
|
|
|
need_to_pin_buffer1 = PageIsAllVisible(BufferGetPage(buffer1))
|
2011-06-27 19:55:55 +02:00
|
|
|
&& !visibilitymap_pin_ok(block1, *vmbuffer1);
|
|
|
|
need_to_pin_buffer2 = buffer2 != InvalidBuffer
|
2016-04-20 15:31:19 +02:00
|
|
|
&& PageIsAllVisible(BufferGetPage(buffer2))
|
2011-06-27 19:55:55 +02:00
|
|
|
&& !visibilitymap_pin_ok(block2, *vmbuffer2);
|
|
|
|
if (!need_to_pin_buffer1 && !need_to_pin_buffer2)
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
break;
|
2011-06-27 19:55:55 +02:00
|
|
|
|
|
|
|
/* We must unlock both buffers before doing any I/O. */
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
released_locks = true;
|
2011-06-27 19:55:55 +02:00
|
|
|
LockBuffer(buffer1, BUFFER_LOCK_UNLOCK);
|
|
|
|
if (buffer2 != InvalidBuffer && buffer2 != buffer1)
|
|
|
|
LockBuffer(buffer2, BUFFER_LOCK_UNLOCK);
|
|
|
|
|
|
|
|
/* Get pins. */
|
|
|
|
if (need_to_pin_buffer1)
|
|
|
|
visibilitymap_pin(relation, block1, vmbuffer1);
|
|
|
|
if (need_to_pin_buffer2)
|
|
|
|
visibilitymap_pin(relation, block2, vmbuffer2);
|
|
|
|
|
|
|
|
/* Relock buffers. */
|
|
|
|
LockBuffer(buffer1, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
if (buffer2 != InvalidBuffer && buffer2 != buffer1)
|
|
|
|
LockBuffer(buffer2, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If there are two buffers involved and we pinned just one of them,
|
|
|
|
* it's possible that the second one became all-visible while we were
|
|
|
|
* busy pinning the first one. If it looks like that's a possible
|
|
|
|
* scenario, we'll need to make a second pass through this loop.
|
|
|
|
*/
|
|
|
|
if (buffer2 == InvalidBuffer || buffer1 == buffer2
|
|
|
|
|| (need_to_pin_buffer1 && need_to_pin_buffer2))
|
|
|
|
break;
|
|
|
|
}
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
|
|
|
|
return released_locks;
|
2011-06-27 19:55:55 +02:00
|
|
|
}
|
|
|
|
|
2016-04-08 08:04:46 +02:00
|
|
|
/*
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
* Extend the relation. By multiple pages, if beneficial.
|
|
|
|
*
|
|
|
|
* If the caller needs multiple pages (num_pages > 1), we always try to extend
|
|
|
|
* by at least that much.
|
|
|
|
*
|
|
|
|
* If there is contention on the extension lock, we don't just extend "for
|
|
|
|
* ourselves", but we try to help others. We can do so by adding empty pages
|
|
|
|
* into the FSM. Typically there is no contention when we can't use the FSM.
|
|
|
|
*
|
|
|
|
* We do have to limit the number of pages to extend by to some value, as the
|
|
|
|
* buffers for all the extended pages need to, temporarily, be pinned. For now
|
|
|
|
* we define MAX_BUFFERS_TO_EXTEND_BY to be 64 buffers, it's hard to see
|
|
|
|
* benefits with higher numbers. This partially is because copyfrom.c's
|
|
|
|
* MAX_BUFFERED_TUPLES / MAX_BUFFERED_BYTES prevents larger multi_inserts.
|
|
|
|
*
|
|
|
|
* Returns a buffer for a newly extended block. If possible, the buffer is
|
|
|
|
* returned exclusively locked. *did_unlock is set to true if the lock had to
|
|
|
|
* be released, false otherwise.
|
|
|
|
*
|
|
|
|
*
|
|
|
|
* XXX: It would likely be beneficial for some workloads to extend more
|
|
|
|
* aggressively, e.g. using a heuristic based on the relation size.
|
2016-04-08 08:04:46 +02:00
|
|
|
*/
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
static Buffer
|
|
|
|
RelationAddBlocks(Relation relation, BulkInsertState bistate,
|
|
|
|
int num_pages, bool use_fsm, bool *did_unlock)
|
2016-04-08 08:04:46 +02:00
|
|
|
{
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
#define MAX_BUFFERS_TO_EXTEND_BY 64
|
|
|
|
Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
|
|
|
|
BlockNumber first_block = InvalidBlockNumber;
|
|
|
|
BlockNumber last_block = InvalidBlockNumber;
|
|
|
|
uint32 extend_by_pages;
|
|
|
|
uint32 not_in_fsm_pages;
|
|
|
|
Buffer buffer;
|
|
|
|
Page page;
|
2016-04-08 08:04:46 +02:00
|
|
|
|
|
|
|
/*
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
* Determine by how many pages to try to extend by.
|
2016-04-08 08:04:46 +02:00
|
|
|
*/
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
if (bistate == NULL && !use_fsm)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If we have neither bistate, nor can use the FSM, we can't bulk
|
|
|
|
* extend - there'd be no way to find the additional pages.
|
|
|
|
*/
|
|
|
|
extend_by_pages = 1;
|
|
|
|
}
|
|
|
|
else
|
2016-04-08 08:04:46 +02:00
|
|
|
{
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
uint32 waitcount;
|
2018-03-29 18:22:37 +02:00
|
|
|
|
|
|
|
/*
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
* Try to extend at least by the number of pages the caller needs. We
|
|
|
|
* can remember the additional pages (either via FSM or bistate).
|
2018-03-29 18:22:37 +02:00
|
|
|
*/
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
extend_by_pages = num_pages;
|
2018-03-29 18:22:37 +02:00
|
|
|
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
if (!RELATION_IS_LOCAL(relation))
|
|
|
|
waitcount = RelationExtensionLockWaiterCount(relation);
|
|
|
|
else
|
|
|
|
waitcount = 0;
|
2018-03-29 18:22:37 +02:00
|
|
|
|
|
|
|
/*
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
* Multiply the number of pages to extend by the number of waiters. Do
|
|
|
|
* this even if we're not using the FSM, as it still relieves
|
|
|
|
* contention, by deferring the next time this backend needs to
|
|
|
|
* extend. In that case the extended pages will be found via
|
|
|
|
* bistate->next_free.
|
2018-03-29 18:22:37 +02:00
|
|
|
*/
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
extend_by_pages += extend_by_pages * waitcount;
|
2019-01-29 02:16:56 +01:00
|
|
|
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
/*
|
|
|
|
* Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
|
|
|
|
* them all concurrently.
|
|
|
|
*/
|
|
|
|
extend_by_pages = Min(extend_by_pages, MAX_BUFFERS_TO_EXTEND_BY);
|
|
|
|
}
|
2018-03-29 18:22:37 +02:00
|
|
|
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
/*
|
|
|
|
* How many of the extended pages should be entered into the FSM?
|
|
|
|
*
|
|
|
|
* If we have a bistate, only enter pages that we don't need ourselves
|
|
|
|
* into the FSM. Otherwise every other backend will immediately try to
|
|
|
|
* use the pages this backend needs for itself, causing unnecessary
|
|
|
|
* contention. If we don't have a bistate, we can't avoid the FSM.
|
|
|
|
*
|
|
|
|
* Never enter the page returned into the FSM, we'll immediately use it.
|
|
|
|
*/
|
|
|
|
if (num_pages > 1 && bistate == NULL)
|
|
|
|
not_in_fsm_pages = 1;
|
|
|
|
else
|
|
|
|
not_in_fsm_pages = num_pages;
|
2016-04-08 08:04:46 +02:00
|
|
|
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
/* prepare to put another buffer into the bistate */
|
|
|
|
if (bistate && bistate->current_buf != InvalidBuffer)
|
|
|
|
{
|
|
|
|
ReleaseBuffer(bistate->current_buf);
|
|
|
|
bistate->current_buf = InvalidBuffer;
|
|
|
|
}
|
2016-04-08 08:04:46 +02:00
|
|
|
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
/*
|
|
|
|
* Extend the relation. We ask for the first returned page to be locked,
|
|
|
|
* so that we are sure that nobody has inserted into the page
|
|
|
|
* concurrently.
|
|
|
|
*
|
|
|
|
* With the current MAX_BUFFERS_TO_EXTEND_BY there's no danger of
|
|
|
|
* [auto]vacuum trying to truncate later pages as REL_TRUNCATE_MINIMUM is
|
|
|
|
* way larger.
|
|
|
|
*/
|
|
|
|
first_block = ExtendBufferedRelBy(EB_REL(relation), MAIN_FORKNUM,
|
|
|
|
bistate ? bistate->strategy : NULL,
|
|
|
|
EB_LOCK_FIRST,
|
|
|
|
extend_by_pages,
|
|
|
|
victim_buffers,
|
|
|
|
&extend_by_pages);
|
|
|
|
buffer = victim_buffers[0]; /* the buffer the function will return */
|
|
|
|
last_block = first_block + (extend_by_pages - 1);
|
|
|
|
Assert(first_block == BufferGetBlockNumber(buffer));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Relation is now extended. Initialize the page. We do this here, before
|
|
|
|
* potentially releasing the lock on the page, because it allows us to
|
|
|
|
* double check that the page contents are empty (this should never
|
|
|
|
* happen, but if it does we don't want to risk wiping out valid data).
|
|
|
|
*/
|
|
|
|
page = BufferGetPage(buffer);
|
|
|
|
if (!PageIsNew(page))
|
|
|
|
elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
|
|
|
|
first_block,
|
|
|
|
RelationGetRelationName(relation));
|
|
|
|
|
|
|
|
PageInit(page, BufferGetPageSize(buffer), 0);
|
|
|
|
MarkBufferDirty(buffer);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we decided to put pages into the FSM, release the buffer lock (but
|
|
|
|
* not pin), we don't want to do IO while holding a buffer lock. This will
|
|
|
|
* necessitate a bit more extensive checking in our caller.
|
|
|
|
*/
|
|
|
|
if (use_fsm && not_in_fsm_pages < extend_by_pages)
|
|
|
|
{
|
|
|
|
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
|
|
|
|
*did_unlock = true;
|
2016-04-08 08:04:46 +02:00
|
|
|
}
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
else
|
|
|
|
*did_unlock = false;
|
2016-04-08 08:04:46 +02:00
|
|
|
|
|
|
|
/*
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
* Relation is now extended. Release pins on all buffers, except for the
|
|
|
|
* first (which we'll return). If we decided to put pages into the FSM,
|
|
|
|
* we can do that as part of the same loop.
|
2016-04-08 08:04:46 +02:00
|
|
|
*/
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
for (uint32 i = 1; i < extend_by_pages; i++)
|
|
|
|
{
|
|
|
|
BlockNumber curBlock = first_block + i;
|
|
|
|
|
|
|
|
Assert(curBlock == BufferGetBlockNumber(victim_buffers[i]));
|
|
|
|
Assert(BlockNumberIsValid(curBlock));
|
|
|
|
|
|
|
|
ReleaseBuffer(victim_buffers[i]);
|
|
|
|
|
|
|
|
if (use_fsm && i >= not_in_fsm_pages)
|
|
|
|
{
|
|
|
|
Size freespace = BufferGetPageSize(victim_buffers[i]) -
|
|
|
|
SizeOfPageHeaderData;
|
|
|
|
|
|
|
|
RecordPageWithFreeSpace(relation, curBlock, freespace);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (use_fsm && not_in_fsm_pages < extend_by_pages)
|
|
|
|
{
|
|
|
|
BlockNumber first_fsm_block = first_block + not_in_fsm_pages;
|
|
|
|
|
|
|
|
FreeSpaceMapVacuumRange(relation, first_fsm_block, last_block);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bistate)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Remember the additional pages we extended by, so we later can use
|
|
|
|
* them without looking into the FSM.
|
|
|
|
*/
|
|
|
|
if (extend_by_pages > 1)
|
|
|
|
{
|
|
|
|
bistate->next_free = first_block + 1;
|
|
|
|
bistate->last_free = last_block;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
bistate->next_free = InvalidBlockNumber;
|
|
|
|
bistate->last_free = InvalidBlockNumber;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* maintain bistate->current_buf */
|
|
|
|
IncrBufferRefCount(buffer);
|
|
|
|
bistate->current_buf = buffer;
|
|
|
|
}
|
|
|
|
|
|
|
|
return buffer;
|
|
|
|
#undef MAX_BUFFERS_TO_EXTEND_BY
|
2016-04-08 08:04:46 +02:00
|
|
|
}
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2000-07-03 04:54:21 +02:00
|
|
|
* RelationGetBufferForTuple
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
2001-06-29 23:08:25 +02:00
|
|
|
* Returns pinned and exclusive-locked buffer of a page in given relation
|
|
|
|
* with free space >= given len.
|
|
|
|
*
|
2023-04-06 23:18:24 +02:00
|
|
|
* If num_pages is > 1, we will try to extend the relation by at least that
|
|
|
|
* many pages when we decide to extend the relation. This is more efficient
|
|
|
|
* for callers that know they will need multiple pages
|
|
|
|
* (e.g. heap_multi_insert()).
|
|
|
|
*
|
2001-06-29 23:08:25 +02:00
|
|
|
* If otherBuffer is not InvalidBuffer, then it references a previously
|
|
|
|
* pinned buffer of another page in the same relation; on return, this
|
|
|
|
* buffer will also be exclusive-locked. (This case is used by heap_update;
|
|
|
|
* the otherBuffer contains the tuple being updated.)
|
2000-09-07 11:58:38 +02:00
|
|
|
*
|
2001-06-29 23:08:25 +02:00
|
|
|
* The reason for passing otherBuffer is that if two backends are doing
|
|
|
|
* concurrent heap_update operations, a deadlock could occur if they try
|
|
|
|
* to lock the same two buffers in opposite orders. To ensure that this
|
|
|
|
* can't happen, we impose the rule that buffers of a relation must be
|
|
|
|
* locked in increasing page number order. This is most conveniently done
|
|
|
|
* by having RelationGetBufferForTuple lock them both, with suitable care
|
|
|
|
* for ordering.
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
2001-06-29 23:08:25 +02:00
|
|
|
* NOTE: it is unlikely, but not quite impossible, for otherBuffer to be the
|
|
|
|
* same buffer we select for insertion of the new tuple (this could only
|
|
|
|
* happen if space is freed in that page after heap_update finds there's not
|
|
|
|
* enough there). In that case, the page will be pinned and locked only once.
|
|
|
|
*
|
Avoid improbable PANIC during heap_update.
heap_update needs to clear any existing "all visible" flag on
the old tuple's page (and on the new page too, if different).
Per coding rules, to do this it must acquire pin on the appropriate
visibility-map page while not holding exclusive buffer lock;
which creates a race condition since someone else could set the
flag whenever we're not holding the buffer lock. The code is
supposed to handle that by re-checking the flag after acquiring
buffer lock and retrying if it became set. However, one code
path through heap_update itself, as well as one in its subroutine
RelationGetBufferForTuple, failed to do this. The end result,
in the unlikely event that a concurrent VACUUM did set the flag
while we're transiently not holding lock, is a non-recurring
"PANIC: wrong buffer passed to visibilitymap_clear" failure.
This has been seen a few times in the buildfarm since recent VACUUM
changes that added code paths that could set the all-visible flag
while holding only exclusive buffer lock. Previously, the flag
was (usually?) set only after doing LockBufferForCleanup, which
would insist on buffer pin count zero, thus preventing the flag
from becoming set partway through heap_update. However, it's
clear that it's heap_update not VACUUM that's at fault here.
What's less clear is whether there is any hazard from these bugs
in released branches. heap_update is certainly violating API
expectations, but if there is no code path that can set all-visible
without a cleanup lock then it's only a latent bug. That's not
100% certain though, besides which we should worry about extensions
or future back-patch fixes that could introduce such code paths.
I chose to back-patch to v12. Fixing RelationGetBufferForTuple
before that would require also back-patching portions of older
fixes (notably 0d1fe9f74), which is more code churn than seems
prudent to fix a hypothetical issue.
Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
2021-04-13 18:17:24 +02:00
|
|
|
* We also handle the possibility that the all-visible flag will need to be
|
|
|
|
* cleared on one or both pages. If so, pin on the associated visibility map
|
|
|
|
* page must be acquired before acquiring buffer lock(s), to avoid possibly
|
|
|
|
* doing I/O while holding buffer locks. The pins are passed back to the
|
|
|
|
* caller using the input-output arguments vmbuffer and vmbuffer_other.
|
|
|
|
* Note that in some cases the caller might have already acquired such pins,
|
|
|
|
* which is indicated by these arguments not being InvalidBuffer on entry.
|
2011-09-27 15:30:23 +02:00
|
|
|
*
|
2008-11-06 21:51:15 +01:00
|
|
|
* We normally use FSM to help us find free space. However,
|
|
|
|
* if HEAP_INSERT_SKIP_FSM is specified, we just append a new empty page to
|
|
|
|
* the end of the relation if the tuple won't fit on the current target page.
|
2005-06-20 20:37:02 +02:00
|
|
|
* This can save some cycles when we know the relation is new and doesn't
|
|
|
|
* contain useful amounts of free space.
|
|
|
|
*
|
2008-11-06 21:51:15 +01:00
|
|
|
* HEAP_INSERT_SKIP_FSM is also useful for non-WAL-logged additions to a
|
2005-06-20 20:37:02 +02:00
|
|
|
* relation, if the caller holds exclusive lock and is careful to invalidate
|
2010-02-09 22:43:30 +01:00
|
|
|
* relation's smgr_targblock before the first insertion --- that ensures that
|
2005-06-20 20:37:02 +02:00
|
|
|
* all insertions will occur into newly added pages and not be intermixed
|
|
|
|
* with tuples from other transactions. That way, a crash can't risk losing
|
|
|
|
* any committed data of other transactions. (See heap_insert's comments
|
|
|
|
* for additional constraints needed for safe usage of this behavior.)
|
|
|
|
*
|
2008-11-06 21:51:15 +01:00
|
|
|
* The caller can also provide a BulkInsertState object to optimize many
|
|
|
|
* insertions into the same relation. This keeps a pin on the current
|
|
|
|
* insertion target page (to save pin/unpin cycles) and also passes a
|
|
|
|
* BULKWRITE buffer selection strategy object to the buffer manager.
|
|
|
|
* Passing NULL for bistate selects the default behavior.
|
|
|
|
*
|
2021-03-31 03:53:44 +02:00
|
|
|
* We don't fill existing pages further than the fillfactor, except for large
|
|
|
|
* tuples in nearly-empty pages. This is OK since this routine is not
|
|
|
|
* consulted when updating a tuple and keeping it on the same page, which is
|
|
|
|
* the scenario fillfactor is meant to reserve space for.
|
2006-07-04 00:45:41 +02:00
|
|
|
*
|
2003-07-21 22:29:40 +02:00
|
|
|
* ereport(ERROR) is allowed here, so this routine *must* be called
|
2001-05-17 00:35:12 +02:00
|
|
|
* before any (unlogged) changes are made in buffer pool.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-07-03 04:54:21 +02:00
|
|
|
Buffer
|
2001-05-17 00:35:12 +02:00
|
|
|
RelationGetBufferForTuple(Relation relation, Size len,
|
2008-11-06 21:51:15 +01:00
|
|
|
Buffer otherBuffer, int options,
|
2011-10-12 22:53:54 +02:00
|
|
|
BulkInsertState bistate,
|
2023-04-06 23:18:24 +02:00
|
|
|
Buffer *vmbuffer, Buffer *vmbuffer_other,
|
|
|
|
int num_pages)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2008-11-06 21:51:15 +01:00
|
|
|
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
|
2001-05-12 21:58:28 +02:00
|
|
|
Buffer buffer = InvalidBuffer;
|
2008-07-13 22:45:47 +02:00
|
|
|
Page page;
|
2021-03-31 03:53:44 +02:00
|
|
|
Size nearlyEmptyFreeSpace,
|
|
|
|
pageFreeSpace = 0,
|
|
|
|
saveFreeSpace = 0,
|
|
|
|
targetFreeSpace = 0;
|
2001-06-29 23:08:25 +02:00
|
|
|
BlockNumber targetBlock,
|
|
|
|
otherBlock;
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
bool unlockedTargetBuffer;
|
|
|
|
bool recheckVmPins;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2000-07-03 04:54:21 +02:00
|
|
|
len = MAXALIGN(len); /* be conservative */
|
1999-11-29 05:34:55 +01:00
|
|
|
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
/* if the caller doesn't know by how many pages to extend, extend by 1 */
|
|
|
|
if (num_pages <= 0)
|
|
|
|
num_pages = 1;
|
|
|
|
|
2008-11-06 21:51:15 +01:00
|
|
|
/* Bulk insert is not supported for updates, only inserts. */
|
|
|
|
Assert(otherBuffer == InvalidBuffer || !bistate);
|
|
|
|
|
1999-11-29 05:34:55 +01:00
|
|
|
/*
|
2000-07-03 04:54:21 +02:00
|
|
|
* If we're gonna fail for oversize tuple, do it right away
|
1999-11-29 05:34:55 +01:00
|
|
|
*/
|
2007-02-05 05:22:18 +01:00
|
|
|
if (len > MaxHeapTupleSize)
|
2003-07-21 22:29:40 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
|
2014-01-23 23:18:23 +01:00
|
|
|
errmsg("row is too big: size %zu, maximum size %zu",
|
|
|
|
len, MaxHeapTupleSize)));
|
1999-11-29 05:34:55 +01:00
|
|
|
|
2006-07-04 00:45:41 +02:00
|
|
|
/* Compute desired extra freespace due to fillfactor option */
|
|
|
|
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
|
|
|
|
HEAP_DEFAULT_FILLFACTOR);
|
|
|
|
|
2021-03-31 03:53:44 +02:00
|
|
|
/*
|
|
|
|
* Since pages without tuples can still have line pointers, we consider
|
|
|
|
* pages "empty" when the unavailable space is slight. This threshold is
|
|
|
|
* somewhat arbitrary, but it should prevent most unnecessary relation
|
|
|
|
* extensions while inserting large tuples into low-fillfactor tables.
|
|
|
|
*/
|
|
|
|
nearlyEmptyFreeSpace = MaxHeapTupleSize -
|
|
|
|
(MaxHeapTuplesPerPage / 8 * sizeof(ItemIdData));
|
|
|
|
if (len + saveFreeSpace > nearlyEmptyFreeSpace)
|
|
|
|
targetFreeSpace = Max(len, nearlyEmptyFreeSpace);
|
|
|
|
else
|
|
|
|
targetFreeSpace = len + saveFreeSpace;
|
|
|
|
|
2001-06-29 23:08:25 +02:00
|
|
|
if (otherBuffer != InvalidBuffer)
|
|
|
|
otherBlock = BufferGetBlockNumber(otherBuffer);
|
|
|
|
else
|
|
|
|
otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2001-06-29 23:08:25 +02:00
|
|
|
* We first try to put the tuple on the same page we last inserted a tuple
|
2008-11-06 21:51:15 +01:00
|
|
|
* on, as cached in the BulkInsertState or relcache entry. If that
|
|
|
|
* doesn't work, we ask the Free Space Map to locate a suitable page.
|
|
|
|
* Since the FSM's info might be out of date, we have to be prepared to
|
|
|
|
* loop around and retry multiple times. (To insure this isn't an infinite
|
|
|
|
* loop, we must update the FSM with the correct amount of free space on
|
|
|
|
* each page that proves not to be suitable.) If the FSM has no record of
|
|
|
|
* a page with enough free space, we give up and extend the relation.
|
2005-06-20 20:37:02 +02:00
|
|
|
*
|
|
|
|
* When use_fsm is false, we either put the tuple onto the existing target
|
|
|
|
* page or extend the relation.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2021-03-31 03:53:44 +02:00
|
|
|
if (bistate && bistate->current_buf != InvalidBuffer)
|
2008-11-06 21:51:15 +01:00
|
|
|
targetBlock = BufferGetBlockNumber(bistate->current_buf);
|
|
|
|
else
|
2010-02-09 22:43:30 +01:00
|
|
|
targetBlock = RelationGetTargetBlock(relation);
|
2001-06-29 23:08:25 +02:00
|
|
|
|
2005-06-20 20:37:02 +02:00
|
|
|
if (targetBlock == InvalidBlockNumber && use_fsm)
|
2001-06-29 23:08:25 +02:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We have no cached target page, so ask the FSM for an initial
|
|
|
|
* target.
|
|
|
|
*/
|
2021-03-31 03:53:44 +02:00
|
|
|
targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
|
2021-06-08 19:24:27 +02:00
|
|
|
}
|
2019-05-07 06:00:24 +02:00
|
|
|
|
2021-06-08 19:24:27 +02:00
|
|
|
/*
|
|
|
|
* If the FSM knows nothing of the rel, try the last page before we give
|
|
|
|
* up and extend. This avoids one-tuple-per-page syndrome during
|
|
|
|
* bootstrapping or in a recently-started system.
|
|
|
|
*/
|
|
|
|
if (targetBlock == InvalidBlockNumber)
|
|
|
|
{
|
|
|
|
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
|
2019-05-07 06:00:24 +02:00
|
|
|
|
2021-06-08 19:24:27 +02:00
|
|
|
if (nblocks > 0)
|
|
|
|
targetBlock = nblocks - 1;
|
2001-06-29 23:08:25 +02:00
|
|
|
}
|
|
|
|
|
2016-04-08 08:04:46 +02:00
|
|
|
loop:
|
2001-06-29 23:08:25 +02:00
|
|
|
while (targetBlock != InvalidBlockNumber)
|
2000-09-07 11:58:38 +02:00
|
|
|
{
|
2001-06-29 23:08:25 +02:00
|
|
|
/*
|
|
|
|
* Read and exclusive-lock the target block, as well as the other
|
|
|
|
* block if one was given, taking suitable care with lock ordering and
|
|
|
|
* the possibility they are the same block.
|
Make the visibility map crash-safe.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
2011-06-22 05:04:40 +02:00
|
|
|
*
|
|
|
|
* If the page-level all-visible flag is set, caller will need to
|
|
|
|
* clear both that and the corresponding visibility map bit. However,
|
|
|
|
* by the time we return, we'll have x-locked the buffer, and we don't
|
|
|
|
* want to do any I/O while in that state. So we check the bit here
|
|
|
|
* before taking the lock, and pin the page if it appears necessary.
|
|
|
|
* Checking without the lock creates a risk of getting the wrong
|
|
|
|
* answer, so we'll have to recheck after acquiring the lock.
|
2001-06-29 23:08:25 +02:00
|
|
|
*/
|
|
|
|
if (otherBuffer == InvalidBuffer)
|
|
|
|
{
|
|
|
|
/* easy case */
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
buffer = ReadBufferBI(relation, targetBlock, RBM_NORMAL, bistate);
|
2016-04-20 15:31:19 +02:00
|
|
|
if (PageIsAllVisible(BufferGetPage(buffer)))
|
Make the visibility map crash-safe.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
2011-06-22 05:04:40 +02:00
|
|
|
visibilitymap_pin(relation, targetBlock, vmbuffer);
|
Set PD_ALL_VISIBLE and visibility map bits in COPY FREEZE
Make sure COPY FREEZE marks the pages as PD_ALL_VISIBLE and updates the
visibility map. Until now we only marked individual tuples as frozen,
but page-level flags were not updated, so the first VACUUM after the
COPY FREEZE had to rewrite the whole table.
This is a fairly old patch, and multiple people worked on it. The first
version was written by Jeff Janes, and then reworked by Pavan Deolasee
and Anastasia Lubennikova.
Author: Anastasia Lubennikova, Pavan Deolasee, Jeff Janes
Reviewed-by: Kuntal Ghosh, Jeff Janes, Tomas Vondra, Masahiko Sawada,
Andres Freund, Ibrar Ahmed, Robert Haas, Tatsuro Ishii,
Darafei Praliaskouski
Discussion: https://postgr.es/m/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg@mail.gmail.com
Discussion: https://postgr.es/m/CAMkU%3D1w3osJJ2FneELhhNRLxfZitDgp9FPHee08NT2FQFmz_pQ%40mail.gmail.com
2021-01-17 22:11:39 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the page is empty, pin vmbuffer to set all_frozen bit later.
|
|
|
|
*/
|
|
|
|
if ((options & HEAP_INSERT_FROZEN) &&
|
|
|
|
(PageGetMaxOffsetNumber(BufferGetPage(buffer)) == 0))
|
|
|
|
visibilitymap_pin(relation, targetBlock, vmbuffer);
|
|
|
|
|
2001-06-29 23:08:25 +02:00
|
|
|
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
}
|
|
|
|
else if (otherBlock == targetBlock)
|
|
|
|
{
|
|
|
|
/* also easy case */
|
|
|
|
buffer = otherBuffer;
|
2016-04-20 15:31:19 +02:00
|
|
|
if (PageIsAllVisible(BufferGetPage(buffer)))
|
Make the visibility map crash-safe.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
2011-06-22 05:04:40 +02:00
|
|
|
visibilitymap_pin(relation, targetBlock, vmbuffer);
|
2001-06-29 23:08:25 +02:00
|
|
|
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
}
|
|
|
|
else if (otherBlock < targetBlock)
|
2001-05-17 00:35:12 +02:00
|
|
|
{
|
2001-06-29 23:08:25 +02:00
|
|
|
/* lock other buffer first */
|
|
|
|
buffer = ReadBuffer(relation, targetBlock);
|
2016-04-20 15:31:19 +02:00
|
|
|
if (PageIsAllVisible(BufferGetPage(buffer)))
|
Make the visibility map crash-safe.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
2011-06-22 05:04:40 +02:00
|
|
|
visibilitymap_pin(relation, targetBlock, vmbuffer);
|
2001-06-29 23:08:25 +02:00
|
|
|
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
|
2001-05-17 00:35:12 +02:00
|
|
|
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
}
|
2001-06-29 23:08:25 +02:00
|
|
|
else
|
|
|
|
{
|
|
|
|
/* lock target buffer first */
|
|
|
|
buffer = ReadBuffer(relation, targetBlock);
|
2016-04-20 15:31:19 +02:00
|
|
|
if (PageIsAllVisible(BufferGetPage(buffer)))
|
Make the visibility map crash-safe.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
2011-06-22 05:04:40 +02:00
|
|
|
visibilitymap_pin(relation, targetBlock, vmbuffer);
|
2001-06-29 23:08:25 +02:00
|
|
|
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
}
|
2001-10-25 07:50:21 +02:00
|
|
|
|
Make the visibility map crash-safe.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
2011-06-22 05:04:40 +02:00
|
|
|
/*
|
2011-06-27 19:55:55 +02:00
|
|
|
* We now have the target page (and the other buffer, if any) pinned
|
|
|
|
* and locked. However, since our initial PageIsAllVisible checks
|
|
|
|
* were performed before acquiring the lock, the results might now be
|
|
|
|
* out of date, either for the selected victim buffer, or for the
|
|
|
|
* other buffer passed by the caller. In that case, we'll need to
|
|
|
|
* give up our locks, go get the pin(s) we failed to get earlier, and
|
|
|
|
* re-lock. That's pretty painful, but hopefully shouldn't happen
|
|
|
|
* often.
|
|
|
|
*
|
|
|
|
* Note that there's a small possibility that we didn't pin the page
|
|
|
|
* above but still have the correct page pinned anyway, either because
|
Make the visibility map crash-safe.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
2011-06-22 05:04:40 +02:00
|
|
|
* we've already made a previous pass through this loop, or because
|
|
|
|
* caller passed us the right page anyway.
|
|
|
|
*
|
|
|
|
* Note also that it's possible that by the time we get the pin and
|
|
|
|
* retake the buffer locks, the visibility map bit will have been
|
|
|
|
* cleared by some other backend anyway. In that case, we'll have
|
|
|
|
* done a bit of extra work for no gain, but there's no real harm
|
|
|
|
* done.
|
|
|
|
*/
|
2023-04-06 19:27:30 +02:00
|
|
|
GetVisibilityMapPins(relation, buffer, otherBuffer,
|
|
|
|
targetBlock, otherBlock, vmbuffer,
|
|
|
|
vmbuffer_other);
|
Make the visibility map crash-safe.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
2011-06-22 05:04:40 +02:00
|
|
|
|
2001-06-29 23:08:25 +02:00
|
|
|
/*
|
|
|
|
* Now we can check to see if there's enough free space here. If so,
|
|
|
|
* we're done.
|
|
|
|
*/
|
2016-04-20 15:31:19 +02:00
|
|
|
page = BufferGetPage(buffer);
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If necessary initialize page, it'll be used soon. We could avoid
|
|
|
|
* dirtying the buffer here, and rely on the caller to do so whenever
|
|
|
|
* it puts a tuple onto the page, but there seems not much benefit in
|
|
|
|
* doing so.
|
|
|
|
*/
|
|
|
|
if (PageIsNew(page))
|
|
|
|
{
|
|
|
|
PageInit(page, BufferGetPageSize(buffer), 0);
|
|
|
|
MarkBufferDirty(buffer);
|
|
|
|
}
|
|
|
|
|
2008-07-13 22:45:47 +02:00
|
|
|
pageFreeSpace = PageGetHeapFreeSpace(page);
|
2021-03-31 03:53:44 +02:00
|
|
|
if (targetFreeSpace <= pageFreeSpace)
|
2001-06-29 23:08:25 +02:00
|
|
|
{
|
|
|
|
/* use this page as future insert target, too */
|
2010-02-09 22:43:30 +01:00
|
|
|
RelationSetTargetBlock(relation, targetBlock);
|
2001-06-29 23:08:25 +02:00
|
|
|
return buffer;
|
|
|
|
}
|
2001-10-25 07:50:21 +02:00
|
|
|
|
2001-06-29 23:08:25 +02:00
|
|
|
/*
|
|
|
|
* Not enough space, so we must give up our page locks and pin (if
|
|
|
|
* any) and prepare to look elsewhere. We don't care which order we
|
|
|
|
* unlock the two buffers in, so this can be slightly simpler than the
|
|
|
|
* code above.
|
|
|
|
*/
|
|
|
|
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
|
|
|
|
if (otherBuffer == InvalidBuffer)
|
|
|
|
ReleaseBuffer(buffer);
|
|
|
|
else if (otherBlock != targetBlock)
|
|
|
|
{
|
|
|
|
LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
|
|
|
|
ReleaseBuffer(buffer);
|
|
|
|
}
|
2001-10-25 07:50:21 +02:00
|
|
|
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
/* Is there an ongoing bulk extension? */
|
|
|
|
if (bistate && bistate->next_free != InvalidBlockNumber)
|
2016-04-08 08:04:46 +02:00
|
|
|
{
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
Assert(bistate->next_free <= bistate->last_free);
|
2016-04-08 08:04:46 +02:00
|
|
|
|
|
|
|
/*
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
* We bulk extended the relation before, and there are still some
|
|
|
|
* unused pages from that extension, so we don't need to look in
|
|
|
|
* the FSM for a new page. But do record the free space from the
|
|
|
|
* last page, somebody might insert narrower tuples later.
|
2016-04-08 08:04:46 +02:00
|
|
|
*/
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
if (use_fsm)
|
|
|
|
RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);
|
2016-04-08 08:04:46 +02:00
|
|
|
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
targetBlock = bistate->next_free;
|
|
|
|
if (bistate->next_free >= bistate->last_free)
|
2016-04-08 08:04:46 +02:00
|
|
|
{
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
bistate->next_free = InvalidBlockNumber;
|
|
|
|
bistate->last_free = InvalidBlockNumber;
|
2016-04-08 08:04:46 +02:00
|
|
|
}
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
else
|
|
|
|
bistate->next_free++;
|
|
|
|
}
|
|
|
|
else if (!use_fsm)
|
|
|
|
{
|
|
|
|
/* Without FSM, always fall out of the loop and extend */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Update FSM as to condition of this page, and ask for another
|
|
|
|
* page to try.
|
|
|
|
*/
|
|
|
|
targetBlock = RecordAndGetPageWithFreeSpace(relation,
|
|
|
|
targetBlock,
|
|
|
|
pageFreeSpace,
|
|
|
|
targetFreeSpace);
|
2016-04-08 08:04:46 +02:00
|
|
|
}
|
|
|
|
}
|
2001-05-12 21:58:28 +02:00
|
|
|
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
/* Have to extend the relation */
|
|
|
|
buffer = RelationAddBlocks(relation, bistate, num_pages, use_fsm,
|
|
|
|
&unlockedTargetBuffer);
|
2023-04-02 02:50:18 +02:00
|
|
|
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
targetBlock = BufferGetBlockNumber(buffer);
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
page = BufferGetPage(buffer);
|
2001-05-12 21:58:28 +02:00
|
|
|
|
Set PD_ALL_VISIBLE and visibility map bits in COPY FREEZE
Make sure COPY FREEZE marks the pages as PD_ALL_VISIBLE and updates the
visibility map. Until now we only marked individual tuples as frozen,
but page-level flags were not updated, so the first VACUUM after the
COPY FREEZE had to rewrite the whole table.
This is a fairly old patch, and multiple people worked on it. The first
version was written by Jeff Janes, and then reworked by Pavan Deolasee
and Anastasia Lubennikova.
Author: Anastasia Lubennikova, Pavan Deolasee, Jeff Janes
Reviewed-by: Kuntal Ghosh, Jeff Janes, Tomas Vondra, Masahiko Sawada,
Andres Freund, Ibrar Ahmed, Robert Haas, Tatsuro Ishii,
Darafei Praliaskouski
Discussion: https://postgr.es/m/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg@mail.gmail.com
Discussion: https://postgr.es/m/CAMkU%3D1w3osJJ2FneELhhNRLxfZitDgp9FPHee08NT2FQFmz_pQ%40mail.gmail.com
2021-01-17 22:11:39 +01:00
|
|
|
/*
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
* The page is empty, pin vmbuffer to set all_frozen bit. We don't want to
|
|
|
|
* do IO while the buffer is locked, so we unlock the page first if IO is
|
|
|
|
* needed (necessitating checks below).
|
Set PD_ALL_VISIBLE and visibility map bits in COPY FREEZE
Make sure COPY FREEZE marks the pages as PD_ALL_VISIBLE and updates the
visibility map. Until now we only marked individual tuples as frozen,
but page-level flags were not updated, so the first VACUUM after the
COPY FREEZE had to rewrite the whole table.
This is a fairly old patch, and multiple people worked on it. The first
version was written by Jeff Janes, and then reworked by Pavan Deolasee
and Anastasia Lubennikova.
Author: Anastasia Lubennikova, Pavan Deolasee, Jeff Janes
Reviewed-by: Kuntal Ghosh, Jeff Janes, Tomas Vondra, Masahiko Sawada,
Andres Freund, Ibrar Ahmed, Robert Haas, Tatsuro Ishii,
Darafei Praliaskouski
Discussion: https://postgr.es/m/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg@mail.gmail.com
Discussion: https://postgr.es/m/CAMkU%3D1w3osJJ2FneELhhNRLxfZitDgp9FPHee08NT2FQFmz_pQ%40mail.gmail.com
2021-01-17 22:11:39 +01:00
|
|
|
*/
|
|
|
|
if (options & HEAP_INSERT_FROZEN)
|
|
|
|
{
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
Assert(PageGetMaxOffsetNumber(page) == 0);
|
|
|
|
|
|
|
|
if (!visibilitymap_pin_ok(targetBlock, *vmbuffer))
|
|
|
|
{
|
hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
numbers of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
|
|
|
if (!unlockedTargetBuffer)
|
|
|
|
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
unlockedTargetBuffer = true;
|
|
|
|
visibilitymap_pin(relation, targetBlock, vmbuffer);
|
|
|
|
}
|
Set PD_ALL_VISIBLE and visibility map bits in COPY FREEZE
Make sure COPY FREEZE marks the pages as PD_ALL_VISIBLE and updates the
visibility map. Until now we only marked individual tuples as frozen,
but page-level flags were not updated, so the first VACUUM after the
COPY FREEZE had to rewrite the whole table.
This is a fairly old patch, and multiple people worked on it. The first
version was written by Jeff Janes, and then reworked by Pavan Deolasee
and Anastasia Lubennikova.
Author: Anastasia Lubennikova, Pavan Deolasee, Jeff Janes
Reviewed-by: Kuntal Ghosh, Jeff Janes, Tomas Vondra, Masahiko Sawada,
Andres Freund, Ibrar Ahmed, Robert Haas, Tatsuro Ishii,
Darafei Praliaskouski
Discussion: https://postgr.es/m/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg@mail.gmail.com
Discussion: https://postgr.es/m/CAMkU%3D1w3osJJ2FneELhhNRLxfZitDgp9FPHee08NT2FQFmz_pQ%40mail.gmail.com
2021-01-17 22:11:39 +01:00
|
|
|
}
|
|
|
|
|
2005-05-07 23:32:24 +02:00
|
|
|
/*
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
* Reacquire locks if necessary.
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
*
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
* If the target buffer was unlocked above, or is unlocked while
|
|
|
|
* reacquiring the lock on otherBuffer below, it's unlikely, but possible,
|
|
|
|
* that another backend used space on this page. We check for that below,
|
|
|
|
* and retry if necessary.
|
2005-05-07 23:32:24 +02:00
|
|
|
*/
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
recheckVmPins = false;
|
|
|
|
if (unlockedTargetBuffer)
|
|
|
|
{
|
|
|
|
/* released lock on target buffer above */
|
|
|
|
if (otherBuffer != InvalidBuffer)
|
|
|
|
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
recheckVmPins = true;
|
|
|
|
}
|
|
|
|
else if (otherBuffer != InvalidBuffer)
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
{
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
/*
|
|
|
|
* We did not release the target buffer, and otherBuffer is valid,
|
|
|
|
* need to lock the other buffer. It's guaranteed to be of a lower
|
|
|
|
* page number than the new page. To conform with the deadlock
|
|
|
|
* prevent rules, we ought to lock otherBuffer first, but that would
|
|
|
|
* give other backends a chance to put tuples on our page. To reduce
|
|
|
|
* the likelihood of that, attempt to lock the other buffer
|
|
|
|
* conditionally, that's very likely to work.
|
|
|
|
*
|
|
|
|
* Alternatively, we could acquire the lock on otherBuffer before
|
|
|
|
* extending the relation, but that'd require holding the lock while
|
|
|
|
* performing IO, which seems worse than an unlikely retry.
|
|
|
|
*/
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
Assert(otherBuffer != buffer);
|
Avoid improbable PANIC during heap_update.
heap_update needs to clear any existing "all visible" flag on
the old tuple's page (and on the new page too, if different).
Per coding rules, to do this it must acquire pin on the appropriate
visibility-map page while not holding exclusive buffer lock;
which creates a race condition since someone else could set the
flag whenever we're not holding the buffer lock. The code is
supposed to handle that by re-checking the flag after acquiring
buffer lock and retrying if it became set. However, one code
path through heap_update itself, as well as one in its subroutine
RelationGetBufferForTuple, failed to do this. The end result,
in the unlikely event that a concurrent VACUUM did set the flag
while we're transiently not holding lock, is a non-recurring
"PANIC: wrong buffer passed to visibilitymap_clear" failure.
This has been seen a few times in the buildfarm since recent VACUUM
changes that added code paths that could set the all-visible flag
while holding only exclusive buffer lock. Previously, the flag
was (usually?) set only after doing LockBufferForCleanup, which
would insist on buffer pin count zero, thus preventing the flag
from becoming set partway through heap_update. However, it's
clear that it's heap_update not VACUUM that's at fault here.
What's less clear is whether there is any hazard from these bugs
in released branches. heap_update is certainly violating API
expectations, but if there is no code path that can set all-visible
without a cleanup lock then it's only a latent bug. That's not
100% certain though, besides which we should worry about extensions
or future back-patch fixes that could introduce such code paths.
I chose to back-patch to v12. Fixing RelationGetBufferForTuple
before that would require also back-patching portions of older
fixes (notably 0d1fe9f74), which is more code churn than seems
prudent to fix a hypothetical issue.
Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
2021-04-13 18:17:24 +02:00
|
|
|
Assert(targetBlock > otherBlock);
|
2006-01-06 01:15:50 +01:00
|
|
|
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
if (unlikely(!ConditionalLockBuffer(otherBuffer)))
|
|
|
|
{
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
unlockedTargetBuffer = true;
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
|
|
|
|
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
|
|
|
|
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
|
2022-10-01 01:36:46 +02:00
|
|
|
}
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
recheckVmPins = true;
|
|
|
|
}
|
2006-01-06 01:15:50 +01:00
|
|
|
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
/*
|
|
|
|
* If one of the buffers was unlocked (always the case if otherBuffer is
|
|
|
|
* valid), it's possible, although unlikely, that an all-visible flag
|
|
|
|
* became set. We can use GetVisibilityMapPins to deal with that. It's
|
|
|
|
* possible that GetVisibilityMapPins() might need to temporarily release
|
|
|
|
* buffer locks, in which case we'll need to check if there's still enough
|
|
|
|
* space on the page below.
|
|
|
|
*/
|
|
|
|
if (recheckVmPins)
|
|
|
|
{
|
|
|
|
if (GetVisibilityMapPins(relation, otherBuffer, buffer,
|
|
|
|
otherBlock, targetBlock, vmbuffer_other,
|
|
|
|
vmbuffer))
|
|
|
|
unlockedTargetBuffer = true;
|
|
|
|
}
|
Avoid improbable PANIC during heap_update.
heap_update needs to clear any existing "all visible" flag on
the old tuple's page (and on the new page too, if different).
Per coding rules, to do this it must acquire pin on the appropriate
visibility-map page while not holding exclusive buffer lock;
which creates a race condition since someone else could set the
flag whenever we're not holding the buffer lock. The code is
supposed to handle that by re-checking the flag after acquiring
buffer lock and retrying if it became set. However, one code
path through heap_update itself, as well as one in its subroutine
RelationGetBufferForTuple, failed to do this. The end result,
in the unlikely event that a concurrent VACUUM did set the flag
while we're transiently not holding lock, is a non-recurring
"PANIC: wrong buffer passed to visibilitymap_clear" failure.
This has been seen a few times in the buildfarm since recent VACUUM
changes that added code paths that could set the all-visible flag
while holding only exclusive buffer lock. Previously, the flag
was (usually?) set only after doing LockBufferForCleanup, which
would insist on buffer pin count zero, thus preventing the flag
from becoming set partway through heap_update. However, it's
clear that it's heap_update not VACUUM that's at fault here.
What's less clear is whether there is any hazard from these bugs
in released branches. heap_update is certainly violating API
expectations, but if there is no code path that can set all-visible
without a cleanup lock then it's only a latent bug. That's not
100% certain though, besides which we should worry about extensions
or future back-patch fixes that could introduce such code paths.
I chose to back-patch to v12. Fixing RelationGetBufferForTuple
before that would require also back-patching portions of older
fixes (notably 0d1fe9f74), which is more code churn than seems
prudent to fix a hypothetical issue.
Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
2021-04-13 18:17:24 +02:00
|
|
|
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
/*
|
|
|
|
* If the target buffer was temporarily unlocked since the relation
|
|
|
|
* extension, it's possible, although unlikely, that all the space on the
|
|
|
|
* page was already used. If so, we just retry from the start. If we
|
|
|
|
* didn't unlock, something has gone wrong if there's not enough space -
|
|
|
|
* the test at the top should have prevented reaching this case.
|
|
|
|
*/
|
|
|
|
pageFreeSpace = PageGetHeapFreeSpace(page);
|
|
|
|
if (len > pageFreeSpace)
|
|
|
|
{
|
|
|
|
if (unlockedTargetBuffer)
|
2022-10-01 01:36:46 +02:00
|
|
|
{
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
if (otherBuffer != InvalidBuffer)
|
|
|
|
LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
|
2022-10-01 01:36:46 +02:00
|
|
|
UnlockReleaseBuffer(buffer);
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
|
2022-10-01 01:36:46 +02:00
|
|
|
goto loop;
|
Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
|
|
|
}
|
2014-01-23 23:18:23 +01:00
|
|
|
elog(PANIC, "tuple is too big: size %zu", len);
|
2001-05-12 21:58:28 +02:00
|
|
|
}
|
1998-12-15 13:47:01 +01:00
|
|
|
|
2001-06-29 23:08:25 +02:00
|
|
|
/*
|
|
|
|
* Remember the new page as our target for future insertions.
|
|
|
|
*
|
|
|
|
* XXX should we enter the new page into the free space map immediately,
|
|
|
|
* or just keep it for this backend's exclusive use in the short run
|
|
|
|
* (until VACUUM sees it)? Seems to depend on whether you expect the
|
|
|
|
* current backend to make more insertions or not, which is probably a
|
|
|
|
* good bet most of the time. So for now, don't add it to FSM yet.
|
|
|
|
*/
|
hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
|
|
|
RelationSetTargetBlock(relation, targetBlock);
|
2001-06-29 23:08:25 +02:00
|
|
|
|
2001-05-12 21:58:28 +02:00
|
|
|
return buffer;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|