postgresql/src/backend/access/heap/hio.c

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

868 lines
28 KiB
C
Raw Normal View History

/*-------------------------------------------------------------------------
*
* hio.c
* POSTGRES heap access method input/output code.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*
* IDENTIFICATION
2010-09-20 22:08:53 +02:00
* src/backend/access/heap/hio.c
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include "access/heapam.h"
1999-07-16 07:00:38 +02:00
#include "access/hio.h"
#include "access/htup_details.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
1996-10-21 07:59:49 +02:00
/*
2000-07-03 04:54:21 +02:00
* RelationPutHeapTuple - place tuple at specified page
*
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!! Must PANIC on failure!!!
*
* Note - caller must hold BUFFER_LOCK_EXCLUSIVE on the buffer.
*/
void
RelationPutHeapTuple(Relation relation,
Buffer buffer,
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE. The newly added ON CONFLICT clause allows to specify an alternative to raising a unique or exclusion constraint violation error when inserting. ON CONFLICT refers to constraints that can either be specified using a inference clause (by specifying the columns of a unique constraint) or by naming a unique or exclusion constraint. DO NOTHING avoids the constraint violation, without touching the pre-existing row. DO UPDATE SET ... [WHERE ...] updates the pre-existing tuple, and has access to both the tuple proposed for insertion and the existing tuple; the optional WHERE clause can be used to prevent an update from being executed. The UPDATE SET and WHERE clauses have access to the tuple proposed for insertion using the "magic" EXCLUDED alias, and to the pre-existing tuple using the table name or its alias. This feature is often referred to as upsert. This is implemented using a new infrastructure called "speculative insertion". It is an optimistic variant of regular insertion that first does a pre-check for existing tuples and then attempts an insert. If a violating tuple was inserted concurrently, the speculatively inserted tuple is deleted and a new attempt is made. If the pre-check finds a matching tuple the alternative DO NOTHING or DO UPDATE action is taken. If the insertion succeeds without detecting a conflict, the tuple is deemed inserted. To handle the possible ambiguity between the excluded alias and a table named excluded, and for convenience with long relation names, INSERT INTO now can alias its target table. Bumps catversion as stored rules change. Author: Peter Geoghegan, with significant contributions from Heikki Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes. Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs, Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
HeapTuple tuple,
bool token)
{
Page pageHeader;
OffsetNumber offnum;
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE. The newly added ON CONFLICT clause allows to specify an alternative to raising a unique or exclusion constraint violation error when inserting. ON CONFLICT refers to constraints that can either be specified using a inference clause (by specifying the columns of a unique constraint) or by naming a unique or exclusion constraint. DO NOTHING avoids the constraint violation, without touching the pre-existing row. DO UPDATE SET ... [WHERE ...] updates the pre-existing tuple, and has access to both the tuple proposed for insertion and the existing tuple; the optional WHERE clause can be used to prevent an update from being executed. The UPDATE SET and WHERE clauses have access to the tuple proposed for insertion using the "magic" EXCLUDED alias, and to the pre-existing tuple using the table name or its alias. This feature is often referred to as upsert. This is implemented using a new infrastructure called "speculative insertion". It is an optimistic variant of regular insertion that first does a pre-check for existing tuples and then attempts an insert. If a violating tuple was inserted concurrently, the speculatively inserted tuple is deleted and a new attempt is made. If the pre-check finds a matching tuple the alternative DO NOTHING or DO UPDATE action is taken. If the insertion succeeds without detecting a conflict, the tuple is deemed inserted. To handle the possible ambiguity between the excluded alias and a table named excluded, and for convenience with long relation names, INSERT INTO now can alias its target table. Bumps catversion as stored rules change. Author: Peter Geoghegan, with significant contributions from Heikki Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes. Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs, Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
/*
* A tuple that's being inserted speculatively should already have its
* token set.
*/
Assert(!token || HeapTupleHeaderIsSpeculative(tuple->t_data));
/*
* Do not allow tuples with invalid combinations of hint bits to be placed
* on a page. This combination is detected as corruption by the
* contrib/amcheck logic, so if you disable this assertion, make
* corresponding changes there.
*/
Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_COMMITTED) &&
(tuple->t_data->t_infomask & HEAP_XMAX_IS_MULTI)));
/* Add the tuple to the page */
pageHeader = BufferGetPage(buffer);
offnum = PageAddItem(pageHeader, (Item) tuple->t_data,
tuple->t_len, InvalidOffsetNumber, false, true);
2000-07-03 04:54:21 +02:00
if (offnum == InvalidOffsetNumber)
elog(PANIC, "failed to add tuple to page");
2000-07-03 04:54:21 +02:00
/* Update tuple->t_self to the actual position where it was stored */
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE. The newly added ON CONFLICT clause allows to specify an alternative to raising a unique or exclusion constraint violation error when inserting. ON CONFLICT refers to constraints that can either be specified using a inference clause (by specifying the columns of a unique constraint) or by naming a unique or exclusion constraint. DO NOTHING avoids the constraint violation, without touching the pre-existing row. DO UPDATE SET ... [WHERE ...] updates the pre-existing tuple, and has access to both the tuple proposed for insertion and the existing tuple; the optional WHERE clause can be used to prevent an update from being executed. The UPDATE SET and WHERE clauses have access to the tuple proposed for insertion using the "magic" EXCLUDED alias, and to the pre-existing tuple using the table name or its alias. This feature is often referred to as upsert. This is implemented using a new infrastructure called "speculative insertion". It is an optimistic variant of regular insertion that first does a pre-check for existing tuples and then attempts an insert. If a violating tuple was inserted concurrently, the speculatively inserted tuple is deleted and a new attempt is made. If the pre-check finds a matching tuple the alternative DO NOTHING or DO UPDATE action is taken. If the insertion succeeds without detecting a conflict, the tuple is deemed inserted. To handle the possible ambiguity between the excluded alias and a table named excluded, and for convenience with long relation names, INSERT INTO now can alias its target table. Bumps catversion as stored rules change. Author: Peter Geoghegan, with significant contributions from Heikki Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes. Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs, Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
/*
* Insert the correct position into CTID of the stored tuple, too (unless
* this is a speculative insertion, in which case the token is held in
* CTID field instead)
*/
if (!token)
{
ItemId itemId = PageGetItemId(pageHeader, offnum);
HeapTupleHeader item = (HeapTupleHeader) PageGetItem(pageHeader, itemId);
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE. The newly added ON CONFLICT clause allows to specify an alternative to raising a unique or exclusion constraint violation error when inserting. ON CONFLICT refers to constraints that can either be specified using a inference clause (by specifying the columns of a unique constraint) or by naming a unique or exclusion constraint. DO NOTHING avoids the constraint violation, without touching the pre-existing row. DO UPDATE SET ... [WHERE ...] updates the pre-existing tuple, and has access to both the tuple proposed for insertion and the existing tuple; the optional WHERE clause can be used to prevent an update from being executed. The UPDATE SET and WHERE clauses have access to the tuple proposed for insertion using the "magic" EXCLUDED alias, and to the pre-existing tuple using the table name or its alias. This feature is often referred to as upsert. This is implemented using a new infrastructure called "speculative insertion". It is an optimistic variant of regular insertion that first does a pre-check for existing tuples and then attempts an insert. If a violating tuple was inserted concurrently, the speculatively inserted tuple is deleted and a new attempt is made. If the pre-check finds a matching tuple the alternative DO NOTHING or DO UPDATE action is taken. If the insertion succeeds without detecting a conflict, the tuple is deemed inserted. To handle the possible ambiguity between the excluded alias and a table named excluded, and for convenience with long relation names, INSERT INTO now can alias its target table. Bumps catversion as stored rules change. Author: Peter Geoghegan, with significant contributions from Heikki Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes. Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs, Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
item->t_ctid = tuple->t_self;
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE. The newly added ON CONFLICT clause allows to specify an alternative to raising a unique or exclusion constraint violation error when inserting. ON CONFLICT refers to constraints that can either be specified using a inference clause (by specifying the columns of a unique constraint) or by naming a unique or exclusion constraint. DO NOTHING avoids the constraint violation, without touching the pre-existing row. DO UPDATE SET ... [WHERE ...] updates the pre-existing tuple, and has access to both the tuple proposed for insertion and the existing tuple; the optional WHERE clause can be used to prevent an update from being executed. The UPDATE SET and WHERE clauses have access to the tuple proposed for insertion using the "magic" EXCLUDED alias, and to the pre-existing tuple using the table name or its alias. This feature is often referred to as upsert. This is implemented using a new infrastructure called "speculative insertion". It is an optimistic variant of regular insertion that first does a pre-check for existing tuples and then attempts an insert. If a violating tuple was inserted concurrently, the speculatively inserted tuple is deleted and a new attempt is made. If the pre-check finds a matching tuple the alternative DO NOTHING or DO UPDATE action is taken. If the insertion succeeds without detecting a conflict, the tuple is deemed inserted. To handle the possible ambiguity between the excluded alias and a table named excluded, and for convenience with long relation names, INSERT INTO now can alias its target table. Bumps catversion as stored rules change. Author: Peter Geoghegan, with significant contributions from Heikki Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes. Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs, Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
}
}
/*
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
* Read in a buffer in mode, using bulk-insert strategy if bistate isn't NULL.
*/
static Buffer
ReadBufferBI(Relation relation, BlockNumber targetBlock,
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
ReadBufferMode mode, BulkInsertState bistate)
{
Buffer buffer;
/* If not bulk-insert, exactly like ReadBuffer */
if (!bistate)
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
return ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
mode, NULL);
/* If we have the desired block already pinned, re-pin and return it */
if (bistate->current_buf != InvalidBuffer)
{
if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
{
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
/*
* Currently the LOCK variants are only used for extending
* relation, which should never reach this branch.
*/
Assert(mode != RBM_ZERO_AND_LOCK &&
mode != RBM_ZERO_AND_CLEANUP_LOCK);
IncrBufferRefCount(bistate->current_buf);
return bistate->current_buf;
}
/* ... else drop the old buffer */
ReleaseBuffer(bistate->current_buf);
bistate->current_buf = InvalidBuffer;
}
/* Perform a read using the buffer strategy */
buffer = ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
mode, bistate->strategy);
/* Save the selected block as target for future inserts */
IncrBufferRefCount(buffer);
bistate->current_buf = buffer;
return buffer;
}
/*
* For each heap page which is all-visible, acquire a pin on the appropriate
* visibility map page, if we haven't already got one.
*
* To avoid complexity in the callers, either buffer1 or buffer2 may be
* InvalidBuffer if only one buffer is involved. For the same reason, block2
* may be smaller than block1.
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
*
* Returns whether buffer locks were temporarily released.
*/
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
static bool
GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
bool need_to_pin_buffer1;
bool need_to_pin_buffer2;
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
bool released_locks = false;
/*
* Swap buffers around to handle case of a single block/buffer, and to
* handle if lock ordering rules require to lock block2 first.
*/
if (!BufferIsValid(buffer1) ||
(BufferIsValid(buffer2) && block1 > block2))
{
Buffer tmpbuf = buffer1;
Buffer *tmpvmbuf = vmbuffer1;
BlockNumber tmpblock = block1;
buffer1 = buffer2;
vmbuffer1 = vmbuffer2;
block1 = block2;
buffer2 = tmpbuf;
vmbuffer2 = tmpvmbuf;
block2 = tmpblock;
}
Assert(BufferIsValid(buffer1));
Assert(buffer2 == InvalidBuffer || block1 <= block2);
while (1)
{
/* Figure out which pins we need but don't have. */
need_to_pin_buffer1 = PageIsAllVisible(BufferGetPage(buffer1))
&& !visibilitymap_pin_ok(block1, *vmbuffer1);
need_to_pin_buffer2 = buffer2 != InvalidBuffer
&& PageIsAllVisible(BufferGetPage(buffer2))
&& !visibilitymap_pin_ok(block2, *vmbuffer2);
if (!need_to_pin_buffer1 && !need_to_pin_buffer2)
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
break;
/* We must unlock both buffers before doing any I/O. */
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
released_locks = true;
LockBuffer(buffer1, BUFFER_LOCK_UNLOCK);
if (buffer2 != InvalidBuffer && buffer2 != buffer1)
LockBuffer(buffer2, BUFFER_LOCK_UNLOCK);
/* Get pins. */
if (need_to_pin_buffer1)
visibilitymap_pin(relation, block1, vmbuffer1);
if (need_to_pin_buffer2)
visibilitymap_pin(relation, block2, vmbuffer2);
/* Relock buffers. */
LockBuffer(buffer1, BUFFER_LOCK_EXCLUSIVE);
if (buffer2 != InvalidBuffer && buffer2 != buffer1)
LockBuffer(buffer2, BUFFER_LOCK_EXCLUSIVE);
/*
* If there are two buffers involved and we pinned just one of them,
* it's possible that the second one became all-visible while we were
* busy pinning the first one. If it looks like that's a possible
* scenario, we'll need to make a second pass through this loop.
*/
if (buffer2 == InvalidBuffer || buffer1 == buffer2
|| (need_to_pin_buffer1 && need_to_pin_buffer2))
break;
}
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
return released_locks;
}
/*
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
* Extend the relation. By multiple pages, if beneficial.
*
* If the caller needs multiple pages (num_pages > 1), we always try to extend
* by at least that much.
*
* If there is contention on the extension lock, we don't just extend "for
* ourselves", but we try to help others. We can do so by adding empty pages
* into the FSM. Typically there is no contention when we can't use the FSM.
*
* We do have to limit the number of pages to extend by to some value, as the
* buffers for all the extended pages need to, temporarily, be pinned. For now
* we define MAX_BUFFERS_TO_EXTEND_BY to be 64 buffers, it's hard to see
* benefits with higher numbers. This partially is because copyfrom.c's
* MAX_BUFFERED_TUPLES / MAX_BUFFERED_BYTES prevents larger multi_inserts.
*
* Returns a buffer for a newly extended block. If possible, the buffer is
* returned exclusively locked. *did_unlock is set to true if the lock had to
* be released, false otherwise.
*
*
* XXX: It would likely be beneficial for some workloads to extend more
* aggressively, e.g. using a heuristic based on the relation size.
*/
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
static Buffer
RelationAddBlocks(Relation relation, BulkInsertState bistate,
int num_pages, bool use_fsm, bool *did_unlock)
{
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
#define MAX_BUFFERS_TO_EXTEND_BY 64
Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
BlockNumber first_block = InvalidBlockNumber;
BlockNumber last_block = InvalidBlockNumber;
uint32 extend_by_pages;
uint32 not_in_fsm_pages;
Buffer buffer;
Page page;
/*
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
* Determine by how many pages to try to extend by.
*/
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
if (bistate == NULL && !use_fsm)
{
/*
* If we have neither bistate, nor can use the FSM, we can't bulk
* extend - there'd be no way to find the additional pages.
*/
extend_by_pages = 1;
}
else
{
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
uint32 waitcount;
/*
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
* Try to extend at least by the number of pages the caller needs. We
* can remember the additional pages (either via FSM or bistate).
*/
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
extend_by_pages = num_pages;
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
if (!RELATION_IS_LOCAL(relation))
waitcount = RelationExtensionLockWaiterCount(relation);
else
waitcount = 0;
/*
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
* Multiply the number of pages to extend by the number of waiters. Do
* this even if we're not using the FSM, as it still relieves
* contention, by deferring the next time this backend needs to
* extend. In that case the extended pages will be found via
* bistate->next_free.
*/
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
extend_by_pages += extend_by_pages * waitcount;
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
/*
* Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
* them all concurrently.
*/
extend_by_pages = Min(extend_by_pages, MAX_BUFFERS_TO_EXTEND_BY);
}
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
/*
* How many of the extended pages should be entered into the FSM?
*
* If we have a bistate, only enter pages that we don't need ourselves
* into the FSM. Otherwise every other backend will immediately try to
* use the pages this backend needs for itself, causing unnecessary
* contention. If we don't have a bistate, we can't avoid the FSM.
*
* Never enter the page returned into the FSM, we'll immediately use it.
*/
if (num_pages > 1 && bistate == NULL)
not_in_fsm_pages = 1;
else
not_in_fsm_pages = num_pages;
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
/* prepare to put another buffer into the bistate */
if (bistate && bistate->current_buf != InvalidBuffer)
{
ReleaseBuffer(bistate->current_buf);
bistate->current_buf = InvalidBuffer;
}
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
/*
* Extend the relation. We ask for the first returned page to be locked,
* so that we are sure that nobody has inserted into the page
* concurrently.
*
* With the current MAX_BUFFERS_TO_EXTEND_BY there's no danger of
* [auto]vacuum trying to truncate later pages as REL_TRUNCATE_MINIMUM is
* way larger.
*/
first_block = ExtendBufferedRelBy(EB_REL(relation), MAIN_FORKNUM,
bistate ? bistate->strategy : NULL,
EB_LOCK_FIRST,
extend_by_pages,
victim_buffers,
&extend_by_pages);
buffer = victim_buffers[0]; /* the buffer the function will return */
last_block = first_block + (extend_by_pages - 1);
Assert(first_block == BufferGetBlockNumber(buffer));
/*
* Relation is now extended. Initialize the page. We do this here, before
* potentially releasing the lock on the page, because it allows us to
* double check that the page contents are empty (this should never
* happen, but if it does we don't want to risk wiping out valid data).
*/
page = BufferGetPage(buffer);
if (!PageIsNew(page))
elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
first_block,
RelationGetRelationName(relation));
PageInit(page, BufferGetPageSize(buffer), 0);
MarkBufferDirty(buffer);
/*
* If we decided to put pages into the FSM, release the buffer lock (but
* not pin), we don't want to do IO while holding a buffer lock. This will
* necessitate a bit more extensive checking in our caller.
*/
if (use_fsm && not_in_fsm_pages < extend_by_pages)
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
*did_unlock = true;
}
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
else
*did_unlock = false;
/*
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
* Relation is now extended. Release pins on all buffers, except for the
* first (which we'll return). If we decided to put pages into the FSM,
* we can do that as part of the same loop.
*/
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
for (uint32 i = 1; i < extend_by_pages; i++)
{
BlockNumber curBlock = first_block + i;
Assert(curBlock == BufferGetBlockNumber(victim_buffers[i]));
Assert(BlockNumberIsValid(curBlock));
ReleaseBuffer(victim_buffers[i]);
if (use_fsm && i >= not_in_fsm_pages)
{
Size freespace = BufferGetPageSize(victim_buffers[i]) -
SizeOfPageHeaderData;
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
RecordPageWithFreeSpace(relation, curBlock, freespace);
}
}
if (use_fsm && not_in_fsm_pages < extend_by_pages)
{
BlockNumber first_fsm_block = first_block + not_in_fsm_pages;
FreeSpaceMapVacuumRange(relation, first_fsm_block, last_block);
}
if (bistate)
{
/*
* Remember the additional pages we extended by, so we later can use
* them without looking into the FSM.
*/
if (extend_by_pages > 1)
{
bistate->next_free = first_block + 1;
bistate->last_free = last_block;
}
else
{
bistate->next_free = InvalidBlockNumber;
bistate->last_free = InvalidBlockNumber;
}
/* maintain bistate->current_buf */
IncrBufferRefCount(buffer);
bistate->current_buf = buffer;
}
return buffer;
#undef MAX_BUFFERS_TO_EXTEND_BY
}
/*
2000-07-03 04:54:21 +02:00
* RelationGetBufferForTuple
*
* Returns pinned and exclusive-locked buffer of a page in given relation
* with free space >= given len.
*
* If num_pages is > 1, we will try to extend the relation by at least that
* many pages when we decide to extend the relation. This is more efficient
* for callers that know they will need multiple pages
* (e.g. heap_multi_insert()).
*
* If otherBuffer is not InvalidBuffer, then it references a previously
* pinned buffer of another page in the same relation; on return, this
* buffer will also be exclusive-locked. (This case is used by heap_update;
* the otherBuffer contains the tuple being updated.)
*
* The reason for passing otherBuffer is that if two backends are doing
* concurrent heap_update operations, a deadlock could occur if they try
* to lock the same two buffers in opposite orders. To ensure that this
* can't happen, we impose the rule that buffers of a relation must be
* locked in increasing page number order. This is most conveniently done
* by having RelationGetBufferForTuple lock them both, with suitable care
* for ordering.
*
* NOTE: it is unlikely, but not quite impossible, for otherBuffer to be the
* same buffer we select for insertion of the new tuple (this could only
* happen if space is freed in that page after heap_update finds there's not
* enough there). In that case, the page will be pinned and locked only once.
*
Avoid improbable PANIC during heap_update. heap_update needs to clear any existing "all visible" flag on the old tuple's page (and on the new page too, if different). Per coding rules, to do this it must acquire pin on the appropriate visibility-map page while not holding exclusive buffer lock; which creates a race condition since someone else could set the flag whenever we're not holding the buffer lock. The code is supposed to handle that by re-checking the flag after acquiring buffer lock and retrying if it became set. However, one code path through heap_update itself, as well as one in its subroutine RelationGetBufferForTuple, failed to do this. The end result, in the unlikely event that a concurrent VACUUM did set the flag while we're transiently not holding lock, is a non-recurring "PANIC: wrong buffer passed to visibilitymap_clear" failure. This has been seen a few times in the buildfarm since recent VACUUM changes that added code paths that could set the all-visible flag while holding only exclusive buffer lock. Previously, the flag was (usually?) set only after doing LockBufferForCleanup, which would insist on buffer pin count zero, thus preventing the flag from becoming set partway through heap_update. However, it's clear that it's heap_update not VACUUM that's at fault here. What's less clear is whether there is any hazard from these bugs in released branches. heap_update is certainly violating API expectations, but if there is no code path that can set all-visible without a cleanup lock then it's only a latent bug. That's not 100% certain though, besides which we should worry about extensions or future back-patch fixes that could introduce such code paths. I chose to back-patch to v12. Fixing RelationGetBufferForTuple before that would require also back-patching portions of older fixes (notably 0d1fe9f74), which is more code churn than seems prudent to fix a hypothetical issue. Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
2021-04-13 18:17:24 +02:00
* We also handle the possibility that the all-visible flag will need to be
* cleared on one or both pages. If so, pin on the associated visibility map
* page must be acquired before acquiring buffer lock(s), to avoid possibly
* doing I/O while holding buffer locks. The pins are passed back to the
* caller using the input-output arguments vmbuffer and vmbuffer_other.
* Note that in some cases the caller might have already acquired such pins,
* which is indicated by these arguments not being InvalidBuffer on entry.
*
* We normally use FSM to help us find free space. However,
* if HEAP_INSERT_SKIP_FSM is specified, we just append a new empty page to
* the end of the relation if the tuple won't fit on the current target page.
* This can save some cycles when we know the relation is new and doesn't
* contain useful amounts of free space.
*
* HEAP_INSERT_SKIP_FSM is also useful for non-WAL-logged additions to a
* relation, if the caller holds exclusive lock and is careful to invalidate
* relation's smgr_targblock before the first insertion --- that ensures that
* all insertions will occur into newly added pages and not be intermixed
* with tuples from other transactions. That way, a crash can't risk losing
* any committed data of other transactions. (See heap_insert's comments
* for additional constraints needed for safe usage of this behavior.)
*
* The caller can also provide a BulkInsertState object to optimize many
* insertions into the same relation. This keeps a pin on the current
* insertion target page (to save pin/unpin cycles) and also passes a
* BULKWRITE buffer selection strategy object to the buffer manager.
* Passing NULL for bistate selects the default behavior.
*
* We don't fill existing pages further than the fillfactor, except for large
* tuples in nearly-empty pages. This is OK since this routine is not
* consulted when updating a tuple and keeping it on the same page, which is
* the scenario fillfactor is meant to reserve space for.
*
* ereport(ERROR) is allowed here, so this routine *must* be called
* before any (unlogged) changes are made in buffer pool.
*/
2000-07-03 04:54:21 +02:00
Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
Buffer *vmbuffer, Buffer *vmbuffer_other,
int num_pages)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
Page page;
Size nearlyEmptyFreeSpace,
pageFreeSpace = 0,
saveFreeSpace = 0,
targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
bool unlockedTargetBuffer;
bool recheckVmPins;
2000-07-03 04:54:21 +02:00
len = MAXALIGN(len); /* be conservative */
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
/* if the caller doesn't know by how many pages to extend, extend by 1 */
if (num_pages <= 0)
num_pages = 1;
/* Bulk insert is not supported for updates, only inserts. */
Assert(otherBuffer == InvalidBuffer || !bistate);
/*
2000-07-03 04:54:21 +02:00
* If we're gonna fail for oversize tuple, do it right away
*/
if (len > MaxHeapTupleSize)
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("row is too big: size %zu, maximum size %zu",
len, MaxHeapTupleSize)));
/* Compute desired extra freespace due to fillfactor option */
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
/*
* Since pages without tuples can still have line pointers, we consider
* pages "empty" when the unavailable space is slight. This threshold is
* somewhat arbitrary, but it should prevent most unnecessary relation
* extensions while inserting large tuples into low-fillfactor tables.
*/
nearlyEmptyFreeSpace = MaxHeapTupleSize -
(MaxHeapTuplesPerPage / 8 * sizeof(ItemIdData));
if (len + saveFreeSpace > nearlyEmptyFreeSpace)
targetFreeSpace = Max(len, nearlyEmptyFreeSpace);
else
targetFreeSpace = len + saveFreeSpace;
if (otherBuffer != InvalidBuffer)
otherBlock = BufferGetBlockNumber(otherBuffer);
else
otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */
/*
* We first try to put the tuple on the same page we last inserted a tuple
* on, as cached in the BulkInsertState or relcache entry. If that
* doesn't work, we ask the Free Space Map to locate a suitable page.
* Since the FSM's info might be out of date, we have to be prepared to
* loop around and retry multiple times. (To insure this isn't an infinite
* loop, we must update the FSM with the correct amount of free space on
* each page that proves not to be suitable.) If the FSM has no record of
* a page with enough free space, we give up and extend the relation.
*
* When use_fsm is false, we either put the tuple onto the existing target
* page or extend the relation.
*/
if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
targetBlock = RelationGetTargetBlock(relation);
if (targetBlock == InvalidBlockNumber && use_fsm)
{
/*
* We have no cached target page, so ask the FSM for an initial
* target.
*/
targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
}
/*
* If the FSM knows nothing of the rel, try the last page before we give
* up and extend. This avoids one-tuple-per-page syndrome during
* bootstrapping or in a recently-started system.
*/
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
if (nblocks > 0)
targetBlock = nblocks - 1;
}
loop:
while (targetBlock != InvalidBlockNumber)
{
/*
* Read and exclusive-lock the target block, as well as the other
* block if one was given, taking suitable care with lock ordering and
* the possibility they are the same block.
*
* If the page-level all-visible flag is set, caller will need to
* clear both that and the corresponding visibility map bit. However,
* by the time we return, we'll have x-locked the buffer, and we don't
* want to do any I/O while in that state. So we check the bit here
* before taking the lock, and pin the page if it appears necessary.
* Checking without the lock creates a risk of getting the wrong
* answer, so we'll have to recheck after acquiring the lock.
*/
if (otherBuffer == InvalidBuffer)
{
/* easy case */
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
buffer = ReadBufferBI(relation, targetBlock, RBM_NORMAL, bistate);
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
/*
* If the page is empty, pin vmbuffer to set all_frozen bit later.
*/
if ((options & HEAP_INSERT_FROZEN) &&
(PageGetMaxOffsetNumber(BufferGetPage(buffer)) == 0))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else if (otherBlock == targetBlock)
{
/* also easy case */
buffer = otherBuffer;
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else if (otherBlock < targetBlock)
{
/* lock other buffer first */
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else
{
/* lock target buffer first */
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
}
/*
* We now have the target page (and the other buffer, if any) pinned
* and locked. However, since our initial PageIsAllVisible checks
* were performed before acquiring the lock, the results might now be
* out of date, either for the selected victim buffer, or for the
* other buffer passed by the caller. In that case, we'll need to
* give up our locks, go get the pin(s) we failed to get earlier, and
* re-lock. That's pretty painful, but hopefully shouldn't happen
* often.
*
* Note that there's a small possibility that we didn't pin the page
* above but still have the correct page pinned anyway, either because
* we've already made a previous pass through this loop, or because
* caller passed us the right page anyway.
*
* Note also that it's possible that by the time we get the pin and
* retake the buffer locks, the visibility map bit will have been
* cleared by some other backend anyway. In that case, we'll have
* done a bit of extra work for no gain, but there's no real harm
* done.
*/
GetVisibilityMapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
/*
* Now we can check to see if there's enough free space here. If so,
* we're done.
*/
page = BufferGetPage(buffer);
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
/*
* If necessary initialize page, it'll be used soon. We could avoid
* dirtying the buffer here, and rely on the caller to do so whenever
* it puts a tuple onto the page, but there seems not much benefit in
* doing so.
*/
if (PageIsNew(page))
{
PageInit(page, BufferGetPageSize(buffer), 0);
MarkBufferDirty(buffer);
}
pageFreeSpace = PageGetHeapFreeSpace(page);
if (targetFreeSpace <= pageFreeSpace)
{
/* use this page as future insert target, too */
RelationSetTargetBlock(relation, targetBlock);
return buffer;
}
/*
* Not enough space, so we must give up our page locks and pin (if
* any) and prepare to look elsewhere. We don't care which order we
* unlock the two buffers in, so this can be slightly simpler than the
* code above.
*/
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
if (otherBuffer == InvalidBuffer)
ReleaseBuffer(buffer);
else if (otherBlock != targetBlock)
{
LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buffer);
}
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
/* Is there an ongoing bulk extension? */
if (bistate && bistate->next_free != InvalidBlockNumber)
{
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
Assert(bistate->next_free <= bistate->last_free);
/*
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
* We bulk extended the relation before, and there are still some
* unused pages from that extension, so we don't need to look in
* the FSM for a new page. But do record the free space from the
* last page, somebody might insert narrower tuples later.
*/
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
if (use_fsm)
RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
targetBlock = bistate->next_free;
if (bistate->next_free >= bistate->last_free)
{
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
bistate->next_free = InvalidBlockNumber;
bistate->last_free = InvalidBlockNumber;
}
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
else
bistate->next_free++;
}
else if (!use_fsm)
{
/* Without FSM, always fall out of the loop and extend */
break;
}
else
{
/*
* Update FSM as to condition of this page, and ask for another
* page to try.
*/
targetBlock = RecordAndGetPageWithFreeSpace(relation,
targetBlock,
pageFreeSpace,
targetFreeSpace);
}
}
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
/* Have to extend the relation */
buffer = RelationAddBlocks(relation, bistate, num_pages, use_fsm,
&unlockedTargetBuffer);
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
targetBlock = BufferGetBlockNumber(buffer);
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
page = BufferGetPage(buffer);
/*
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
* The page is empty, pin vmbuffer to set all_frozen bit. We don't want to
* do IO while the buffer is locked, so we unlock the page first if IO is
* needed (necessitating checks below).
*/
if (options & HEAP_INSERT_FROZEN)
{
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
Assert(PageGetMaxOffsetNumber(page) == 0);
if (!visibilitymap_pin_ok(targetBlock, *vmbuffer))
{
hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-07 01:35:21 +02:00
if (!unlockedTargetBuffer)
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
unlockedTargetBuffer = true;
visibilitymap_pin(relation, targetBlock, vmbuffer);
}
}
/*
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
* Reacquire locks if necessary.
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
*
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
* If the target buffer was unlocked above, or is unlocked while
* reacquiring the lock on otherBuffer below, it's unlikely, but possible,
* that another backend used space on this page. We check for that below,
* and retry if necessary.
*/
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
recheckVmPins = false;
if (unlockedTargetBuffer)
{
/* released lock on target buffer above */
if (otherBuffer != InvalidBuffer)
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
recheckVmPins = true;
}
else if (otherBuffer != InvalidBuffer)
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
{
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
/*
* We did not release the target buffer, and otherBuffer is valid,
* need to lock the other buffer. It's guaranteed to be of a lower
* page number than the new page. To conform with the deadlock
* prevent rules, we ought to lock otherBuffer first, but that would
* give other backends a chance to put tuples on our page. To reduce
* the likelihood of that, attempt to lock the other buffer
* conditionally, that's very likely to work.
*
* Alternatively, we could acquire the lock on otherBuffer before
* extending the relation, but that'd require holding the lock while
* performing IO, which seems worse than an unlikely retry.
*/
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
Assert(otherBuffer != buffer);
Avoid improbable PANIC during heap_update. heap_update needs to clear any existing "all visible" flag on the old tuple's page (and on the new page too, if different). Per coding rules, to do this it must acquire pin on the appropriate visibility-map page while not holding exclusive buffer lock; which creates a race condition since someone else could set the flag whenever we're not holding the buffer lock. The code is supposed to handle that by re-checking the flag after acquiring buffer lock and retrying if it became set. However, one code path through heap_update itself, as well as one in its subroutine RelationGetBufferForTuple, failed to do this. The end result, in the unlikely event that a concurrent VACUUM did set the flag while we're transiently not holding lock, is a non-recurring "PANIC: wrong buffer passed to visibilitymap_clear" failure. This has been seen a few times in the buildfarm since recent VACUUM changes that added code paths that could set the all-visible flag while holding only exclusive buffer lock. Previously, the flag was (usually?) set only after doing LockBufferForCleanup, which would insist on buffer pin count zero, thus preventing the flag from becoming set partway through heap_update. However, it's clear that it's heap_update not VACUUM that's at fault here. What's less clear is whether there is any hazard from these bugs in released branches. heap_update is certainly violating API expectations, but if there is no code path that can set all-visible without a cleanup lock then it's only a latent bug. That's not 100% certain though, besides which we should worry about extensions or future back-patch fixes that could introduce such code paths. I chose to back-patch to v12. Fixing RelationGetBufferForTuple before that would require also back-patching portions of older fixes (notably 0d1fe9f74), which is more code churn than seems prudent to fix a hypothetical issue. Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
2021-04-13 18:17:24 +02:00
Assert(targetBlock > otherBlock);
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
if (unlikely(!ConditionalLockBuffer(otherBuffer)))
{
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
unlockedTargetBuffer = true;
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
recheckVmPins = true;
}
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
/*
* If one of the buffers was unlocked (always the case if otherBuffer is
* valid), it's possible, although unlikely, that an all-visible flag
* became set. We can use GetVisibilityMapPins to deal with that. It's
* possible that GetVisibilityMapPins() might need to temporarily release
* buffer locks, in which case we'll need to check if there's still enough
* space on the page below.
*/
if (recheckVmPins)
{
if (GetVisibilityMapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer))
unlockedTargetBuffer = true;
}
Avoid improbable PANIC during heap_update. heap_update needs to clear any existing "all visible" flag on the old tuple's page (and on the new page too, if different). Per coding rules, to do this it must acquire pin on the appropriate visibility-map page while not holding exclusive buffer lock; which creates a race condition since someone else could set the flag whenever we're not holding the buffer lock. The code is supposed to handle that by re-checking the flag after acquiring buffer lock and retrying if it became set. However, one code path through heap_update itself, as well as one in its subroutine RelationGetBufferForTuple, failed to do this. The end result, in the unlikely event that a concurrent VACUUM did set the flag while we're transiently not holding lock, is a non-recurring "PANIC: wrong buffer passed to visibilitymap_clear" failure. This has been seen a few times in the buildfarm since recent VACUUM changes that added code paths that could set the all-visible flag while holding only exclusive buffer lock. Previously, the flag was (usually?) set only after doing LockBufferForCleanup, which would insist on buffer pin count zero, thus preventing the flag from becoming set partway through heap_update. However, it's clear that it's heap_update not VACUUM that's at fault here. What's less clear is whether there is any hazard from these bugs in released branches. heap_update is certainly violating API expectations, but if there is no code path that can set all-visible without a cleanup lock then it's only a latent bug. That's not 100% certain though, besides which we should worry about extensions or future back-patch fixes that could introduce such code paths. I chose to back-patch to v12. Fixing RelationGetBufferForTuple before that would require also back-patching portions of older fixes (notably 0d1fe9f74), which is more code churn than seems prudent to fix a hypothetical issue. Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
2021-04-13 18:17:24 +02:00
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
/*
* If the target buffer was temporarily unlocked since the relation
* extension, it's possible, although unlikely, that all the space on the
* page was already used. If so, we just retry from the start. If we
* didn't unlock, something has gone wrong if there's not enough space -
* the test at the top should have prevented reaching this case.
*/
pageFreeSpace = PageGetHeapFreeSpace(page);
if (len > pageFreeSpace)
{
if (unlockedTargetBuffer)
{
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
if (otherBuffer != InvalidBuffer)
LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
UnlockReleaseBuffer(buffer);
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
goto loop;
Move page initialization from RelationAddExtraBlocks() to use, take 2. Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: It ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space, have different amount of free space (previously determined by PageGetHeapFreeSpace()). That we're relying on initializing pages, but not WAL logging the initialization, also means the risk for getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns, is considerably higher. The warning sounds much more serious than what they are. Fix those two issues together by not initializing pages in RelationAddExtraPages() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose to not issue a debug message, much less a warning in that case - it seems rarely useful, and quite likely to scare people unnecessarily. For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources for empty pages, there's no corresponding WAL records triggering FSM updates during replay. Previously when extending the relation, there was a moment between extending the relation, and acquiring an exclusive lock on the new page, in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space, without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 684200543b. This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that's rarely hit at runtime. Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
2019-02-03 10:27:19 +01:00
}
elog(PANIC, "tuple is too big: size %zu", len);
}
/*
* Remember the new page as our target for future insertions.
*
* XXX should we enter the new page into the free space map immediately,
* or just keep it for this backend's exclusive use in the short run
* (until VACUUM sees it)? Seems to depend on whether you expect the
* current backend to make more insertions or not, which is probably a
* good bet most of the time. So for now, don't add it to FSM yet.
*/
hio: Don't pin the VM while holding buffer lock while extending Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock. Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention. We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on. While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock. To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with. Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 20:11:13 +02:00
RelationSetTargetBlock(relation, targetBlock);
return buffer;
}