/*-------------------------------------------------------------------------
 *
 * localbuf.c
 *	  local buffer manager. Fast buffer manager for temporary tables,
 *	  which never need to be WAL-logged or checkpointed, etc.
 *
 * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994-5, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/backend/storage/buffer/localbuf.c
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/resowner.h"


/*#define LBDEBUG*/

/* entry for buffer lookup hashtable */
typedef struct
{
	BufferTag	key;			/* Tag of a disk page */
	int			id;				/* Associated local buffer's index */
} LocalBufferLookupEnt;

/* Note: this macro only works on local buffers, not shared ones! */
#define LocalBufHdrGetBlock(bufHdr) \
	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
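
/*
 * Worked example of the index arithmetic above (for illustration): local
 * buffers use negative Buffer numbers.  The local buffer at array index i
 * is assigned buf_id = -i - 2, so its Buffer number (buf_id + 1) is
 * -i - 1, and the macro recovers the array index as -(buf_id + 2) = i.
 * Thus index 0 has buf_id = -2 and Buffer number -1.
 */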

int			NLocBuffer = 0;		/* until buffers are initialized */

BufferDesc *LocalBufferDescriptors = NULL;
Block	   *LocalBufferBlockPointers = NULL;
int32	   *LocalRefCount = NULL;

static int	nextFreeLocalBufId = 0;

static HTAB *LocalBufHash = NULL;

/* number of local buffers pinned at least once */
static int	NLocalPinnedBuffers = 0;


static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
static Buffer GetLocalVictimBuffer(void);


/*
 * PrefetchLocalBuffer -
 *	  initiate asynchronous read of a block of a relation
 *
 * Do PrefetchBuffer's work for temporary relations.
 * No-op if prefetching isn't compiled in.
 */
PrefetchBufferResult
PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
					BlockNumber blockNum)
{
	PrefetchBufferResult result = {InvalidBuffer, false};
	BufferTag	newTag;			/* identity of requested block */
	LocalBufferLookupEnt *hresult;

	InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);

	/* Initialize local buffers if first request in this session */
	if (LocalBufHash == NULL)
		InitLocalBuffers();

	/* See if the desired buffer already exists */
	hresult = (LocalBufferLookupEnt *)
		hash_search(LocalBufHash, &newTag, HASH_FIND, NULL);

	if (hresult)
	{
		/* Yes, so nothing to do */
		result.recent_buffer = -hresult->id - 1;
	}
	else
	{
#ifdef USE_PREFETCH
		/* Not in buffers, so initiate prefetch */
		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
			smgrprefetch(smgr, forkNum, blockNum))
		{
			result.initiated_io = true;
		}
#endif							/* USE_PREFETCH */
	}

	return result;
}
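
/*
 * Illustrative call pattern (a sketch, not taken from a real caller; in
 * practice bufmgr.c's PrefetchBuffer() dispatches here for temporary
 * relations):
 *
 *		PrefetchBufferResult r;
 *
 *		r = PrefetchLocalBuffer(smgr, MAIN_FORKNUM, blockNum);
 *		if (r.recent_buffer != InvalidBuffer)
 *			... block is already resident in a local buffer ...
 *		else if (r.initiated_io)
 *			... an asynchronous read was started ...
 */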

/*
 * LocalBufferAlloc -
 *	  Find or create a local buffer for the given page of the given relation.
 *
 * API is similar to bufmgr.c's BufferAlloc, except that we do not need
 * to do any locking since this is all local.  Also, IO_IN_PROGRESS
 * does not get set.  Lastly, we support only default access strategy
 * (hence, usage_count is always advanced).
 */
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
				 bool *foundPtr)
{
	BufferTag	newTag;			/* identity of requested block */
	LocalBufferLookupEnt *hresult;
	BufferDesc *bufHdr;
	Buffer		victim_buffer;
	int			bufid;
	bool		found;

	InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);

	/* Initialize local buffers if first request in this session */
	if (LocalBufHash == NULL)
		InitLocalBuffers();

	ResourceOwnerEnlarge(CurrentResourceOwner);

	/* See if the desired buffer already exists */
	hresult = (LocalBufferLookupEnt *)
		hash_search(LocalBufHash, &newTag, HASH_FIND, NULL);

	if (hresult)
	{
		bufid = hresult->id;
		bufHdr = GetLocalBufferDescriptor(bufid);
		Assert(BufferTagsEqual(&bufHdr->tag, &newTag));

		*foundPtr = PinLocalBuffer(bufHdr, true);
	}
	else
	{
		uint32		buf_state;

		victim_buffer = GetLocalVictimBuffer();
		bufid = -victim_buffer - 1;
		bufHdr = GetLocalBufferDescriptor(bufid);

		hresult = (LocalBufferLookupEnt *)
			hash_search(LocalBufHash, &newTag, HASH_ENTER, &found);
		if (found)				/* shouldn't happen */
			elog(ERROR, "local buffer hash table corrupted");
		hresult->id = bufid;

		/*
		 * it's all ours now.
		 */
		bufHdr->tag = newTag;
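
		/*
		 * The buffer state packs flags and usage count into one atomic
		 * uint32.  Local buffers are backend-private, so the cheaper
		 * unlocked write below suffices: clear any old flags and usage
		 * count, then mark the tag valid with an initial usage count of
		 * one.
		 */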
		buf_state = pg_atomic_read_u32(&bufHdr->state);
		buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);

		*foundPtr = false;
	}

	return bufHdr;
}
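
/*
 * Sketch of the caller's contract (illustrative; the real callers are in
 * bufmgr.c's ReadBuffer machinery):
 *
 *		bool		found;
 *		BufferDesc *bufHdr;
 *
 *		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
 *		if (!found)
 *			... read the page into LocalBufHdrGetBlock(bufHdr) ...
 *
 * On return the buffer is pinned; *foundPtr reports whether it already
 * held valid contents.
 */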

static Buffer
GetLocalVictimBuffer(void)
{
	int			victim_bufid;
	int			trycounter;
	uint32		buf_state;
	BufferDesc *bufHdr;

	ResourceOwnerEnlarge(CurrentResourceOwner);

	/*
	 * Need to get a new buffer.  We use a clock sweep algorithm (essentially
	 * the same as what freelist.c does now...)
	 */
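	/*
	 * Each sweep step decrements an unpinned buffer's usage count; a buffer
	 * is usable once both its local refcount and usage count reach zero.
	 * trycounter bounds the search: passing NLocBuffer pinned buffers in a
	 * row without a single usage-count decrement means no victim exists.
	 */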
	trycounter = NLocBuffer;
	for (;;)
	{
		victim_bufid = nextFreeLocalBufId;

		if (++nextFreeLocalBufId >= NLocBuffer)
			nextFreeLocalBufId = 0;

		bufHdr = GetLocalBufferDescriptor(victim_bufid);

		if (LocalRefCount[victim_bufid] == 0)
		{
			buf_state = pg_atomic_read_u32(&bufHdr->state);

			if (BUF_STATE_GET_USAGECOUNT(buf_state) > 0)
			{
				buf_state -= BUF_USAGECOUNT_ONE;
				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
				trycounter = NLocBuffer;
			}
			else
			{
				/* Found a usable buffer */
				PinLocalBuffer(bufHdr, false);
				break;
			}
		}
		else if (--trycounter == 0)
			ereport(ERROR,
					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
					 errmsg("no empty local buffer available")));
	}

	/*
	 * lazy memory allocation: allocate space on first use of a buffer.
	 */
	if (LocalBufHdrGetBlock(bufHdr) == NULL)
	{
		/* Set pointer for use by BufferGetBlock() macro */
		LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
	}

	/*
	 * this buffer is not referenced but it might still be dirty. if that's
	 * the case, write it out before reusing it!
	 */
	if (buf_state & BM_DIRTY)
	{
		instr_time	io_start;
		SMgrRelation oreln;
		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);

		/* Find smgr relation for buffer */
		oreln = smgropen(BufTagGetRelFileLocator(&bufHdr->tag), MyBackendId);

		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

		io_start = pgstat_prepare_io_time();

		/* And write... */
		smgrwrite(oreln,
				  BufTagGetForkNum(&bufHdr->tag),
				  bufHdr->tag.blockNum,
				  localpage,
				  false);

		/* Temporary table I/O does not use Buffer Access Strategies */
		pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL,
								IOOP_WRITE, io_start, 1);

		/* Mark not-dirty now in case we error out below */
		buf_state &= ~BM_DIRTY;
		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);

		pgBufferUsage.local_blks_written++;
	}

	/*
	 * Remove the victim buffer from the hashtable and mark as invalid.
	 */
	if (buf_state & BM_TAG_VALID)
	{
		LocalBufferLookupEnt *hresult;

		hresult = (LocalBufferLookupEnt *)
			hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
		if (!hresult)			/* shouldn't happen */
			elog(ERROR, "local buffer hash table corrupted");
		/* mark buffer invalid just in case hash insert fails */
		ClearBufferTag(&bufHdr->tag);
		buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);

		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EVICT);
	}

	return BufferDescriptorGetBuffer(bufHdr);
}

/* see LimitAdditionalPins() */
static void
LimitAdditionalLocalPins(uint32 *additional_pins)
{
	uint32		max_pins;

	if (*additional_pins <= 1)
		return;

	/*
	 * In contrast to LimitAdditionalPins() other backends don't play a role
	 * here. We can allow up to NLocBuffer pins in total.
	 */
	max_pins = (NLocBuffer - NLocalPinnedBuffers);

	if (*additional_pins >= max_pins)
		*additional_pins = max_pins;
}
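
/*
 * Worked example (illustrative): with NLocBuffer = 1024 and
 * NLocalPinnedBuffers = 1000, max_pins is 24, so a request for 64
 * additional pins is clamped to 24.  A request for a single pin passes
 * through unclamped; the caller is always entitled to at least one.
 */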
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Implementation of ExtendBufferedRelBy() and ExtendBufferedRelTo() for
|
|
|
|
* temporary buffers.
|
|
|
|
*/
|
|
|
|
BlockNumber
|
2023-08-23 02:10:18 +02:00
|
|
|
ExtendBufferedRelLocal(BufferManagerRelation bmr,
|
bufmgr: Introduce infrastructure for faster relation extension
The primary bottlenecks for relation extension are:
1) The extension lock is held while acquiring a victim buffer for the new
page. Acquiring a victim buffer can require writing out the old page
contents including possibly needing to flush WAL.
2) When extending via ReadBuffer() et al, we write a zero page during the
extension, and then later write out the actual page contents. This can
nearly double the write rate.
3) The existing bulk relation extension infrastructure in hio.c just amortized
the cost of acquiring the relation extension lock, but none of the other
costs.
Unfortunately 1) cannot currently be addressed in a central manner as the
callers to ReadBuffer() need to acquire the extension lock. To address that,
this this commit moves the responsibility for acquiring the extension lock
into bufmgr.c functions. That allows to acquire the relation extension lock
for just the required time. This will also allow us to improve relation
extension further, without changing callers.
The reason we write all-zeroes pages during relation extension is that we hope
to get ENOSPC errors earlier that way (largely works, except for CoW
filesystems). It is easier to handle out-of-space errors gracefully if the
page doesn't yet contain actual tuples. This commit addresses 2), by using the
recently introduced smgrzeroextend(), which extends the relation, without
dirtying the kernel page cache for all the extended pages.
To address 3), this commit introduces a function to extend a relation by
multiple blocks at a time.
There are three new exposed functions: ExtendBufferedRel() for extending the
relation by a single block, ExtendBufferedRelBy() to extend a relation by
multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up
to a certain size.
To avoid duplicating code between ReadBuffer(P_NEW) and the new functions,
ReadBuffer(P_NEW) now implements relation extension with
ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the
relation lock is already held.
Note that this commit does not yet lead to a meaningful performance or
scalability improvement - for that uses of ReadBuffer(P_NEW) will need to be
converted to ExtendBuffered*(), which will be done in subsequent commits.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-06 01:21:09 +02:00
|
|
|
ForkNumber fork,
|
|
|
|
uint32 flags,
|
|
|
|
uint32 extend_by,
|
|
|
|
BlockNumber extend_upto,
|
|
|
|
Buffer *buffers,
|
|
|
|
uint32 *extended_by)
|
|
|
|
{
|
|
|
|
BlockNumber first_block;
|
2023-04-08 01:05:26 +02:00
|
|
|
instr_time io_start;

	/* Initialize local buffers if first request in this session */
	if (LocalBufHash == NULL)
		InitLocalBuffers();

	LimitAdditionalLocalPins(&extend_by);

	for (uint32 i = 0; i < extend_by; i++)
	{
		BufferDesc *buf_hdr;
		Block		buf_block;

		buffers[i] = GetLocalVictimBuffer();
		buf_hdr = GetLocalBufferDescriptor(-buffers[i] - 1);
		buf_block = LocalBufHdrGetBlock(buf_hdr);

		/* new buffers are zero-filled */
		MemSet((char *) buf_block, 0, BLCKSZ);
	}
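
	/*
	 * All victim buffers are acquired and pinned up front, so the
	 * relation-size lookup just below and the physical extension further
	 * down each run only once for the whole batch of new blocks.
	 */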
	first_block = smgrnblocks(bmr.smgr, fork);

	if (extend_upto != InvalidBlockNumber)
	{
		/*
		 * In contrast to shared relations, nothing could change the relation
		 * size concurrently. Thus we shouldn't end up finding that we don't
		 * need to do anything.
		 */
		Assert(first_block <= extend_upto);

		Assert((uint64) first_block + extend_by <= extend_upto);
	}

	/* Fail if relation is already at maximum possible length */
	if ((uint64) first_block + extend_by >= MaxBlockNumber)
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("cannot extend relation %s beyond %u blocks",
						relpath(bmr.smgr->smgr_rlocator, fork),
						MaxBlockNumber)));
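
	/*
	 * For scale: MaxBlockNumber is 0xFFFFFFFE, so with the default 8 kB
	 * BLCKSZ this limit corresponds to roughly 32 TB per relation fork.
	 */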

	for (uint32 i = 0; i < extend_by; i++)
	{
		int			victim_buf_id;
		BufferDesc *victim_buf_hdr;
		BufferTag	tag;
		LocalBufferLookupEnt *hresult;
		bool		found;

		victim_buf_id = -buffers[i] - 1;
		victim_buf_hdr = GetLocalBufferDescriptor(victim_buf_id);

		InitBufferTag(&tag, &bmr.smgr->smgr_rlocator.locator, fork, first_block + i);

		hresult = (LocalBufferLookupEnt *)
			hash_search(LocalBufHash, (void *) &tag, HASH_ENTER, &found);
		if (found)
		{
			BufferDesc *existing_hdr;
			uint32		buf_state;

			UnpinLocalBuffer(BufferDescriptorGetBuffer(victim_buf_hdr));

			existing_hdr = GetLocalBufferDescriptor(hresult->id);
			PinLocalBuffer(existing_hdr, false);
			buffers[i] = BufferDescriptorGetBuffer(existing_hdr);

			/*
			 * Mark the pre-existing buffer not-yet-valid; the block is
			 * zeroed on disk by smgrzeroextend() below and the buffer is
			 * marked BM_VALID again afterwards.
			 */
			buf_state = pg_atomic_read_u32(&existing_hdr->state);
			Assert(buf_state & BM_TAG_VALID);
			Assert(!(buf_state & BM_DIRTY));
			buf_state &= ~BM_VALID;
			pg_atomic_unlocked_write_u32(&existing_hdr->state, buf_state);
		}
		else
		{
			uint32		buf_state = pg_atomic_read_u32(&victim_buf_hdr->state);

			Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));

			victim_buf_hdr->tag = tag;

			buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;

			pg_atomic_unlocked_write_u32(&victim_buf_hdr->state, buf_state);

			hresult->id = victim_buf_id;
		}
	}

	io_start = pgstat_prepare_io_time();

	/* actually extend relation */
	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false);

	pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
							io_start, extend_by);

	for (uint32 i = 0; i < extend_by; i++)
	{
		Buffer		buf = buffers[i];
		BufferDesc *buf_hdr;
		uint32		buf_state;

		buf_hdr = GetLocalBufferDescriptor(-buf - 1);

		buf_state = pg_atomic_read_u32(&buf_hdr->state);
		buf_state |= BM_VALID;
		pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
	}

	*extended_by = extend_by;

	pgBufferUsage.local_blks_written += extend_by;

	return first_block;
}
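
/*
 * Usage sketch (hypothetical caller; the real entry points are
 * ExtendBufferedRelBy() and ExtendBufferedRelTo() in bufmgr.c, which route
 * here for temporary relations):
 *
 *		Buffer		bufs[16];
 *		uint32		extended_by = 0;
 *		BlockNumber first;
 *
 *		first = ExtendBufferedRelLocal(bmr, MAIN_FORKNUM, 0,
 *									   lengthof(bufs), InvalidBlockNumber,
 *									   bufs, &extended_by);
 *
 * On return, blocks first .. first + extended_by - 1 exist on disk, and
 * bufs[0 .. extended_by - 1] hold pinned, zero-filled, BM_VALID pages.
 * extended_by may be less than requested, per LimitAdditionalLocalPins().
 */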

/*
 * MarkLocalBufferDirty -
 *	  mark a local buffer dirty
 */
void
MarkLocalBufferDirty(Buffer buffer)
{
	int			bufid;
	BufferDesc *bufHdr;
	uint32		buf_state;

	Assert(BufferIsLocal(buffer));

#ifdef LBDEBUG
	fprintf(stderr, "LB DIRTY %d\n", buffer);
#endif

	bufid = -buffer - 1;

	Assert(LocalRefCount[bufid] > 0);

	bufHdr = GetLocalBufferDescriptor(bufid);

	buf_state = pg_atomic_read_u32(&bufHdr->state);

	if (!(buf_state & BM_DIRTY))
		pgBufferUsage.local_blks_dirtied++;

	buf_state |= BM_DIRTY;

	pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
}
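
/*
 * Usage sketch (hypothetical caller; ordinary code goes through
 * MarkBufferDirty() in bufmgr.c, which dispatches here for local buffers):
 *
 *		Page		page = BufferGetPage(buf);	// buf: pinned temp-rel buffer
 *
 *		... modify page contents ...
 *		MarkLocalBufferDirty(buf);
 *
 * The pin is mandatory; note the Assert on LocalRefCount above.
 */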

/*
 * DropRelationLocalBuffers
 *		This function removes from the buffer pool all the pages of the
 *		specified relation that have block numbers >= firstDelBlock.
 *		(In particular, with firstDelBlock = 0, all pages are removed.)
 *		Dirty pages are simply dropped, without bothering to write them
 *		out first.  Therefore, this is NOT rollback-able, and so should be
 *		used only with extreme caution!
 *
 *		See DropRelationBuffers in bufmgr.c for more notes.
 */
void
DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum,
						 BlockNumber firstDelBlock)
{
	int			i;

	for (i = 0; i < NLocBuffer; i++)
	{
		BufferDesc *bufHdr = GetLocalBufferDescriptor(i);
		LocalBufferLookupEnt *hresult;
		uint32		buf_state;

		buf_state = pg_atomic_read_u32(&bufHdr->state);

		if ((buf_state & BM_TAG_VALID) &&
			BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator) &&
			BufTagGetForkNum(&bufHdr->tag) == forkNum &&
			bufHdr->tag.blockNum >= firstDelBlock)
		{
			if (LocalRefCount[i] != 0)
				elog(ERROR, "block %u of %s is still referenced (local %u)",
					 bufHdr->tag.blockNum,
					 relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
									MyBackendId,
									BufTagGetForkNum(&bufHdr->tag)),
					 LocalRefCount[i]);

			/* Remove entry from hashtable */
			hresult = (LocalBufferLookupEnt *)
				hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
			if (!hresult)		/* shouldn't happen */
				elog(ERROR, "local buffer hash table corrupted");
			/* Mark buffer invalid */
			ClearBufferTag(&bufHdr->tag);
			buf_state &= ~BUF_FLAG_MASK;
			buf_state &= ~BUF_USAGECOUNT_MASK;
			pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
		}
	}
}
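
/*
 * Sketch of a typical path (summary, not verbatim caller code): truncating
 * a temporary relation reaches DropRelationBuffers() in bufmgr.c, which for
 * a backend-local relation forwards to this function, e.g.
 *
 *		DropRelationLocalBuffers(rlocator, MAIN_FORKNUM, new_nblocks);
 *
 * discards every locally cached page of the main fork at or beyond
 * new_nblocks.
 */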

/*
 * DropRelationAllLocalBuffers
 *		This function removes from the buffer pool all pages of all forks
 *		of the specified relation.
 *
 *		See DropRelationsAllBuffers in bufmgr.c for more notes.
 */
void
DropRelationAllLocalBuffers(RelFileLocator rlocator)
{
	int			i;

	for (i = 0; i < NLocBuffer; i++)
	{
		BufferDesc *bufHdr = GetLocalBufferDescriptor(i);
		LocalBufferLookupEnt *hresult;
		uint32		buf_state;

		buf_state = pg_atomic_read_u32(&bufHdr->state);

		if ((buf_state & BM_TAG_VALID) &&
			BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator))
		{
			if (LocalRefCount[i] != 0)
				elog(ERROR, "block %u of %s is still referenced (local %u)",
					 bufHdr->tag.blockNum,
					 relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
									MyBackendId,
									BufTagGetForkNum(&bufHdr->tag)),
					 LocalRefCount[i]);

			/* Remove entry from hashtable */
			hresult = (LocalBufferLookupEnt *)
				hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
			if (!hresult)		/* shouldn't happen */
				elog(ERROR, "local buffer hash table corrupted");
			/* Mark buffer invalid */
			ClearBufferTag(&bufHdr->tag);
			buf_state &= ~BUF_FLAG_MASK;
			buf_state &= ~BUF_USAGECOUNT_MASK;
			pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
		}
	}
}
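
/*
 * Sketch of a typical path (summary, not verbatim caller code): dropping a
 * temporary relation reaches DropRelationsAllBuffers() in bufmgr.c, which
 * for backend-local relations forwards each one here:
 *
 *		DropRelationAllLocalBuffers(rlocator);
 *
 * after which no page of any fork of that relation remains in local buffers.
 */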
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2005-03-20 00:27:11 +01:00
|
|
|
* InitLocalBuffers -
|
1996-07-09 08:22:35 +02:00
|
|
|
* init the local buffer cache. Since most queries (esp. multi-user ones)
|
2000-11-30 02:39:08 +01:00
|
|
|
* don't involve local buffers, we delay allocating actual memory for the
|
2002-08-06 04:36:35 +02:00
|
|
|
* buffers until we need them; just make the buffer headers here.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2005-03-20 00:27:11 +01:00
|
|
|
static void
|
|
|
|
InitLocalBuffers(void)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2005-03-20 00:27:11 +01:00
|
|
|
int nbufs = num_temp_buffers;
|
2005-03-19 18:39:43 +01:00
|
|
|
HASHCTL info;
|
1996-07-09 08:22:35 +02:00
|
|
|
int i;
|
|
|
|
|
Improve the situation for parallel query versus temp relations.
2016-06-10 02:16:11 +02:00
|
|
|
/*
|
|
|
|
* Parallel workers can't access data in temporary tables, because they
|
|
|
|
* have no visibility into the local buffers of their leader. This is a
|
|
|
|
* convenient, low-cost place to provide a backstop check for that. Note
|
|
|
|
* that we don't wish to prevent a parallel worker from accessing catalog
|
|
|
|
* metadata about a temp table, so checks at higher levels would be
|
|
|
|
* inappropriate.
|
|
|
|
*/
|
|
|
|
if (IsParallelWorker())
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
|
|
|
|
errmsg("cannot access temporary tables during a parallel operation")));
|
|
|
|
|
2005-03-19 18:39:43 +01:00
|
|
|
/* Allocate and zero buffer headers and auxiliary arrays */
|
2005-08-21 01:26:37 +02:00
|
|
|
LocalBufferDescriptors = (BufferDesc *) calloc(nbufs, sizeof(BufferDesc));
|
|
|
|
LocalBufferBlockPointers = (Block *) calloc(nbufs, sizeof(Block));
|
|
|
|
LocalRefCount = (int32 *) calloc(nbufs, sizeof(int32));
|
|
|
|
if (!LocalBufferDescriptors || !LocalBufferBlockPointers || !LocalRefCount)
|
|
|
|
ereport(FATAL,
|
|
|
|
(errcode(ERRCODE_OUT_OF_MEMORY),
|
|
|
|
errmsg("out of memory")));
|
2005-03-19 18:39:43 +01:00
|
|
|
|
2023-04-05 22:47:46 +02:00
|
|
|
nextFreeLocalBufId = 0;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2005-03-19 18:39:43 +01:00
|
|
|
/* initialize fields that need to start off nonzero */
|
|
|
|
for (i = 0; i < nbufs; i++)
|
1997-09-07 07:04:48 +02:00
|
|
|
{
|
Align buffer descriptors to cache line boundaries.
Benchmarks have shown that aligning the buffer descriptor array to
cache lines is important for scalability, especially on bigger,
multi-socket machines.
Currently the array is sometimes aligned merely by happenstance,
depending on how large previous shared memory allocations
were. That can lead to wildly varying performance results after minor
configuration changes.
In addition to aligning the start of the descriptor array, also force the
size of individual descriptors to be a common cache line size (64
bytes). That already happens to be the case on 64bit platforms, but
this way we can change struct BufferDesc more easily.
As the alignment primarily matters in highly concurrent workloads,
which these days are probably all 64bit, and the space wastage of
element alignment would be a bit more noticeable on 32bit systems, we
don't force the stride to be cacheline sized on 32bit platforms for
now. If somebody does actual performance testing, we can reevaluate
that decision by changing the definition of BUFFERDESC_PADDED_SIZE.
Discussion: 20140202151319.GD32123@awork2.anarazel.de
Per discussion with Bruce Momjan, Tom Lane, Robert Haas, and Peter
Geoghegan.
2015-01-29 17:49:03 +01:00
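The padding technique itself can be sketched as follows; the descriptor
struct and names are illustrative, but the union-forced stride is what the
commit describes (the real code pads BufferDesc to BUFFERDESC_PADDED_SIZE).

#include <stdint.h>

#define MY_CACHELINE_SIZE 64    /* assumed cache line size */

typedef struct MyDesc
{
    int         buf_id;         /* illustrative fields */
    uint32_t    state;
} MyDesc;

/* the union forces each array element onto a 64-byte stride */
typedef union MyDescPadded
{
    MyDesc      desc;
    char        pad[MY_CACHELINE_SIZE];
} MyDescPadded;

/* always reach a descriptor through the padded array */
#define GetMyDescriptor(array, id) (&(array)[(id)].desc)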
|
|
|
BufferDesc *buf = GetLocalBufferDescriptor(i);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
/*
|
1996-07-09 08:22:35 +02:00
|
|
|
* negative to indicate local buffer. This is tricky: shared buffers
|
|
|
|
* start with 0. We have to start with -2. (Note that the routine
|
|
|
|
* BufferDescriptorGetBuffer adds 1 to buf_id so our first buffer id
|
|
|
|
* is -1.)
|
1997-09-07 07:04:48 +02:00
|
|
|
*/
|
1996-07-09 08:22:35 +02:00
|
|
|
buf->buf_id = -i - 2;
|
2016-04-14 00:28:29 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Intentionally do not initialize the buffer's atomic variable
|
|
|
|
* (besides zeroing the underlying memory above). That way we get
|
|
|
|
* errors on platforms without atomics, if somebody (re-)introduces
|
|
|
|
* atomic operations for local buffers.
|
|
|
|
*/
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
2005-03-19 18:39:43 +01:00
|
|
|
|
2005-03-20 00:27:11 +01:00
|
|
|
/* Create the lookup hash table */
|
|
|
|
info.keysize = sizeof(BufferTag);
|
|
|
|
info.entrysize = sizeof(LocalBufferLookupEnt);
|
|
|
|
|
|
|
|
LocalBufHash = hash_create("Local Buffer Lookup Table",
|
|
|
|
nbufs,
|
|
|
|
&info,
|
Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash, which is tedious and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out to saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too; I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
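For contrast, a hedged before/after sketch of the call annotated below; the
"before" form with an explicit support function is the idiom this commit
replaces (HASH_FUNCTION and tag_hash are the pre-existing dynahash names).

HASHCTL     info;
HTAB       *htab;
int         nbufs = 1024;       /* illustrative */

info.keysize = sizeof(BufferTag);
info.entrysize = sizeof(LocalBufferLookupEnt);

/* before: callers picked the support function themselves */
info.hash = tag_hash;
htab = hash_create("Local Buffer Lookup Table", nbufs, &info,
                   HASH_ELEM | HASH_FUNCTION);

/* after: HASH_BLOBS lets hash_create() choose (hash_uint32 for 4-byte keys) */
htab = hash_create("Local Buffer Lookup Table", nbufs, &info,
                   HASH_ELEM | HASH_BLOBS);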
|
|
|
HASH_ELEM | HASH_BLOBS);
|
2005-03-20 00:27:11 +01:00
|
|
|
|
|
|
|
if (!LocalBufHash)
|
|
|
|
elog(ERROR, "could not initialize local buffer hash table");
|
|
|
|
|
2005-03-19 18:39:43 +01:00
|
|
|
/* Initialization done, mark buffers allocated */
|
|
|
|
NLocBuffer = nbufs;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
2023-04-05 19:42:17 +02:00
|
|
|
/*
|
|
|
|
* XXX: We could have a slightly more efficient version of PinLocalBuffer()
|
|
|
|
* that does not support adjusting the usagecount - but so far it does not
|
|
|
|
* seem worth the trouble.
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
|
|
|
|
{
|
|
|
|
uint32 buf_state;
|
|
|
|
Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
|
|
|
|
int bufid = -buffer - 1;
|
|
|
|
|
|
|
|
buf_state = pg_atomic_read_u32(&buf_hdr->state);
|
|
|
|
|
|
|
|
if (LocalRefCount[bufid] == 0)
|
|
|
|
{
|
bufmgr: Introduce infrastructure for faster relation extension
The primary bottlenecks for relation extension are:
1) The extension lock is held while acquiring a victim buffer for the new
page. Acquiring a victim buffer can require writing out the old page
contents including possibly needing to flush WAL.
2) When extending via ReadBuffer() et al, we write a zero page during the
extension, and then later write out the actual page contents. This can
nearly double the write rate.
3) The existing bulk relation extension infrastructure in hio.c just amortized
the cost of acquiring the relation extension lock, but none of the other
costs.
Unfortunately 1) cannot currently be addressed in a central manner as the
callers of ReadBuffer() need to acquire the extension lock. To address that,
this commit moves the responsibility for acquiring the extension lock
into bufmgr.c functions. That allows acquiring the relation extension lock
for just the required time. This will also allow us to improve relation
extension further, without changing callers.
The reason we write all-zeroes pages during relation extension is that we hope
to get ENOSPC errors earlier that way (this largely works, except on CoW
filesystems). It is easier to handle out-of-space errors gracefully if the
page doesn't yet contain actual tuples. This commit addresses 2) by using the
recently introduced smgrzeroextend(), which extends the relation without
dirtying the kernel page cache for all the extended pages.
To address 3), this commit introduces a function to extend a relation by
multiple blocks at a time.
There are three new exposed functions: ExtendBufferedRel() for extending the
relation by a single block, ExtendBufferedRelBy() to extend a relation by
multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up
to a certain size.
To avoid duplicating code between ReadBuffer(P_NEW) and the new functions,
ReadBuffer(P_NEW) now implements relation extension with
ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the
relation lock is already held.
Note that this commit does not yet lead to a meaningful performance or
scalability improvement - for that, uses of ReadBuffer(P_NEW) will need to be
converted to ExtendBuffered*(), which will be done in subsequent commits.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-06 01:21:09 +02:00
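A hedged usage sketch of the new multi-block API named above, assuming a
valid Relation and omitting error handling; the argument order follows my
reading of the new functions, and the caller must eventually release the
pinned buffers.

static BlockNumber
extend_by_sixteen(Relation rel)
{
    Buffer      buffers[16];
    uint32      extended_by = 0;

    /* extend rel's main fork by up to 16 blocks in one call */
    return ExtendBufferedRelBy(BMR_REL(rel),    /* which relation */
                               MAIN_FORKNUM,
                               NULL,            /* default strategy */
                               EB_LOCK_FIRST,   /* lock first new page */
                               16,              /* blocks requested */
                               buffers,         /* out: pinned buffers */
                               &extended_by);   /* out: blocks added */
}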
|
|
|
NLocalPinnedBuffers++;
|
2023-04-05 19:42:17 +02:00
|
|
|
if (adjust_usagecount &&
|
|
|
|
BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
|
|
|
|
{
|
|
|
|
buf_state += BUF_USAGECOUNT_ONE;
|
|
|
|
pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
LocalRefCount[bufid]++;
|
|
|
|
ResourceOwnerRememberBuffer(CurrentResourceOwner,
|
|
|
|
BufferDescriptorGetBuffer(buf_hdr));
|
|
|
|
|
|
|
|
return buf_state & BM_VALID;
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
UnpinLocalBuffer(Buffer buffer)
|
Make ResourceOwners more easily extensible.
Instead of having a separate array/hash for each resource kind, use a
single array and hash to hold all kinds of resources. This makes it
possible to introduce new resource "kinds" without having to modify
the ResourceOwnerData struct. In particular, this makes it possible
for extensions to register custom resource kinds.
The old approach was to have a small array of resources of each kind,
and if it fills up, switch to a hash table. The new approach also uses
an array and a hash, but now the array and the hash are used at the
same time. The array is used to hold the recently added resources, and
when it fills up, they are moved to the hash. This keeps the access to
recent entries fast, even when there are a lot of long-held resources.
All the resource-specific ResourceOwnerEnlarge*(),
ResourceOwnerRemember*(), and ResourceOwnerForget*() functions have
been replaced with three generic functions that take resource kind as
argument. For convenience, we still define resource-specific wrapper
macros around the generic functions with the old names, but they are
now defined in the source files that use those resource kinds.
The release callback no longer needs to call ResourceOwnerForget on
the resource being released. ResourceOwnerRelease unregisters the
resource from the owner before calling the callback. That needed some
changes in bufmgr.c and some other files, where releasing the
resources previously always called ResourceOwnerForget.
Each resource kind specifies a release priority, and
ResourceOwnerReleaseAll releases the resources in priority order. To
make that possible, we have to restrict what you can do between
phases. After calling ResourceOwnerRelease(), you are no longer
allowed to remember any more resources in it or to forget any
previously remembered resources by calling ResourceOwnerForget. There
was one case where that was done previously. At subtransaction commit,
AtEOSubXact_Inval() would handle the invalidation messages and call
RelationFlushRelation(), which temporarily increased the reference
count on the relation being flushed. We now switch to the parent
subtransaction's resource owner before calling AtEOSubXact_Inval(), so
that there is a valid ResourceOwner to temporarily hold that relcache
reference.
Other end-of-xact routines make similar calls to AtEOXact_Inval()
between release phases, but I didn't see any regression test failures
from those, so I'm not sure whether they could reach a codepath that
needs to remember extra resources.
There were two exceptions to how the resource leak WARNINGs on commit
were printed previously: llvmjit silently released the context without
printing the warning, and a leaked buffer I/O triggered a PANIC. Now
everything prints a WARNING, including those cases.
Add tests in src/test/modules/test_resowner.
Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud
Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu
Reviewed-by: Peter Eisentraut, Andres Freund
Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
2023-11-08 12:30:50 +01:00
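A hedged sketch of the extensibility this buys: a module defines its own
resource kind and thin wrappers over the three generic functions. The field
and constant names here reflect my reading of the new API; treat them as
assumptions.

static void
ResOwnerReleaseMyThing(Datum res)
{
    pfree(DatumGetPointer(res));        /* stub release callback */
}

static const ResourceOwnerDesc mything_resowner_desc =
{
    .name = "mything reference",
    .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
    .release_priority = RELEASE_PRIO_FIRST,
    .ReleaseResource = ResOwnerReleaseMyThing,
    .DebugPrint = NULL                  /* use the default formatting */
};

/* convenience wrappers, as the commit suggests each module now defines */
#define RememberMyThing(owner, thing) \
    ResourceOwnerRemember(owner, PointerGetDatum(thing), &mything_resowner_desc)
#define ForgetMyThing(owner, thing) \
    ResourceOwnerForget(owner, PointerGetDatum(thing), &mything_resowner_desc)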
|
|
|
{
|
|
|
|
UnpinLocalBufferNoOwner(buffer);
|
|
|
|
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
UnpinLocalBufferNoOwner(Buffer buffer)
|
2023-04-05 19:42:17 +02:00
|
|
|
{
|
|
|
|
int buffid = -buffer - 1;
|
|
|
|
|
|
|
|
Assert(BufferIsLocal(buffer));
|
|
|
|
Assert(LocalRefCount[buffid] > 0);
|
bufmgr: Introduce infrastructure for faster relation extension
2023-04-06 01:21:09 +02:00
|
|
|
Assert(NLocalPinnedBuffers > 0);
|
2023-04-05 19:42:17 +02:00
|
|
|
|
bufmgr: Introduce infrastructure for faster relation extension
2023-04-06 01:21:09 +02:00
|
|
|
if (--LocalRefCount[buffid] == 0)
|
|
|
|
NLocalPinnedBuffers--;
|
2023-04-05 19:42:17 +02:00
|
|
|
}
|
|
|
|
|
Split up guc.c for better build speed and ease of maintenance.
guc.c has grown to be one of our largest .c files, making it
a bottleneck for compilation. It has also acquired a bunch of
knowledge that would be better kept elsewhere, because of our
not-very-good habit of putting variable-specific check hooks here.
Hence, split it up along these lines:
* guc.c itself retains just the core GUC housekeeping mechanisms.
* New file guc_funcs.c contains the SET/SHOW interfaces and some
SQL-accessible functions for GUC manipulation.
* New file guc_tables.c contains the data arrays that define the
built-in GUC variables, along with some already-exported constant
tables.
* GUC check/assign/show hook functions are moved to the variable's
home module, whenever that's clearly identifiable. A few hard-
to-classify hooks ended up in commands/variable.c, which was
already a home for miscellaneous GUC hook functions.
To avoid cluttering a lot more header files with #include "guc.h",
I also invented a new header file utils/guc_hooks.h and put all
the GUC hook functions' declarations there, regardless of their
originating module. That allowed removal of #include "guc.h"
from some existing headers. The fallout from that (hopefully
all caught here) demonstrates clearly why such inclusions are
best minimized: there are a lot of files that, for example,
were getting array.h at two or more levels of remove, despite
not having any connection at all to GUCs in themselves.
There is some very minor code beautification here, such as
renaming a couple of inconsistently-named hook functions
and improving some comments. But mostly this just moves
code from point A to point B and deals with the ensuing
needs for #include adjustments and exporting a few functions
that previously weren't exported.
Patch by me, per a suggestion from Andres Freund; thanks also
to Michael Paquier for the idea to invent guc_funcs.c.
Discussion: https://postgr.es/m/587607.1662836699@sss.pgh.pa.us
2022-09-13 17:05:07 +02:00
|
|
|
/*
|
|
|
|
* GUC check_hook for temp_buffers
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
check_temp_buffers(int *newval, void **extra, GucSource source)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Once local buffers have been initialized, it's too late to change this.
|
|
|
|
* However, if this is only a test call, allow it.
|
|
|
|
*/
|
|
|
|
if (source != PGC_S_TEST && NLocBuffer && NLocBuffer != *newval)
|
|
|
|
{
|
|
|
|
GUC_check_errdetail("\"temp_buffers\" cannot be changed after any temporary tables have been accessed in the session.");
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2006-12-27 23:31:54 +01:00
|
|
|
/*
|
|
|
|
* GetLocalBufferStorage - allocate memory for a local buffer
|
|
|
|
*
|
|
|
|
* The idea of this function is to aggregate our requests for storage
|
|
|
|
* so that the memory manager doesn't see a whole lot of relatively small
|
|
|
|
* requests. Since we'll never give back a local buffer once it's created
|
|
|
|
* within a particular process, no point in burdening memmgr with separately
|
|
|
|
* managed chunks.
|
|
|
|
*/
|
|
|
|
static Block
|
|
|
|
GetLocalBufferStorage(void)
|
|
|
|
{
|
|
|
|
static char *cur_block = NULL;
|
|
|
|
static int next_buf_in_block = 0;
|
|
|
|
static int num_bufs_in_block = 0;
|
|
|
|
static int total_bufs_allocated = 0;
|
2010-08-19 18:16:20 +02:00
|
|
|
static MemoryContext LocalBufferContext = NULL;
|
2006-12-27 23:31:54 +01:00
|
|
|
|
|
|
|
char *this_buf;
|
|
|
|
|
|
|
|
Assert(total_bufs_allocated < NLocBuffer);
|
|
|
|
|
|
|
|
if (next_buf_in_block >= num_bufs_in_block)
|
|
|
|
{
|
|
|
|
/* Need to make a new request to memmgr */
|
|
|
|
int num_bufs;
|
|
|
|
|
2010-08-19 18:16:20 +02:00
|
|
|
/*
|
|
|
|
* We allocate local buffers in a context of their own, so that the
|
|
|
|
* space eaten for them is easily recognizable in MemoryContextStats
|
|
|
|
* output. Create the context on first use.
|
|
|
|
*/
|
|
|
|
if (LocalBufferContext == NULL)
|
|
|
|
LocalBufferContext =
|
|
|
|
AllocSetContextCreate(TopMemoryContext,
|
|
|
|
"LocalBufferContext",
|
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
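A hedged before/after sketch of the simplification; the three spelled-out
parameters are the ones ALLOCSET_DEFAULT_SIZES bundles.

MemoryContext cxt;

/* before: three easy-to-mistype sizing parameters */
cxt = AllocSetContextCreate(TopMemoryContext,
                            "LocalBufferContext",
                            ALLOCSET_DEFAULT_MINSIZE,
                            ALLOCSET_DEFAULT_INITSIZE,
                            ALLOCSET_DEFAULT_MAXSIZE);

/* after: one macro for the common case */
cxt = AllocSetContextCreate(TopMemoryContext,
                            "LocalBufferContext",
                            ALLOCSET_DEFAULT_SIZES);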
|
|
|
ALLOCSET_DEFAULT_SIZES);
|
2010-08-19 18:16:20 +02:00
|
|
|
|
2006-12-27 23:31:54 +01:00
|
|
|
/* Start with a 16-buffer request; subsequent ones double each time */
|
|
|
|
num_bufs = Max(num_bufs_in_block * 2, 16);
|
|
|
|
/* But not more than what we need for all remaining local bufs */
|
|
|
|
num_bufs = Min(num_bufs, NLocBuffer - total_bufs_allocated);
|
|
|
|
/* And don't overflow MaxAllocSize, either */
|
|
|
|
num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ);
|
|
|
|
|
2023-04-08 00:38:09 +02:00
|
|
|
/* Buffers should be I/O aligned. */
|
|
|
|
cur_block = (char *)
|
|
|
|
TYPEALIGN(PG_IO_ALIGN_SIZE,
|
|
|
|
MemoryContextAlloc(LocalBufferContext,
|
|
|
|
num_bufs * BLCKSZ + PG_IO_ALIGN_SIZE));
|
2006-12-27 23:31:54 +01:00
|
|
|
next_buf_in_block = 0;
|
|
|
|
num_bufs_in_block = num_bufs;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Allocate next buffer in current memory block */
|
|
|
|
this_buf = cur_block + next_buf_in_block * BLCKSZ;
|
|
|
|
next_buf_in_block++;
|
|
|
|
total_bufs_allocated++;
|
|
|
|
|
|
|
|
return (Block) this_buf;
|
|
|
|
}
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
2014-06-20 11:06:42 +02:00
|
|
|
* CheckForLocalBufferLeaks - ensure this backend holds no local buffer pins
|
2000-11-30 20:03:26 +01:00
|
|
|
*
|
2019-07-01 03:00:23 +02:00
|
|
|
* This is just like CheckForBufferLeaks(), but for local buffers.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2014-06-20 11:06:42 +02:00
|
|
|
static void
|
|
|
|
CheckForLocalBufferLeaks(void)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2004-10-16 20:57:26 +02:00
|
|
|
#ifdef USE_ASSERT_CHECKING
|
2014-06-20 11:06:42 +02:00
|
|
|
if (LocalRefCount)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2013-03-15 17:26:26 +01:00
|
|
|
int RefCountErrors = 0;
|
2005-08-08 21:44:22 +02:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < NLocBuffer; i++)
|
|
|
|
{
|
2013-03-15 17:26:26 +01:00
|
|
|
if (LocalRefCount[i] != 0)
|
|
|
|
{
|
|
|
|
Buffer b = -i - 1;
|
Make ResourceOwners more easily extensible.
2023-11-08 12:30:50 +01:00
|
|
|
char *s;
|
|
|
|
|
|
|
|
s = DebugPrintBufferRefcount(b);
|
|
|
|
elog(WARNING, "local buffer refcount leak: %s", s);
|
|
|
|
pfree(s);
|
2013-03-15 17:26:26 +01:00
|
|
|
|
|
|
|
RefCountErrors++;
|
|
|
|
}
|
2005-08-08 21:44:22 +02:00
|
|
|
}
|
2013-03-15 17:26:26 +01:00
|
|
|
Assert(RefCountErrors == 0);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
2004-10-16 20:57:26 +02:00
|
|
|
#endif
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
2005-03-18 17:16:09 +01:00
|
|
|
|
2014-06-20 11:06:42 +02:00
|
|
|
/*
|
|
|
|
* AtEOXact_LocalBuffers - clean up at end of transaction.
|
|
|
|
*
|
|
|
|
* This is just like AtEOXact_Buffers, but for local buffers.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
AtEOXact_LocalBuffers(bool isCommit)
|
|
|
|
{
|
|
|
|
CheckForLocalBufferLeaks();
|
|
|
|
}
|
|
|
|
|
2005-03-18 17:16:09 +01:00
|
|
|
/*
|
|
|
|
* AtProcExit_LocalBuffers - ensure we have dropped pins during backend exit.
|
|
|
|
*
|
2014-06-20 11:06:42 +02:00
|
|
|
* This is just like AtProcExit_Buffers, but for local buffers.
|
2005-03-18 17:16:09 +01:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
AtProcExit_LocalBuffers(void)
|
|
|
|
{
|
2014-06-20 11:06:42 +02:00
|
|
|
/*
|
|
|
|
* We shouldn't be holding any remaining pins; if we are, and assertions
|
2022-07-12 16:26:48 +02:00
|
|
|
* aren't enabled, we'll fail later in DropRelationBuffers while trying to
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
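For orientation, a hedged sketch of the renamed types; field names follow
the commit message, and exact definitions may differ.

typedef Oid RelFileNumber;      /* just the number: sense (1) above */

typedef struct RelFileLocator   /* everything needed to locate the files: (2) */
{
    Oid             spcOid;     /* tablespace, formerly spcNode */
    Oid             dbOid;      /* database, formerly dbNode */
    RelFileNumber   relNumber;  /* relation file number, formerly relNode */
} RelFileLocator;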
|
|
|
* drop the temp rels.
|
2014-06-20 11:06:42 +02:00
|
|
|
*/
|
|
|
|
CheckForLocalBufferLeaks();
|
2005-03-18 17:16:09 +01:00
|
|
|
}
|