/*-------------------------------------------------------------------------
 *
 * buf_internals.h
 *	  Internal definitions for buffer manager and the buffer replacement
 *	  strategy.
 *
 *
 * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/storage/buf_internals.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H

#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
#include "storage/spin.h"
#include "utils/relcache.h"
#include "utils/resowner.h"

/*
 * Buffer state is a single 32-bit variable in which the following data are
 * combined:
 *
 * - 18 bits refcount
 * - 4 bits usage count
 * - 10 bits of flags
 *
 * Combining these values allows us to perform some operations without
 * locking the buffer header, by modifying them together with a CAS loop.
 *
 * The definition of buffer state components is below.
 */
#define BUF_REFCOUNT_ONE 1
#define BUF_REFCOUNT_MASK ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK 0x003C0000U
#define BUF_USAGECOUNT_ONE (1U << 18)
#define BUF_USAGECOUNT_SHIFT 18
#define BUF_FLAG_MASK 0xFFC00000U

/* Get refcount and usagecount from buffer state */
#define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)

/*
 * Flags for buffer descriptors
 *
 * Note: BM_TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED				(1U << 22)	/* buffer header is locked */
#define BM_DIRTY				(1U << 23)	/* data needs writing */
#define BM_VALID				(1U << 24)	/* data is valid */
#define BM_TAG_VALID			(1U << 25)	/* tag is assigned */
#define BM_IO_IN_PROGRESS		(1U << 26)	/* read or write in progress */
#define BM_IO_ERROR				(1U << 27)	/* previous I/O failed */
#define BM_JUST_DIRTIED			(1U << 28)	/* dirtied since write started */
#define BM_PIN_COUNT_WAITER		(1U << 29)	/* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED	(1U << 30)	/* must write for checkpoint */
#define BM_PERMANENT			(1U << 31)	/* permanent buffer (not unlogged,
											 * or init fork) */

/*
 * The maximum allowed value of usage_count represents a tradeoff between
 * accuracy and speed of the clock-sweep buffer management algorithm. A
 * large value (comparable to NBuffers) would approximate LRU semantics.
 * But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
 * clock sweeps to find a free buffer, so in practice we don't want the
 * value to be very large.
 */
#define BM_MAX_USAGE_COUNT	5

/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries. It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel). The storage manager must be able to cope anyway.
 *
 * Note: if there are any pad bytes in the struct, InitBufferTag will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
	Oid			spcOid;			/* tablespace oid */
	Oid			dbOid;			/* database oid */
	RelFileNumber relNumber;	/* relation file number */
	ForkNumber	forkNum;		/* fork number */
	BlockNumber blockNum;		/* blknum relative to begin of reln */
} BufferTag;

static inline RelFileNumber
BufTagGetRelNumber(const BufferTag *tag)
{
	return tag->relNumber;
}

static inline ForkNumber
BufTagGetForkNum(const BufferTag *tag)
{
	return tag->forkNum;
}

static inline void
BufTagSetRelForkDetails(BufferTag *tag, RelFileNumber relnumber,
						ForkNumber forknum)
{
	tag->relNumber = relnumber;
	tag->forkNum = forknum;
}

static inline RelFileLocator
BufTagGetRelFileLocator(const BufferTag *tag)
{
	RelFileLocator rlocator;

	rlocator.spcOid = tag->spcOid;
	rlocator.dbOid = tag->dbOid;
	rlocator.relNumber = BufTagGetRelNumber(tag);

	return rlocator;
}

static inline void
ClearBufferTag(BufferTag *tag)
{
	tag->spcOid = InvalidOid;
	tag->dbOid = InvalidOid;
	BufTagSetRelForkDetails(tag, InvalidRelFileNumber, InvalidForkNumber);
	tag->blockNum = InvalidBlockNumber;
}

static inline void
InitBufferTag(BufferTag *tag, const RelFileLocator *rlocator,
			  ForkNumber forkNum, BlockNumber blockNum)
{
	tag->spcOid = rlocator->spcOid;
	tag->dbOid = rlocator->dbOid;
	BufTagSetRelForkDetails(tag, rlocator->relNumber, forkNum);
	tag->blockNum = blockNum;
}

static inline bool
BufferTagsEqual(const BufferTag *tag1, const BufferTag *tag2)
{
	return (tag1->spcOid == tag2->spcOid) &&
		(tag1->dbOid == tag2->dbOid) &&
		(tag1->relNumber == tag2->relNumber) &&
		(tag1->blockNum == tag2->blockNum) &&
		(tag1->forkNum == tag2->forkNum);
}

static inline bool
BufTagMatchesRelFileLocator(const BufferTag *tag,
							const RelFileLocator *rlocator)
{
	return (tag->spcOid == rlocator->spcOid) &&
		(tag->dbOid == rlocator->dbOid) &&
		(BufTagGetRelNumber(tag) == rlocator->relNumber);
}

/*
 * The shared buffer mapping table is partitioned to reduce contention.
 * To determine which partition lock a given tag requires, compute the tag's
 * hash code with BufTableHashCode(), then apply BufMappingPartitionLock().
 * NB: NUM_BUFFER_PARTITIONS must be a power of 2!
 */
static inline uint32
BufTableHashPartition(uint32 hashcode)
{
	return hashcode % NUM_BUFFER_PARTITIONS;
}

static inline LWLock *
BufMappingPartitionLock(uint32 hashcode)
{
	return &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET +
							BufTableHashPartition(hashcode)].lock;
}

static inline LWLock *
BufMappingPartitionLockByIndex(uint32 index)
{
	return &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + index].lock;
}

/*
 * BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * tag, state or wait_backend_pgprocno fields. In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable. This layout allows us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount. buf_id field never changes
 * after initialization, so does not need locking. freeNext is protected by
 * the buffer_strategy_lock not buffer header lock. The LWLock can take care
 * of itself. The buffer header lock is *not* used to control access to the
 * data in the buffer!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held. Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag). On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which ensures that BM_LOCKED flag
 * is not set. Atomic increment/decrement, OR/AND etc. are not allowed.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned. Hence, a backend may need to wait for all other pins
 * to go away. This is signaled by storing its own pgprocno into
 * wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER. At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either. To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members. Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 *
 * Per-buffer I/O condition variables are currently kept outside this struct in
 * a separate array. They could be moved in here and still fit within that
 * limit on common systems, but for now that is not done.
 */
|
Align buffer descriptors to cache line boundaries.
Benchmarks have shown that aligning the buffer descriptor array to
cache lines is important for scalability; especially on bigger,
multi-socket, machines.
Currently the array sometimes already happens to be aligned by
happenstance, depending how large previous shared memory allocations
were. That can lead to wildly varying performance results after minor
configuration changes.
In addition to aligning the start of the descriptor array, also force the
size of individual descriptors to be of a common cache line size (64
bytes). That happens to already be the case on 64bit platforms, but
this way we can change the struct BufferDesc more easily.
As the alignment primarily matters in highly concurrent workloads
which probably all are 64bit these days, and the space wastage of
element alignment would be a bit more noticeable on 32bit systems, we
don't force the stride to be cacheline sized on 32bit platforms for
now. If somebody does actual performance testing, we can reevaluate
that decision by changing the definition of BUFFERDESC_PADDED_SIZE.
Discussion: 20140202151319.GD32123@awork2.anarazel.de
Per discussion with Bruce Momjian, Tom Lane, Robert Haas, and Peter
Geoghegan.
2015-01-29 17:49:03 +01:00
|
|
|
typedef struct BufferDesc
|
1996-08-28 03:59:28 +02:00
|
|
|
{
|
2005-03-04 21:21:07 +01:00
|
|
|
BufferTag tag; /* ID of page contained in buffer */
|
|
|
|
int buf_id; /* buffer's index number (from 0) */
|
2016-04-11 05:12:32 +02:00
|
|
|
|
|
|
|
/* state of the tag, containing flags, refcount and usagecount */
|
|
|
|
pg_atomic_uint32 state;
|
|
|
|
|
2021-12-16 00:40:15 +01:00
|
|
|
int wait_backend_pgprocno; /* backend of pin-count waiter */
|
2005-03-04 21:21:07 +01:00
|
|
|
int freeNext; /* link in freelist chain */
|
2015-12-15 19:32:54 +01:00
|
|
|
LWLock content_lock; /* to lock access to buffer contents */
|
1999-11-21 20:56:12 +01:00
|
|
|
} BufferDesc;
|
1996-08-28 03:59:28 +02:00
|
|
|
|
2015-01-29 17:49:03 +01:00
|
|
|
/*
|
|
|
|
* Concurrent access to buffer headers has proven to be more efficient if
|
|
|
|
* they're cache line aligned. So we force the start of the BufferDescriptors
|
|
|
|
* array to be on a cache line boundary and force the elements to be cache
|
|
|
|
* line sized.
|
|
|
|
*
|
|
|
|
* XXX: As this primarily matters in highly concurrent workloads which
|
|
|
|
* probably all are 64bit these days, and the space wastage would be a bit
|
|
|
|
* more noticeable on 32bit systems, we don't force the stride to be cache
|
|
|
|
* line sized on those. If somebody does actual performance testing, we can
|
|
|
|
* reevaluate.
|
|
|
|
*
|
|
|
|
* Note that local buffer descriptors aren't forced to be aligned - as there's
|
|
|
|
* no concurrent access to those it's unlikely to be beneficial.
|
|
|
|
*
|
2020-09-04 19:27:52 +02:00
|
|
|
* We use a 64-byte cache line size here, because that's the most common
|
2015-01-29 17:49:03 +01:00
|
|
|
* size. Making it bigger would be a waste of memory. Even if running on a
|
|
|
|
* platform with either 32 or 128 byte line sizes, it's good to align to
|
|
|
|
* boundaries and avoid false sharing.
|
|
|
|
*/
|
|
|
|
#define BUFFERDESC_PAD_TO_SIZE (SIZEOF_VOID_P == 8 ? 64 : 1)
|
|
|
|
|
|
|
|
typedef union BufferDescPadded
|
|
|
|
{
|
|
|
|
BufferDesc bufferdesc;
|
|
|
|
char pad[BUFFERDESC_PAD_TO_SIZE];
|
|
|
|
} BufferDescPadded;
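A quick way to see what the padded union buys: a standalone imitation (struct contents and names hypothetical) whose union strides at exactly 64 bytes, so consecutive array elements never share a line on 64-byte-cache-line hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHELINE 64		/* assumed cache line size, as in the header */

/* Stand-in descriptor; what matters is only that it fits in one line. */
typedef struct DemoDesc
{
	uint32_t	tag[5];
	int			buf_id;
	uint32_t	state;
	int			wait_pgprocno;
	int			freeNext;
} DemoDesc;

/* The char array forces sizeof(DemoDescPadded) up to a full cache line. */
typedef union DemoDescPadded
{
	DemoDesc	desc;
	char		pad[DEMO_CACHELINE];
} DemoDescPadded;

/* Element i of a DemoDescPadded array starts exactly i cache lines in. */
static inline size_t
demo_stride(void)
{
	return sizeof(DemoDescPadded);
}
```

The union approach pads without any compiler-specific alignment attributes; aligning the *start* of the array is a separate step done at allocation time.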
|
|
|
|
|
Allow to trigger kernel writeback after a configurable number of writes.
Currently writes to the main data files of postgres all go through the
OS page cache. This means that some operating systems can end up
collecting a large number of dirty buffers in their respective page
caches. When these dirty buffers are flushed to storage rapidly, be it
because of fsync(), timeouts, or dirty ratios, latency for other reads
and writes can increase massively. This is the primary reason for
regular massive stalls observed in real world scenarios and artificial
benchmarks; on rotating disks stalls on the order of hundreds of seconds
have been observed.
On linux it is possible to control this by reducing the global dirty
limits significantly, reducing the above problem. But global
configuration is rather problematic because it'll affect other
applications; also PostgreSQL itself doesn't generally want this
behavior, e.g. for temporary files it's undesirable.
Several operating systems allow some control over the kernel page
cache. Linux has sync_file_range(2), several posix systems have msync(2)
and posix_fadvise(2). sync_file_range(2) is preferable because it
requires no special setup, whereas msync() requires the to-be-flushed
range to be mmap'ed. For the purpose of flushing dirty data
posix_fadvise(2) is the worst alternative, as flushing dirty data is
just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
from the page cache. Thus the feature is enabled by default only on
linux, but can be enabled on all systems that have any of the above
APIs.
While desirable and likely possible this patch does not contain an
implementation for windows.
With the infrastructure added, writes made via checkpointer, bgwriter
and normal user backends can be flushed after a configurable number of
writes. Each of these sources of writes is controlled by a separate GUC,
checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
respectively; they're separate because the number of flushes that is
beneficial differs between them, and because the performance considerations of
controlled flushing for each of these are different.
A later patch will add checkpoint sorting - after that flushes from the
checkpoint will almost always be desirable. Bgwriter flushes are most of
the time going to be random, which are slow on lots of storage hardware.
Flushing in backends works well if the storage and bgwriter can keep up,
but if not it can have negative consequences. This patch is likely to
have negative performance consequences without checkpoint sorting, but
unfortunately so does sorting without flush control.
Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
2016-02-19 21:13:05 +01:00
|
|
|
/*
|
|
|
|
* The PendingWriteback & WritebackContext structures are used to keep
|
|
|
|
* information about pending flush requests to be issued to the OS.
|
|
|
|
*/
|
|
|
|
typedef struct PendingWriteback
|
|
|
|
{
|
|
|
|
/* could store different types of pending flushes here */
|
|
|
|
BufferTag tag;
|
|
|
|
} PendingWriteback;
|
|
|
|
|
|
|
|
/* struct forward declared in bufmgr.h */
|
|
|
|
typedef struct WritebackContext
|
|
|
|
{
|
|
|
|
/* pointer to the max number of writeback requests to coalesce */
|
|
|
|
int *max_pending;
|
|
|
|
|
|
|
|
/* current number of pending writeback requests */
|
|
|
|
int nr_pending;
|
|
|
|
|
|
|
|
/* pending requests */
|
|
|
|
PendingWriteback pending_writebacks[WRITEBACK_MAX_PENDING_FLUSHES];
|
|
|
|
} WritebackContext;
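The intended usage pattern is: accumulate tags in `pending_writebacks` until the configured `*max_pending` limit is reached, then issue the whole batch to the OS at once. A simplified, self-contained sketch of that accumulate-then-flush shape; every name here is a stand-in, not bufmgr's API.

```c
#define DEMO_MAX_PENDING_FLUSHES 8

typedef struct DemoTag
{
	unsigned	blockNum;
} DemoTag;

typedef struct DemoWbContext
{
	int		   *max_pending;	/* points at a GUC-style setting */
	int			nr_pending;		/* currently queued requests */
	DemoTag		pending[DEMO_MAX_PENDING_FLUSHES];
} DemoWbContext;

static int	demo_issued;		/* count of flush batches "sent to the OS" */

static void
demo_issue_pending(DemoWbContext *ctx)
{
	if (ctx->nr_pending > 0)
		demo_issued++;			/* real code would sort, coalesce, and flush */
	ctx->nr_pending = 0;
}

/* Queue one tag; flush the batch once the configured limit is reached. */
static void
demo_schedule(DemoWbContext *ctx, DemoTag tag)
{
	ctx->pending[ctx->nr_pending++] = tag;
	if (ctx->nr_pending >= *ctx->max_pending ||
		ctx->nr_pending >= DEMO_MAX_PENDING_FLUSHES)
		demo_issue_pending(ctx);
}
```

Holding `max_pending` as a pointer rather than a value mirrors the struct above: the limit can track a setting that changes after the context is initialized.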
|
|
|
|
|
2005-03-04 21:21:07 +01:00
|
|
|
/* in buf_init.c */
|
2015-01-29 17:49:03 +01:00
|
|
|
extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
|
2022-07-27 19:54:37 +02:00
|
|
|
extern PGDLLIMPORT ConditionVariableMinimallyPadded *BufferIOCVArray;
|
2016-02-19 21:13:05 +01:00
|
|
|
extern PGDLLIMPORT WritebackContext BackendWritebackContext;
|
2004-04-20 01:27:17 +02:00
|
|
|
|
2005-02-04 00:29:19 +01:00
|
|
|
/* in localbuf.c */
|
2022-04-08 14:16:38 +02:00
|
|
|
extern PGDLLIMPORT BufferDesc *LocalBufferDescriptors;
|
2004-08-29 07:07:03 +02:00
|
|
|
|
2022-07-27 19:54:37 +02:00
|
|
|
|
|
|
|
static inline BufferDesc *
|
|
|
|
GetBufferDescriptor(uint32 id)
|
|
|
|
{
|
|
|
|
return &(BufferDescriptors[id]).bufferdesc;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline BufferDesc *
|
|
|
|
GetLocalBufferDescriptor(uint32 id)
|
|
|
|
{
|
|
|
|
return &LocalBufferDescriptors[id];
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline Buffer
|
|
|
|
BufferDescriptorGetBuffer(const BufferDesc *bdesc)
|
|
|
|
{
|
|
|
|
return (Buffer) (bdesc->buf_id + 1);
|
|
|
|
}
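Note the off-by-one convention in the function above: descriptor indexes (`buf_id`) are 0-based, while the `Buffer` values handed to callers are 1-based so that 0 can serve as the invalid-buffer sentinel. A tiny round-trip sketch (helper names hypothetical):

```c
typedef int DemoBuffer;			/* stand-in for Buffer; 0 means "no buffer" */

/* Descriptor index N maps to Buffer N+1, keeping 0 free as the sentinel. */
static inline DemoBuffer
demo_buffer_from_id(int buf_id)
{
	return (DemoBuffer) (buf_id + 1);
}

static inline int
demo_id_from_buffer(DemoBuffer buffer)
{
	return buffer - 1;
}
```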
|
|
|
|
|
|
|
|
static inline ConditionVariable *
|
|
|
|
BufferDescriptorGetIOCV(const BufferDesc *bdesc)
|
|
|
|
{
|
|
|
|
return &(BufferIOCVArray[bdesc->buf_id]).cv;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline LWLock *
|
|
|
|
BufferDescriptorGetContentLock(const BufferDesc *bdesc)
|
|
|
|
{
|
|
|
|
return (LWLock *) (&bdesc->content_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The freeNext field is either the index of the next freelist entry,
|
|
|
|
* or one of these special values:
|
|
|
|
*/
|
|
|
|
#define FREENEXT_END_OF_LIST (-1)
|
|
|
|
#define FREENEXT_NOT_IN_LIST (-2)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
|
|
|
|
* not apply these to local buffers!
|
|
|
|
*/
|
|
|
|
extern uint32 LockBufHdr(BufferDesc *desc);
|
|
|
|
|
|
|
|
static inline void
|
|
|
|
UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
|
|
|
|
{
|
|
|
|
pg_write_barrier();
|
|
|
|
pg_atomic_write_u32(&desc->state, buf_state & (~BM_LOCKED));
|
|
|
|
}
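LockBufHdr()'s definition lives in bufmgr.c; the core idea is to atomically set the BM_LOCKED bit and spin until we are the caller that flipped it from 0 to 1. A rough single-file sketch of that idea using C11 atomics; the flag position and names are illustrative, and the real code inserts a spin-delay between attempts.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_BM_LOCKED (1U << 22)	/* hypothetical flag-bit position */

/* Spin until our fetch-or is the one that sets DEMO_BM_LOCKED; return the
 * state value as of acquiring the lock (with the lock bit included). */
static uint32_t
demo_lock_buf_hdr(_Atomic uint32_t *state)
{
	uint32_t	old_state;

	for (;;)
	{
		old_state = atomic_fetch_or(state, DEMO_BM_LOCKED);
		if (!(old_state & DEMO_BM_LOCKED))
			break;				/* the bit was clear before: we own the lock */
		/* real code would wait in a spin-delay loop here */
	}
	return old_state | DEMO_BM_LOCKED;
}

/* Release-store the caller-supplied state with the lock bit cleared,
 * mirroring the barrier-then-write shape of UnlockBufHdr() above. */
static void
demo_unlock_buf_hdr(_Atomic uint32_t *state, uint32_t buf_state)
{
	atomic_store_explicit(state, buf_state & ~DEMO_BM_LOCKED,
						  memory_order_release);
}
```

Returning the observed state from the lock call lets the caller inspect and modify refcount/usage bits while holding the lock, then pass the final value to the unlock call.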
|
|
|
|
|
2016-02-19 21:17:51 +01:00
|
|
|
/* in bufmgr.c */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Structure to sort buffers per file on checkpoints.
|
|
|
|
*
|
|
|
|
* This structure is allocated per buffer in shared memory, so it should be
|
|
|
|
* kept as small as possible.
|
|
|
|
*/
|
|
|
|
typedef struct CkptSortItem
|
|
|
|
{
|
|
|
|
Oid tsId;
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from their
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
RelFileNumber relNumber;
|
2016-02-19 21:17:51 +01:00
|
|
|
ForkNumber forkNum;
|
|
|
|
BlockNumber blockNum;
|
|
|
|
int buf_id;
|
|
|
|
} CkptSortItem;
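Sorting these items by tablespace, relation, fork, and block number lets a checkpoint write each file mostly sequentially. A sketch of such a comparator with stand-in types (the real comparator is private to bufmgr.c, and the ID types here are plain integers for illustration):

```c
#include <stdlib.h>

/* Illustrative stand-in for CkptSortItem. */
typedef struct DemoCkptItem
{
	unsigned	tsId;
	unsigned	relNumber;
	int			forkNum;
	unsigned	blockNum;
} DemoCkptItem;

/* qsort comparator: order by tablespace, then relation, fork, block, so a
 * sorted checkpoint visits each file's blocks in ascending order. */
static int
demo_ckpt_cmp(const void *pa, const void *pb)
{
	const DemoCkptItem *a = pa;
	const DemoCkptItem *b = pb;

	if (a->tsId != b->tsId)
		return a->tsId < b->tsId ? -1 : 1;
	if (a->relNumber != b->relNumber)
		return a->relNumber < b->relNumber ? -1 : 1;
	if (a->forkNum != b->forkNum)
		return a->forkNum < b->forkNum ? -1 : 1;
	if (a->blockNum != b->blockNum)
		return a->blockNum < b->blockNum ? -1 : 1;
	return 0;
}
```

Comparing field by field rather than subtracting avoids overflow and keeps the comparator valid for the full unsigned range.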
|
|
|
|
|
2022-04-08 14:16:38 +02:00
|
|
|
extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
|
2002-08-06 04:36:35 +02:00
|
|
|
|
Make ResourceOwners more easily extensible.
Instead of having a separate array/hash for each resource kind, use a
single array and hash to hold all kinds of resources. This makes it
possible to introduce new resource "kinds" without having to modify
the ResourceOwnerData struct. In particular, this makes it possible
for extensions to register custom resource kinds.
The old approach was to have a small array of resources of each kind,
and if it fills up, switch to a hash table. The new approach also uses
an array and a hash, but now the array and the hash are used at the
same time. The array is used to hold the recently added resources, and
when it fills up, they are moved to the hash. This keeps the access to
recent entries fast, even when there are a lot of long-held resources.
All the resource-specific ResourceOwnerEnlarge*(),
ResourceOwnerRemember*(), and ResourceOwnerForget*() functions have
been replaced with three generic functions that take resource kind as
argument. For convenience, we still define resource-specific wrapper
macros around the generic functions with the old names, but they are
now defined in the source files that use those resource kinds.
The release callback no longer needs to call ResourceOwnerForget on
the resource being released. ResourceOwnerRelease unregisters the
resource from the owner before calling the callback. That needed some
changes in bufmgr.c and some other files, where releasing the
resources previously always called ResourceOwnerForget.
Each resource kind specifies a release priority, and
ResourceOwnerReleaseAll releases the resources in priority order. To
make that possible, we have to restrict what you can do between
phases. After calling ResourceOwnerRelease(), you are no longer
allowed to remember any more resources in it or to forget any
previously remembered resources by calling ResourceOwnerForget. There
was one case where that was done previously. At subtransaction commit,
AtEOSubXact_Inval() would handle the invalidation messages and call
RelationFlushRelation(), which temporarily increased the reference
count on the relation being flushed. We now switch to the parent
subtransaction's resource owner before calling AtEOSubXact_Inval(), so
that there is a valid ResourceOwner to temporarily hold that relcache
reference.
Other end-of-xact routines make similar calls to AtEOXact_Inval()
between release phases, but I didn't see any regression test failures
from those, so I'm not sure if they could reach a codepath that needs
remembering extra resources.
There were two exceptions to how the resource leak WARNINGs on commit
were printed previously: llvmjit silently released the context without
printing the warning, and a leaked buffer io triggered a PANIC. Now
everything prints a WARNING, including those cases.
Add tests in src/test/modules/test_resowner.
Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud
Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu
Reviewed-by: Peter Eisentraut, Andres Freund
Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
2023-11-08 12:30:50 +01:00
|
|
|
/* ResourceOwner callbacks to hold buffer I/Os and pins */
|
|
|
|
extern const ResourceOwnerDesc buffer_io_resowner_desc;
|
|
|
|
extern const ResourceOwnerDesc buffer_pin_resowner_desc;
|
|
|
|
|
|
|
|
/* Convenience wrappers over ResourceOwnerRemember/Forget */
|
|
|
|
static inline void
|
|
|
|
ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer)
|
|
|
|
{
|
|
|
|
ResourceOwnerRemember(owner, Int32GetDatum(buffer), &buffer_pin_resowner_desc);
|
|
|
|
}
|
|
|
|
static inline void
|
|
|
|
ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
|
|
|
|
{
|
|
|
|
ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_pin_resowner_desc);
|
|
|
|
}
|
|
|
|
static inline void
|
|
|
|
ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
|
|
|
|
{
|
|
|
|
ResourceOwnerRemember(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
|
|
|
|
}
|
|
|
|
static inline void
|
|
|
|
ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
|
|
|
|
{
|
|
|
|
ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
|
|
|
|
}
|
|
|
|
|
1996-08-28 03:59:28 +02:00
|
|
|
/*
|
2016-02-19 21:13:05 +01:00
|
|
|
* Internal buffer management routines
|
1996-08-28 03:59:28 +02:00
|
|
|
*/
|
2016-02-19 21:13:05 +01:00
|
|
|
/* bufmgr.c */
|
2016-07-01 23:27:53 +02:00
|
|
|
extern void WritebackContextInit(WritebackContext *context, int *max_pending);
|
2023-05-17 20:18:35 +02:00
|
|
|
extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
|
|
|
|
extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
|
|
|
|
IOContext io_context, BufferTag *tag);
|
1996-08-28 03:59:28 +02:00
|
|
|
|
2004-04-20 01:27:17 +02:00
|
|
|
/* freelist.c */
|
2023-04-13 19:15:20 +02:00
|
|
|
extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
|
Allow Pin/UnpinBuffer to operate in a lockfree manner.
Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).
To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, and refcount into a single 32bit atomic variable;
that allows them to be manipulated together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (a limit that
could be lifted by using a 64bit field, but it's not a realistic
configuration atm).
As not all operations can easily be implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot paths. There are some additional operations which could be done without
the lock but aren't in this patch; the most important places are
covered.
As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It already has two
users, and more are coming up; there's a followup patch for lwlock.c at
least.
This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me. Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.
On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.
Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell
2016-04-11 05:12:32 +02:00
|
|
|
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
|
2023-02-10 07:22:26 +01:00
|
|
|
uint32 *buf_state, bool *from_ring);
|
2015-11-17 00:50:06 +01:00
|
|
|
extern void StrategyFreeBuffer(BufferDesc *buf);
|
2007-05-30 22:12:03 +02:00
|
|
|
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
|
2023-02-10 07:22:26 +01:00
|
|
|
BufferDesc *buf, bool from_ring);
|
2007-05-30 22:12:03 +02:00
|
|
|
|
2007-09-25 22:03:38 +02:00
|
|
|
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
|
2014-12-25 18:24:20 +01:00
|
|
|
extern void StrategyNotifyBgWriter(int bgwprocno);
|
Improve control logic for bgwriter hibernation mode.
Commit 6d90eaaa89a007e0d365f49d6436f35d2392cfeb added a hibernation mode
to the bgwriter to reduce the server's idle-power consumption. However,
its interaction with the detailed behavior of BgBufferSync's feedback
control loop wasn't very well thought out. That control loop depends
primarily on the rate of buffer allocation, not the rate of buffer
dirtying, so the hibernation mode has to be designed to operate only when
no new buffer allocations are happening. Also, the check for whether the
system is effectively idle was not quite right and would fail to detect
a constant low level of activity, thus allowing the bgwriter to go into
hibernation mode in a way that would let the cycle time vary quite a bit,
possibly further confusing the feedback loop. To fix, move the wakeup
support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into
StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode
unless no buffer allocations have happened recently.
In addition, fix the delaying logic to remove the problem of possibly not
responding to signals promptly, which was basically caused by trying to use
the process latch's is_set flag for multiple purposes. I can't prove it
but I'm suspicious that that hack was responsible for the intermittent
"postmaster does not shut down" failures we've been seeing in the buildfarm
lately. In any case it did nothing to improve the readability or
robustness of the code.
In passing, express the hibernation sleep time as a multiplier on
BgWriterDelay, not a constant. I'm not sure whether there's any value in
exposing the longer sleep time as an independently configurable setting,
but we can at least make it act like this for little extra code.
2012-05-10 05:36:01 +02:00
|
|
|
|
2005-08-21 01:26:37 +02:00
|
|
|
extern Size StrategyShmemSize(void);
|
2003-11-13 15:57:15 +01:00
|
|
|
extern void StrategyInitialize(bool init);
|
2017-08-21 20:43:00 +02:00
|
|
|
extern bool have_free_buffer(void);
|
1996-08-28 03:59:28 +02:00
|
|
|
|
|
|
|
/* buf_table.c */
|
2005-08-21 01:26:37 +02:00
|
|
|
extern Size BufTableShmemSize(int size);
|
2003-11-13 15:57:15 +01:00
|
|
|
extern void InitBufTable(int size);
|
2006-07-23 05:07:58 +02:00
|
|
|
extern uint32 BufTableHashCode(BufferTag *tagPtr);
|
|
|
|
extern int BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
|
|
|
|
extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
|
|
|
|
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
|
1996-08-28 03:59:28 +02:00
|
|
|
|
|
|
|
/* localbuf.c */
|
2023-04-05 19:42:17 +02:00
|
|
|
extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);
|
|
|
|
extern void UnpinLocalBuffer(Buffer buffer);
|
Make ResourceOwners more easily extensible.
Instead of having a separate array/hash for each resource kind, use a
single array and hash to hold all kinds of resources. This makes it
possible to introduce new resource "kinds" without having to modify
the ResourceOwnerData struct. In particular, this makes it possible
for extensions to register custom resource kinds.
The old approach was to have a small array of resources of each kind,
and if it fills up, switch to a hash table. The new approach also uses
an array and a hash, but now the array and the hash are used at the
same time. The array is used to hold the recently added resources, and
when it fills up, they are moved to the hash. This keeps the access to
recent entries fast, even when there are a lot of long-held resources.
All the resource-specific ResourceOwnerEnlarge*(),
ResourceOwnerRemember*(), and ResourceOwnerForget*() functions have
been replaced with three generic functions that take resource kind as
argument. For convenience, we still define resource-specific wrapper
macros around the generic functions with the old names, but they are
now defined in the source files that use those resource kinds.
The release callback no longer needs to call ResourceOwnerForget on
the resource being released. ResourceOwnerRelease unregisters the
resource from the owner before calling the callback. That needed some
changes in bufmgr.c and some other files, where releasing the
resources previously always called ResourceOwnerForget.
Each resource kind specifies a release priority, and
ResourceOwnerReleaseAll releases the resources in priority order. To
make that possible, we have to restrict what you can do between
phases. After calling ResourceOwnerRelease(), you are no longer
allowed to remember any more resources in it or to forget any
previously remembered resources by calling ResourceOwnerForget. There
was one case where that was done previously. At subtransaction commit,
AtEOSubXact_Inval() would handle the invalidation messages and call
RelationFlushRelation(), which temporarily increased the reference
count on the relation being flushed. We now switch to the parent
subtransaction's resource owner before calling AtEOSubXact_Inval(), so
that there is a valid ResourceOwner to temporarily hold that relcache
reference.
Other end-of-xact routines make similar calls to AtEOXact_Inval()
between release phases, but I didn't see any regression test failures
from those, so I'm not sure if they could reach a codepath that needs
remembering extra resources.
There were two exceptions to how the resource leak WARNINGs on commit
were printed previously: llvmjit silently released the context without
printing the warning, and a leaked buffer io triggered a PANIC. Now
everything prints a WARNING, including those cases.
Add tests in src/test/modules/test_resowner.
Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud
Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu
Reviewed-by: Peter Eisentraut, Andres Freund
Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
2023-11-08 12:30:50 +01:00
|
|
|
extern void UnpinLocalBufferNoOwner(Buffer buffer);
|
2020-04-08 03:36:45 +02:00
|
|
|
extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
|
|
|
|
ForkNumber forkNum,
|
|
|
|
BlockNumber blockNum);
|
2009-01-12 06:10:45 +01:00
|
|
|
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
|
2023-03-31 04:22:40 +02:00
|
|
|
BlockNumber blockNum, bool *foundPtr);
|
2023-08-23 02:10:18 +02:00
|
|
|
extern BlockNumber ExtendBufferedRelLocal(BufferManagerRelation bmr,
|
bufmgr: Introduce infrastructure for faster relation extension
The primary bottlenecks for relation extension are:
1) The extension lock is held while acquiring a victim buffer for the new
page. Acquiring a victim buffer can require writing out the old page
contents including possibly needing to flush WAL.
2) When extending via ReadBuffer() et al, we write a zero page during the
extension, and then later write out the actual page contents. This can
nearly double the write rate.
3) The existing bulk relation extension infrastructure in hio.c just amortized
the cost of acquiring the relation extension lock, but none of the other
costs.
Unfortunately 1) cannot currently be addressed in a central manner as the
callers to ReadBuffer() need to acquire the extension lock. To address that,
this commit moves the responsibility for acquiring the extension lock
into bufmgr.c functions. That allows the relation extension lock to be
acquired for just the required time. This will also allow us to improve relation
extension further, without changing callers.
The reason we write all-zeroes pages during relation extension is that we hope
to get ENOSPC errors earlier that way (this largely works, except on CoW
filesystems). It is easier to handle out-of-space errors gracefully if the
page doesn't yet contain actual tuples. This commit addresses 2), by using the
recently introduced smgrzeroextend(), which extends the relation, without
dirtying the kernel page cache for all the extended pages.
To address 3), this commit introduces a function to extend a relation by
multiple blocks at a time.
There are three new exposed functions: ExtendBufferedRel() for extending the
relation by a single block, ExtendBufferedRelBy() to extend a relation by
multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up
to a certain size.
To avoid duplicating code between ReadBuffer(P_NEW) and the new functions,
ReadBuffer(P_NEW) now implements relation extension with
ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the
relation lock is already held.
Note that this commit does not yet lead to a meaningful performance or
scalability improvement - for that uses of ReadBuffer(P_NEW) will need to be
converted to ExtendBuffered*(), which will be done in subsequent commits.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-06 01:21:09 +02:00
|
|
|
ForkNumber fork,
|
|
|
|
uint32 flags,
|
|
|
|
uint32 extend_by,
|
|
|
|
BlockNumber extend_upto,
|
|
|
|
Buffer *buffers,
|
|
|
|
uint32 *extended_by);
|
2006-04-01 01:32:07 +02:00
|
|
|
extern void MarkLocalBufferDirty(Buffer buffer);
|
2022-07-12 16:26:48 +02:00
|
|
|
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
|
|
|
|
ForkNumber forkNum,
|
|
|
|
BlockNumber firstDelBlock);
|
|
|
|
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
|
2002-08-06 04:36:35 +02:00
|
|
|
extern void AtEOXact_LocalBuffers(bool isCommit);
|
2001-10-28 07:26:15 +01:00
|
|
|
|
1996-08-28 03:59:28 +02:00
|
|
|
#endif /* BUFMGR_INTERNALS_H */
|