postgresql/src/include/storage/s_lock.h

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

868 lines
25 KiB
C
Raw Normal View History

1997-09-18 16:21:02 +02:00
/*-------------------------------------------------------------------------
*
* s_lock.h
* Hardware-dependent implementation of spinlocks.
1997-09-18 16:21:02 +02:00
*
* NOTE: none of the macros in this file are intended to be called directly.
* Call them through the hardware-independent macros in spin.h.
1997-09-18 16:21:02 +02:00
*
* The following hardware-dependent macros must be provided for each
* supported platform:
*
* void S_INIT_LOCK(slock_t *lock)
* Initialize a spinlock (to the unlocked state).
*
* int S_LOCK(slock_t *lock)
* Acquire a spinlock, waiting if necessary.
* Time out and abort() if unable to acquire the lock in a
* "reasonable" amount of time --- typically ~ 1 minute.
* Should return number of "delays"; see s_lock.c
*
* void S_UNLOCK(slock_t *lock)
* Unlock a previously acquired lock.
*
* bool S_LOCK_FREE(slock_t *lock)
* Tests if the lock is free. Returns true if free, false if locked.
* This does *not* change the state of the lock.
*
* void SPIN_DELAY(void)
* Delay operation to occur inside spinlock wait loop.
*
* Note to implementors: there are default implementations for all these
* macros at the bottom of the file. Check if your platform can use
* these or needs to override them.
*
* Usually, S_LOCK() is implemented in terms of even lower-level macros
* TAS() and TAS_SPIN():
*
* int TAS(slock_t *lock)
* Atomic test-and-set instruction. Attempt to acquire the lock,
* but do *not* wait. Returns 0 if successful, nonzero if unable
* to acquire the lock.
*
* int TAS_SPIN(slock_t *lock)
* Like TAS(), but this version is used when waiting for a lock
* previously found to be contended. By default, this is the
* same as TAS(), but on some architectures it's better to poll a
* contended lock using an unlocked instruction and retry the
* atomic test-and-set only when it appears free.
1997-09-18 16:21:02 +02:00
*
* TAS() and TAS_SPIN() are NOT part of the API, and should never be called
* directly.
*
* CAUTION: on some platforms TAS() and/or TAS_SPIN() may sometimes report
* failure to acquire a lock even when the lock is not locked. For example,
* on Alpha TAS() will "fail" if interrupted. Therefore a retry loop must
* always be used, even if you are certain the lock is free.
*
* It is the responsibility of these macros to make sure that the compiler
* does not re-order accesses to shared memory to precede the actual lock
* acquisition, or follow the lock release. Prior to PostgreSQL 9.5, this
* was the caller's responsibility, which meant that callers had to use
* volatile-qualified pointers to refer to both the spinlock itself and the
* shared data being accessed within the spinlocked critical section. This
* was notationally awkward, easy to forget (and thus error-prone), and
* prevented some useful compiler optimizations. For these reasons, we
* now require that the macros themselves prevent compiler re-ordering,
* so that the caller doesn't need to take special precautions.
*
* On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
* S_UNLOCK() macros must further include hardware-level memory fence
* instructions to prevent similar re-ordering at the hardware level.
* TAS() and TAS_SPIN() must guarantee that loads and stores issued after
* the macro are not executed until the lock has been obtained. Conversely,
* S_UNLOCK() must guarantee that loads and stores issued before the macro
* have been executed before the lock is released.
*
* On most supported platforms, TAS() uses a tas() function written
* in assembly language to execute a hardware atomic-test-and-set
* instruction. Equivalent OS-supplied mutex routines could be used too.
*
* If no system-specific TAS() is available (ie, HAVE_SPINLOCKS is not
* defined), then we fall back on an emulation that uses SysV semaphores
* (see spin.c). This emulation will be MUCH MUCH slower than a proper TAS()
* implementation, because of the cost of a kernel call per lock or unlock.
* An old report is that Postgres spends around 40% of its time in semop(2)
* when using the SysV semaphore code.
1997-09-18 16:21:02 +02:00
*
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
2010-09-20 22:08:53 +02:00
* src/include/storage/s_lock.h
*
*-------------------------------------------------------------------------
1997-09-18 16:21:02 +02:00
*/
#ifndef S_LOCK_H
1997-09-18 16:21:02 +02:00
#define S_LOCK_H
#ifdef FRONTEND
#error "s_lock.h may not be included from frontend code"
#endif
#ifdef HAVE_SPINLOCKS /* skip spinlocks if requested */
#if defined(__GNUC__) || defined(__INTEL_COMPILER)
/*************************************************************************
* All the gcc inlines
* Gcc consistently defines the CPU as __cpu__.
* Other compilers use __cpu or __cpu__ so we test for both in those cases.
*/
/*----------
* Standard gcc asm format (assuming "volatile slock_t *lock"):
__asm__ __volatile__(
" instruction \n"
" instruction \n"
" instruction \n"
: "=r"(_res), "+m"(*lock) // return register, in/out lock value
: "r"(lock) // lock pointer, in input register
: "memory", "cc"); // show clobbered registers here
* The output-operands list (after first colon) should always include
* "+m"(*lock), whether or not the asm code actually refers to this
* operand directly. This ensures that gcc believes the value in the
* lock variable is used and set by the asm code. Also, the clobbers
* list (after third colon) should always include "memory"; this prevents
* gcc from thinking it can cache the values of shared-memory fields
* across the asm code. Add "cc" if your asm code changes the condition
* code register, and also list any temp registers the code uses.
*----------
*/
2005-12-17 21:15:43 +01:00
#ifdef __i386__ /* 32-bit i386 */
#define HAS_TEST_AND_SET
typedef unsigned char slock_t;
#define TAS(lock) tas(lock)
static __inline__ int
tas(volatile slock_t *lock)
{
slock_t _res = 1;
/*
* Use a non-locking test before asserting the bus lock. Note that the
* extra test appears to be a small loss on some x86 platforms and a small
* win on others; it's by no means clear that we should keep it.
*
* When this was last tested, we didn't have separate TAS() and TAS_SPIN()
* macros. Nowadays it probably would be better to do a non-locking test
* in TAS_SPIN() but not in TAS(), like on x86_64, but no-one's done the
* testing to verify that. Without some empirical evidence, better to
* leave it alone.
*/
__asm__ __volatile__(
" cmpb $0,%1 \n"
" jne 1f \n"
" lock \n"
" xchgb %0,%1 \n"
"1: \n"
: "+q"(_res), "+m"(*lock)
: /* no inputs */
: "memory", "cc");
return (int) _res;
}
#define SPIN_DELAY() spin_delay()
static __inline__ void
spin_delay(void)
{
/*
* This sequence is equivalent to the PAUSE instruction ("rep" is
* ignored by old IA32 processors if the following instruction is
* not a string operation); the IA-32 Architecture Software
* Developer's Manual, Vol. 3, Section 7.7.2 describes why using
* PAUSE in the inner loop of a spin lock is necessary for good
* performance:
*
* The PAUSE instruction improves the performance of IA-32
* processors supporting Hyper-Threading Technology when
* executing spin-wait loops and other routines where one
* thread is accessing a shared lock or semaphore in a tight
* polling loop. When executing a spin-wait loop, the
* processor can suffer a severe performance penalty when
* exiting the loop because it detects a possible memory order
* violation and flushes the core processor's pipeline. The
* PAUSE instruction provides a hint to the processor that the
* code sequence is a spin-wait loop. The processor uses this
* hint to avoid the memory order violation and prevent the
* pipeline flush. In addition, the PAUSE instruction
* de-pipelines the spin-wait loop to prevent it from
* consuming execution resources excessively.
*/
__asm__ __volatile__(
" rep; nop \n");
}
#endif /* __i386__ */
#ifdef __x86_64__ /* AMD Opteron, Intel EM64T */
#define HAS_TEST_AND_SET
typedef unsigned char slock_t;
#define TAS(lock) tas(lock)
/*
* On Intel EM64T, it's a win to use a non-locking test before the xchg proper,
* but only when spinning.
*
* See also Implementing Scalable Atomic Locks for Multi-Core Intel(tm) EM64T
* and IA32, by Michael Chynoweth and Mary R. Lee. As of this writing, it is
* available at:
* http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures
*/
#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
static __inline__ int
tas(volatile slock_t *lock)
{
slock_t _res = 1;
__asm__ __volatile__(
" lock \n"
" xchgb %0,%1 \n"
: "+q"(_res), "+m"(*lock)
: /* no inputs */
: "memory", "cc");
return (int) _res;
}
#define SPIN_DELAY() spin_delay()
static __inline__ void
spin_delay(void)
{
/*
* Adding a PAUSE in the spin delay loop is demonstrably a no-op on
* Opteron, but it may be of some use on EM64T, so we keep it.
*/
__asm__ __volatile__(
" rep; nop \n");
}
#endif /* __x86_64__ */
/*
* On ARM and ARM64, we use __sync_lock_test_and_set(int *, int) if available.
*
* We use the int-width variant of the builtin because it works on more chips
* than other widths.
*/
#if defined(__arm__) || defined(__arm) || defined(__aarch64__)
Add a basic atomic ops API abstracting away platform/architecture details. Several upcoming performance/scalability improvements require atomic operations. This new API avoids the need to splatter compiler and architecture dependent code over all the locations employing atomic ops. For several of the potential usages it'd be problematic to maintain both, a atomics using implementation and one using spinlocks or similar. In all likelihood one of the implementations would not get tested regularly under concurrency. To avoid that scenario the new API provides a automatic fallback of atomic operations to spinlocks. All properties of atomic operations are maintained. This fallback - obviously - isn't as fast as just using atomic ops, but it's not bad either. For one of the future users the atomics ontop spinlocks implementation was actually slightly faster than the old purely spinlock using implementation. That's important because it reduces the fear of regressing older platforms when improving the scalability for new ones. The API, loosely modeled after the C11 atomics support, currently provides 'atomic flags' and 32 bit unsigned integers. If the platform efficiently supports atomic 64 bit unsigned integers those are also provided. To implement atomics support for a platform/architecture/compiler for a type of atomics 32bit compare and exchange needs to be implemented. If available and more efficient native support for flags, 32 bit atomic addition, and corresponding 64 bit operations may also be provided. Additional useful atomic operations are implemented generically ontop of these. The implementation for various versions of gcc, msvc and sun studio have been tested. Additional existing stub implementations for * Intel icc * HUPX acc * IBM xlc are included but have never been tested. These will likely require fixes based on buildfarm and user feedback. As atomic operations also require barriers for some operations the existing barrier support has been moved into the atomics code. Author: Andres Freund with contributions from Oskari Saarenmaa Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com, 20131015123303.GH5300@awork2.anarazel.de, 20131028205522.GI20248@awork2.anarazel.de
2014-09-25 23:49:05 +02:00
#ifdef HAVE_GCC__SYNC_INT32_TAS
#define HAS_TEST_AND_SET
#define TAS(lock) tas(lock)
typedef int slock_t;
static __inline__ int
tas(volatile slock_t *lock)
{
return __sync_lock_test_and_set(lock, 1);
}
#define S_UNLOCK(lock) __sync_lock_release(lock)
/*
* Using an ISB instruction to delay in spinlock loops appears beneficial on
* high-core-count ARM64 processors. It seems mostly a wash for smaller gear,
* and ISB doesn't exist at all on pre-v7 ARM chips.
*/
#if defined(__aarch64__)
#define SPIN_DELAY() spin_delay()
static __inline__ void
spin_delay(void)
{
__asm__ __volatile__(
" isb; \n");
}
#endif /* __aarch64__ */
Add a basic atomic ops API abstracting away platform/architecture details. Several upcoming performance/scalability improvements require atomic operations. This new API avoids the need to splatter compiler and architecture dependent code over all the locations employing atomic ops. For several of the potential usages it'd be problematic to maintain both, a atomics using implementation and one using spinlocks or similar. In all likelihood one of the implementations would not get tested regularly under concurrency. To avoid that scenario the new API provides a automatic fallback of atomic operations to spinlocks. All properties of atomic operations are maintained. This fallback - obviously - isn't as fast as just using atomic ops, but it's not bad either. For one of the future users the atomics ontop spinlocks implementation was actually slightly faster than the old purely spinlock using implementation. That's important because it reduces the fear of regressing older platforms when improving the scalability for new ones. The API, loosely modeled after the C11 atomics support, currently provides 'atomic flags' and 32 bit unsigned integers. If the platform efficiently supports atomic 64 bit unsigned integers those are also provided. To implement atomics support for a platform/architecture/compiler for a type of atomics 32bit compare and exchange needs to be implemented. If available and more efficient native support for flags, 32 bit atomic addition, and corresponding 64 bit operations may also be provided. Additional useful atomic operations are implemented generically ontop of these. The implementation for various versions of gcc, msvc and sun studio have been tested. Additional existing stub implementations for * Intel icc * HUPX acc * IBM xlc are included but have never been tested. These will likely require fixes based on buildfarm and user feedback. As atomic operations also require barriers for some operations the existing barrier support has been moved into the atomics code. Author: Andres Freund with contributions from Oskari Saarenmaa Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com, 20131015123303.GH5300@awork2.anarazel.de, 20131028205522.GI20248@awork2.anarazel.de
2014-09-25 23:49:05 +02:00
#endif /* HAVE_GCC__SYNC_INT32_TAS */
#endif /* __arm__ || __arm || __aarch64__ */
/* S/390 and S/390x Linux (32- and 64-bit zSeries) */
2005-12-17 21:15:43 +01:00
#if defined(__s390__) || defined(__s390x__)
#define HAS_TEST_AND_SET
typedef unsigned int slock_t;
#define TAS(lock) tas(lock)
static __inline__ int
tas(volatile slock_t *lock)
{
int _res = 0;
__asm__ __volatile__(
" cs %0,%3,0(%2) \n"
: "+d"(_res), "+m"(*lock)
: "a"(lock), "d"(1)
: "memory", "cc");
return _res;
}
#endif /* __s390__ || __s390x__ */
1999-04-13 19:42:26 +02:00
2005-12-17 21:15:43 +01:00
#if defined(__sparc__) /* Sparc */
/*
* Solaris has always run sparc processors in TSO (total store) mode, but
* linux didn't use to and the *BSDs still don't. So, be careful about
* acquire/release semantics. The CPU will treat superfluous members as
* NOPs, so it's just code space.
*/
#define HAS_TEST_AND_SET
typedef unsigned char slock_t;
#define TAS(lock) tas(lock)
static __inline__ int
tas(volatile slock_t *lock)
{
slock_t _res;
/*
* See comment in src/backend/port/tas/sunstudio_sparc.s for why this
* uses "ldstub", and that file uses "cas". gcc currently generates
* sparcv7-targeted binaries, so "cas" use isn't possible.
*/
__asm__ __volatile__(
" ldstub [%2], %0 \n"
: "=r"(_res), "+m"(*lock)
: "r"(lock)
: "memory");
#if defined(__sparcv7) || defined(__sparc_v7__)
/*
* No stbar or membar available, luckily no actually produced hardware
* requires a barrier.
*/
#elif defined(__sparcv8) || defined(__sparc_v8__)
/* stbar is available (and required for both PSO, RMO), membar isn't */
__asm__ __volatile__ ("stbar \n":::"memory");
#else
/*
* #LoadStore (RMO) | #LoadLoad (RMO) together are the appropriate acquire
* barrier for sparcv8+ upwards.
*/
__asm__ __volatile__ ("membar #LoadStore | #LoadLoad \n":::"memory");
#endif
return (int) _res;
}
#if defined(__sparcv7) || defined(__sparc_v7__)
/*
* No stbar or membar available, luckily no actually produced hardware
* requires a barrier. We fall through to the default gcc definition of
* S_UNLOCK in this case.
*/
#elif defined(__sparcv8) || defined(__sparc_v8__)
/* stbar is available (and required for both PSO, RMO), membar isn't */
#define S_UNLOCK(lock) \
do \
{ \
__asm__ __volatile__ ("stbar \n":::"memory"); \
*((volatile slock_t *) (lock)) = 0; \
} while (0)
#else
/*
* #LoadStore (RMO) | #StoreStore (RMO, PSO) together are the appropriate
* release barrier for sparcv8+ upwards.
*/
#define S_UNLOCK(lock) \
do \
{ \
__asm__ __volatile__ ("membar #LoadStore | #StoreStore \n":::"memory"); \
*((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif
#endif /* __sparc__ */
2005-12-17 21:15:43 +01:00
/* PowerPC */
#if defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
#define HAS_TEST_AND_SET
typedef unsigned int slock_t;
#define TAS(lock) tas(lock)
/* On PPC, it's a win to use a non-locking test before the lwarx */
#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
/*
* The second operand of addi can hold a constant zero or a register number,
* hence constraint "=&b" to avoid allocating r0. "b" stands for "address
* base register"; most operands having this register-or-zero property are
* address bases, e.g. the second operand of lwax.
*
* NOTE: per the Enhanced PowerPC Architecture manual, v1.0 dated 7-May-2002,
* an isync is a sufficient synchronization barrier after a lwarx/stwcx loop.
* But if the spinlock is in ordinary memory, we can use lwsync instead for
* better performance.
*
* Ordinarily, we'd code the branches here using GNU-style local symbols, that
* is "1f" referencing "1:" and so on. But some people run gcc on AIX with
* IBM's assembler as backend, and IBM's assembler doesn't do local symbols.
* So hand-code the branch offsets; fortunately, all PPC instructions are
* exactly 4 bytes each, so it's not too hard to count.
*/
static __inline__ int
tas(volatile slock_t *lock)
{
slock_t _t;
int _res;
__asm__ __volatile__(
" lwarx %0,0,%3,1 \n"
" cmpwi %0,0 \n"
" bne $+16 \n" /* branch to li %1,1 */
" addi %0,%0,1 \n"
" stwcx. %0,0,%3 \n"
" beq $+12 \n" /* branch to lwsync */
" li %1,1 \n"
" b $+12 \n" /* branch to end of asm sequence */
" lwsync \n"
" li %1,0 \n"
: "=&b"(_t), "=r"(_res), "+m"(*lock)
: "r"(lock)
: "memory", "cc");
return _res;
}
/*
* PowerPC S_UNLOCK is almost standard but requires a "sync" instruction.
* But we can use lwsync instead for better performance.
*/
#define S_UNLOCK(lock) \
do \
{ \
__asm__ __volatile__ (" lwsync \n" ::: "memory"); \
*((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif /* powerpc */
2005-12-17 21:15:43 +01:00
#if defined(__mips__) && !defined(__sgi) /* non-SGI MIPS */
#define HAS_TEST_AND_SET
typedef unsigned int slock_t;
#define TAS(lock) tas(lock)
/*
* Original MIPS-I processors lacked the LL/SC instructions, but if we are
* so unfortunate as to be running on one of those, we expect that the kernel
* will handle the illegal-instruction traps and emulate them for us. On
* anything newer (and really, MIPS-I is extinct) LL/SC is the only sane
* choice because any other synchronization method must involve a kernel
* call. Unfortunately, many toolchains still default to MIPS-I as the
* codegen target; if the symbol __mips shows that that's the case, we
* have to force the assembler to accept LL/SC.
*
* R10000 and up processors require a separate SYNC, which has the same
* issues as LL/SC.
*/
#if __mips < 2
#define MIPS_SET_MIPS2 " .set mips2 \n"
#else
#define MIPS_SET_MIPS2
#endif
static __inline__ int
tas(volatile slock_t *lock)
{
volatile slock_t *_l = lock;
int _res;
int _tmp;
__asm__ __volatile__(
" .set push \n"
MIPS_SET_MIPS2
" .set noreorder \n"
" .set nomacro \n"
" ll %0, %2 \n"
" or %1, %0, 1 \n"
" sc %1, %2 \n"
" xori %1, 1 \n"
" or %0, %0, %1 \n"
" sync \n"
" .set pop "
: "=&r" (_res), "=&r" (_tmp), "+R" (*_l)
: /* no inputs */
: "memory");
return _res;
}
/* MIPS S_UNLOCK is almost standard but requires a "sync" instruction */
#define S_UNLOCK(lock) \
do \
{ \
__asm__ __volatile__( \
" .set push \n" \
MIPS_SET_MIPS2 \
" .set noreorder \n" \
" .set nomacro \n" \
" sync \n" \
" .set pop " \
: /* no outputs */ \
: /* no inputs */ \
: "memory"); \
*((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif /* __mips__ && !__sgi */
#if defined(__hppa) || defined(__hppa__) /* HP PA-RISC */
/*
* HP's PA-RISC
*
* Because LDCWX requires a 16-byte-aligned address, we declare slock_t as a
* 16-byte struct. The active word in the struct is whichever has the aligned
* address; the other three words just sit at -1.
*/
#define HAS_TEST_AND_SET
typedef struct
{
int sema[4];
} slock_t;
#define TAS_ACTIVE_WORD(lock) ((volatile int *) (((uintptr_t) (lock) + 15) & ~15))
static __inline__ int
tas(volatile slock_t *lock)
{
volatile int *lockword = TAS_ACTIVE_WORD(lock);
int lockval;
/*
* The LDCWX instruction atomically clears the target word and
* returns the previous value. Hence, if the instruction returns
* 0, someone else has already acquired the lock before we tested
* it (i.e., we have failed).
*
* Notice that this means that we actually clear the word to set
* the lock and set the word to clear the lock. This is the
* opposite behavior from the SPARC LDSTUB instruction. For some
* reason everything that H-P does is rather baroque...
*
* For details about the LDCWX instruction, see the "Precision
* Architecture and Instruction Reference Manual" (09740-90014 of June
* 1987), p. 5-38.
*/
__asm__ __volatile__(
" ldcwx 0(0,%2),%0 \n"
: "=r"(lockval), "+m"(*lockword)
: "r"(lockword)
: "memory");
return (lockval == 0);
}
#define S_UNLOCK(lock) \
do { \
__asm__ __volatile__("" : : : "memory"); \
*TAS_ACTIVE_WORD(lock) = -1; \
} while (0)
#define S_INIT_LOCK(lock) \
2001-01-19 08:03:53 +01:00
do { \
volatile slock_t *lock_ = (lock); \
lock_->sema[0] = -1; \
lock_->sema[1] = -1; \
lock_->sema[2] = -1; \
lock_->sema[3] = -1; \
2001-01-19 08:03:53 +01:00
} while (0)
#define S_LOCK_FREE(lock) (*TAS_ACTIVE_WORD(lock) != 0)
#endif /* __hppa || __hppa__ */
/*
* If we have no platform-specific knowledge, but we found that the compiler
* provides __sync_lock_test_and_set(), use that. Prefer the int-width
* version over the char-width version if we have both, on the rather dubious
* grounds that that's known to be more likely to work in the ARM ecosystem.
* (But we dealt with ARM above.)
*/
#if !defined(HAS_TEST_AND_SET)
#if defined(HAVE_GCC__SYNC_INT32_TAS)
#define HAS_TEST_AND_SET
#define TAS(lock) tas(lock)
typedef int slock_t;
static __inline__ int
tas(volatile slock_t *lock)
{
return __sync_lock_test_and_set(lock, 1);
}
#define S_UNLOCK(lock) __sync_lock_release(lock)
#elif defined(HAVE_GCC__SYNC_CHAR_TAS)
#define HAS_TEST_AND_SET
#define TAS(lock) tas(lock)
typedef char slock_t;
static __inline__ int
tas(volatile slock_t *lock)
{
return __sync_lock_test_and_set(lock, 1);
}
#define S_UNLOCK(lock) __sync_lock_release(lock)
#endif /* HAVE_GCC__SYNC_INT32_TAS */
#endif /* !defined(HAS_TEST_AND_SET) */
/*
* Default implementation of S_UNLOCK() for gcc/icc.
*
* Note that this implementation is unsafe for any platform that can reorder
* a memory access (either load or store) after a following store. That
* happens not to be possible on x86 and most legacy architectures (some are
* single-processor!), but many modern systems have weaker memory ordering.
* Those that do must define their own version of S_UNLOCK() rather than
* relying on this one.
*/
#if !defined(S_UNLOCK)
#define S_UNLOCK(lock) \
do { __asm__ __volatile__("" : : : "memory"); *(lock) = 0; } while (0)
#endif
#endif /* defined(__GNUC__) || defined(__INTEL_COMPILER) */
/*
* ---------------------------------------------------------------------
* Platforms that use non-gcc inline assembly:
* ---------------------------------------------------------------------
*/
#if !defined(HAS_TEST_AND_SET) /* We didn't trigger above, let's try here */
2005-12-17 21:39:16 +01:00
#if defined(_AIX) /* AIX */
/*
* AIX (POWER)
*/
#define HAS_TEST_AND_SET
#include <sys/atomic_op.h>
typedef int slock_t;
#define TAS(lock) _check_lock((slock_t *) (lock), 0, 1)
#define S_UNLOCK(lock) _clear_lock((slock_t *) (lock), 0)
#endif /* _AIX */
/* These are in sunstudio_(sparc|x86).s */
#if defined(__SUNPRO_C) && (defined(__i386) || defined(__x86_64__) || defined(__sparc__) || defined(__sparc))
#define HAS_TEST_AND_SET
2006-05-19 15:10:11 +02:00
#if defined(__i386) || defined(__x86_64__) || defined(__sparcv9) || defined(__sparcv8plus)
typedef unsigned int slock_t;
#else
typedef unsigned char slock_t;
#endif
extern slock_t pg_atomic_cas(volatile slock_t *lock, slock_t with,
slock_t cmp);
#define TAS(a) (pg_atomic_cas((a), 1, 0) != 0)
#endif
#ifdef _MSC_VER
typedef LONG slock_t;
#define HAS_TEST_AND_SET
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
#define SPIN_DELAY() spin_delay()
/* If using Visual C++ on Win64, inline assembly is unavailable.
* Use a _mm_pause intrinsic instead of rep nop.
*/
#if defined(_WIN64)
static __forceinline void
spin_delay(void)
{
_mm_pause();
}
#else
static __forceinline void
spin_delay(void)
{
/* See comment for gcc code. Same code, MASM syntax */
__asm rep nop;
}
#endif
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
#define S_UNLOCK(lock) \
do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
#endif
#endif /* !defined(HAS_TEST_AND_SET) */
/* Blow up if we didn't have any way to do spinlocks */
#ifndef HAS_TEST_AND_SET
#error PostgreSQL does not have native spinlock support on this platform. To continue the compilation, rerun configure using --disable-spinlocks. However, performance will be poor. Please report this to pgsql-bugs@lists.postgresql.org.
#endif
#else /* !HAVE_SPINLOCKS */
/*
* Fake spinlock implementation using semaphores --- slow and prone
* to fall foul of kernel limits on number of semaphores, so don't use this
* unless you must! The subroutines appear in spin.c.
*/
typedef int slock_t;
extern bool s_lock_free_sema(volatile slock_t *lock);
extern void s_unlock_sema(volatile slock_t *lock);
Add a basic atomic ops API abstracting away platform/architecture details. Several upcoming performance/scalability improvements require atomic operations. This new API avoids the need to splatter compiler and architecture dependent code over all the locations employing atomic ops. For several of the potential usages it'd be problematic to maintain both, a atomics using implementation and one using spinlocks or similar. In all likelihood one of the implementations would not get tested regularly under concurrency. To avoid that scenario the new API provides a automatic fallback of atomic operations to spinlocks. All properties of atomic operations are maintained. This fallback - obviously - isn't as fast as just using atomic ops, but it's not bad either. For one of the future users the atomics ontop spinlocks implementation was actually slightly faster than the old purely spinlock using implementation. That's important because it reduces the fear of regressing older platforms when improving the scalability for new ones. The API, loosely modeled after the C11 atomics support, currently provides 'atomic flags' and 32 bit unsigned integers. If the platform efficiently supports atomic 64 bit unsigned integers those are also provided. To implement atomics support for a platform/architecture/compiler for a type of atomics 32bit compare and exchange needs to be implemented. If available and more efficient native support for flags, 32 bit atomic addition, and corresponding 64 bit operations may also be provided. Additional useful atomic operations are implemented generically ontop of these. The implementation for various versions of gcc, msvc and sun studio have been tested. Additional existing stub implementations for * Intel icc * HUPX acc * IBM xlc are included but have never been tested. These will likely require fixes based on buildfarm and user feedback. As atomic operations also require barriers for some operations the existing barrier support has been moved into the atomics code. Author: Andres Freund with contributions from Oskari Saarenmaa Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com, 20131015123303.GH5300@awork2.anarazel.de, 20131028205522.GI20248@awork2.anarazel.de
2014-09-25 23:49:05 +02:00
extern void s_init_lock_sema(volatile slock_t *lock, bool nested);
extern int tas_sema(volatile slock_t *lock);
#define S_LOCK_FREE(lock) s_lock_free_sema(lock)
#define S_UNLOCK(lock) s_unlock_sema(lock)
Add a basic atomic ops API abstracting away platform/architecture details. Several upcoming performance/scalability improvements require atomic operations. This new API avoids the need to splatter compiler and architecture dependent code over all the locations employing atomic ops. For several of the potential usages it'd be problematic to maintain both, a atomics using implementation and one using spinlocks or similar. In all likelihood one of the implementations would not get tested regularly under concurrency. To avoid that scenario the new API provides a automatic fallback of atomic operations to spinlocks. All properties of atomic operations are maintained. This fallback - obviously - isn't as fast as just using atomic ops, but it's not bad either. For one of the future users the atomics ontop spinlocks implementation was actually slightly faster than the old purely spinlock using implementation. That's important because it reduces the fear of regressing older platforms when improving the scalability for new ones. The API, loosely modeled after the C11 atomics support, currently provides 'atomic flags' and 32 bit unsigned integers. If the platform efficiently supports atomic 64 bit unsigned integers those are also provided. To implement atomics support for a platform/architecture/compiler for a type of atomics 32bit compare and exchange needs to be implemented. If available and more efficient native support for flags, 32 bit atomic addition, and corresponding 64 bit operations may also be provided. Additional useful atomic operations are implemented generically ontop of these. The implementation for various versions of gcc, msvc and sun studio have been tested. Additional existing stub implementations for * Intel icc * HUPX acc * IBM xlc are included but have never been tested. These will likely require fixes based on buildfarm and user feedback. As atomic operations also require barriers for some operations the existing barrier support has been moved into the atomics code. Author: Andres Freund with contributions from Oskari Saarenmaa Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com, 20131015123303.GH5300@awork2.anarazel.de, 20131028205522.GI20248@awork2.anarazel.de
2014-09-25 23:49:05 +02:00
#define S_INIT_LOCK(lock) s_init_lock_sema(lock, false)
#define TAS(lock) tas_sema(lock)
#endif /* HAVE_SPINLOCKS */
/*
* Default Definitions - override these above as needed.
*/
1997-09-18 16:21:02 +02:00
#if !defined(S_LOCK)
#define S_LOCK(lock) \
(TAS(lock) ? s_lock((lock), __FILE__, __LINE__, __func__) : 0)
#endif /* S_LOCK */
1997-09-18 16:21:02 +02:00
#if !defined(S_LOCK_FREE)
#define S_LOCK_FREE(lock) (*(lock) == 0)
#endif /* S_LOCK_FREE */
#if !defined(S_UNLOCK)
/*
* Our default implementation of S_UNLOCK is essentially *(lock) = 0. This
* is unsafe if the platform can reorder a memory access (either load or
* store) after a following store; platforms where this is possible must
* define their own S_UNLOCK. But CPU reordering is not the only concern:
* if we simply defined S_UNLOCK() as an inline macro, the compiler might
* reorder instructions from inside the critical section to occur after the
* lock release. Since the compiler probably can't know what the external
* function s_unlock is doing, putting the same logic there should be adequate.
* A sufficiently-smart globally optimizing compiler could break that
* assumption, though, and the cost of a function call for every spinlock
* release may hurt performance significantly, so we use this implementation
* only for platforms where we don't know of a suitable intrinsic. For the
* most part, those are relatively obscure platform/compiler combinations to
* which the PostgreSQL project does not have access.
*/
#define USE_DEFAULT_S_UNLOCK
extern void s_unlock(volatile slock_t *lock);
#define S_UNLOCK(lock) s_unlock(lock)
#endif /* S_UNLOCK */
1997-09-18 16:21:02 +02:00
#if !defined(S_INIT_LOCK)
1997-09-18 16:21:02 +02:00
#define S_INIT_LOCK(lock) S_UNLOCK(lock)
#endif /* S_INIT_LOCK */
1997-09-18 16:21:02 +02:00
#if !defined(SPIN_DELAY)
#define SPIN_DELAY() ((void) 0)
#endif /* SPIN_DELAY */
#if !defined(TAS)
extern int tas(volatile slock_t *lock); /* in port/.../tas.s, or
* s_lock.c */
1997-09-25 01:37:26 +02:00
#define TAS(lock) tas(lock)
#endif /* TAS */
#if !defined(TAS_SPIN)
#define TAS_SPIN(lock) TAS(lock)
#endif /* TAS_SPIN */
extern PGDLLIMPORT slock_t dummy_spinlock;
/*
* Platform-independent out-of-line support routines
*/
extern int s_lock(volatile slock_t *lock, const char *file, int line, const char *func);
/* Support for dynamic adjustment of spins_per_delay */
#define DEFAULT_SPINS_PER_DELAY 100
extern void set_spins_per_delay(int shared_spins_per_delay);
extern int update_spins_per_delay(int shared_spins_per_delay);
Allow Pin/UnpinBuffer to operate in a lockfree manner. Pinning/Unpinning a buffer is a very frequent operation; especially in read-mostly cache resident workloads. Benchmarking shows that in various scenarios the spinlock protecting a buffer header's state becomes a significant bottleneck. The problem can be reproduced with pgbench -S on larger machines, but can be considerably worse for queries which touch the same buffers over and over at a high frequency (e.g. nested loops over a small inner table). To allow atomic operations to be used, cram BufferDesc's flags, usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable; that allows to manipulate them together using 32bit compare-and-swap operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could be lifted by using a 64bit field, but it's not a realistic configuration atm). As not all operations can easily implemented in a lockfree manner, implement the previous buf_hdr_lock via a flag bit in the atomic variable. That way we can continue to lock the header in places where it's needed, but can get away without acquiring it in the more frequent hot-paths. There's some additional operations which can be done without the lock, but aren't in this patch; but the most important places are covered. As bufmgr.c now essentially re-implements spinlocks, abstract the delay logic from s_lock.c into something more generic. It now has already two users, and more are coming up; there's a follupw patch for lwlock.c at least. This patch is based on a proof-of-concept written by me, which Alexander Korotkov made into a fully working patch; the committed version is again revised by me. Benchmarking and testing has, amongst others, been provided by Dilip Kumar, Alexander Korotkov, Robert Haas. On a large x86 system improvements for readonly pgbench, with a high client count, of a factor of 8 have been observed. Author: Alexander Korotkov and Andres Freund Discussion: 2400449.GjM57CE0Yg@dinodell
2016-04-11 05:12:32 +02:00
/*
* Support for spin delay which is useful in various places where
* spinlock-like procedures take place.
*/
typedef struct
{
int spins;
int delays;
int cur_delay;
const char *file;
int line;
const char *func;
Allow Pin/UnpinBuffer to operate in a lockfree manner. Pinning/Unpinning a buffer is a very frequent operation; especially in read-mostly cache resident workloads. Benchmarking shows that in various scenarios the spinlock protecting a buffer header's state becomes a significant bottleneck. The problem can be reproduced with pgbench -S on larger machines, but can be considerably worse for queries which touch the same buffers over and over at a high frequency (e.g. nested loops over a small inner table). To allow atomic operations to be used, cram BufferDesc's flags, usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable; that allows to manipulate them together using 32bit compare-and-swap operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could be lifted by using a 64bit field, but it's not a realistic configuration atm). As not all operations can easily implemented in a lockfree manner, implement the previous buf_hdr_lock via a flag bit in the atomic variable. That way we can continue to lock the header in places where it's needed, but can get away without acquiring it in the more frequent hot-paths. There's some additional operations which can be done without the lock, but aren't in this patch; but the most important places are covered. As bufmgr.c now essentially re-implements spinlocks, abstract the delay logic from s_lock.c into something more generic. It now has already two users, and more are coming up; there's a follupw patch for lwlock.c at least. This patch is based on a proof-of-concept written by me, which Alexander Korotkov made into a fully working patch; the committed version is again revised by me. Benchmarking and testing has, amongst others, been provided by Dilip Kumar, Alexander Korotkov, Robert Haas. On a large x86 system improvements for readonly pgbench, with a high client count, of a factor of 8 have been observed. Author: Alexander Korotkov and Andres Freund Discussion: 2400449.GjM57CE0Yg@dinodell
2016-04-11 05:12:32 +02:00
} SpinDelayStatus;
static inline void
init_spin_delay(SpinDelayStatus *status,
const char *file, int line, const char *func)
{
status->spins = 0;
status->delays = 0;
status->cur_delay = 0;
status->file = file;
status->line = line;
status->func = func;
}
#define init_local_spin_delay(status) init_spin_delay(status, __FILE__, __LINE__, __func__)
extern void perform_spin_delay(SpinDelayStatus *status);
extern void finish_spin_delay(SpinDelayStatus *status);
Allow Pin/UnpinBuffer to operate in a lockfree manner. Pinning/Unpinning a buffer is a very frequent operation; especially in read-mostly cache resident workloads. Benchmarking shows that in various scenarios the spinlock protecting a buffer header's state becomes a significant bottleneck. The problem can be reproduced with pgbench -S on larger machines, but can be considerably worse for queries which touch the same buffers over and over at a high frequency (e.g. nested loops over a small inner table). To allow atomic operations to be used, cram BufferDesc's flags, usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable; that allows to manipulate them together using 32bit compare-and-swap operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could be lifted by using a 64bit field, but it's not a realistic configuration atm). As not all operations can easily implemented in a lockfree manner, implement the previous buf_hdr_lock via a flag bit in the atomic variable. That way we can continue to lock the header in places where it's needed, but can get away without acquiring it in the more frequent hot-paths. There's some additional operations which can be done without the lock, but aren't in this patch; but the most important places are covered. As bufmgr.c now essentially re-implements spinlocks, abstract the delay logic from s_lock.c into something more generic. It now has already two users, and more are coming up; there's a follupw patch for lwlock.c at least. This patch is based on a proof-of-concept written by me, which Alexander Korotkov made into a fully working patch; the committed version is again revised by me. Benchmarking and testing has, amongst others, been provided by Dilip Kumar, Alexander Korotkov, Robert Haas. On a large x86 system improvements for readonly pgbench, with a high client count, of a factor of 8 have been observed. Author: Alexander Korotkov and Andres Freund Discussion: 2400449.GjM57CE0Yg@dinodell
2016-04-11 05:12:32 +02:00
#endif /* S_LOCK_H */