Optimize pg_popcount() with AVX-512 instructions.
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers. This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions. If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
2024-04-07 04:56:23 +02:00
|
|
|
/*-------------------------------------------------------------------------
|
|
|
|
*
|
|
|
|
* pg_popcount_avx512_choose.c
|
|
|
|
* Test whether we can use the AVX-512 pg_popcount() implementation.
|
|
|
|
*
|
|
|
|
* Copyright (c) 2024, PostgreSQL Global Development Group
|
|
|
|
*
|
|
|
|
* IDENTIFICATION
|
|
|
|
* src/port/pg_popcount_avx512_choose.c
|
|
|
|
*
|
|
|
|
*-------------------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
#include "c.h"
|
|
|
|
|
|
|
|
#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
|
|
|
|
#include <cpuid.h>
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifdef HAVE_XSAVE_INTRINSICS
|
|
|
|
#include <immintrin.h>
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
|
|
|
|
#include <intrin.h>
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#include "port/pg_bitutils.h"
|
|
|
|
|
|
|
|
/*
|
|
|
|
* It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
|
|
|
|
* use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on
|
|
|
|
* the function pointers that are only used when TRY_POPCNT_FAST is set.
|
|
|
|
*/
|
|
|
|
#ifdef TRY_POPCNT_FAST
|
|
|
|
|
|
|
|
/*
|
2024-04-23 17:54:04 +02:00
|
|
|
* Does CPUID say there's support for XSAVE instructions?
|
Optimize pg_popcount() with AVX-512 instructions.
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers. This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions. If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
2024-04-07 04:56:23 +02:00
|
|
|
*/
|
2024-04-23 17:54:04 +02:00
|
|
|
static inline bool
|
|
|
|
xsave_available(void)
|
Optimize pg_popcount() with AVX-512 instructions.
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers. This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions. If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
2024-04-07 04:56:23 +02:00
|
|
|
{
|
|
|
|
unsigned int exx[4] = {0, 0, 0, 0};
|
|
|
|
|
|
|
|
#if defined(HAVE__GET_CPUID)
|
|
|
|
__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
|
|
|
|
#elif defined(HAVE__CPUID)
|
|
|
|
__cpuid(exx, 1);
|
|
|
|
#else
|
|
|
|
#error cpuid instruction not available
|
|
|
|
#endif
|
2024-04-23 17:54:04 +02:00
|
|
|
return (exx[2] & (1 << 27)) != 0; /* osxsave */
|
|
|
|
}
|
Optimize pg_popcount() with AVX-512 instructions.
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers. This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions. If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
2024-04-07 04:56:23 +02:00
|
|
|
|
2024-04-23 17:54:04 +02:00
|
|
|
/*
|
|
|
|
* Does XGETBV say the ZMM registers are enabled?
|
|
|
|
*
|
|
|
|
* NB: Caller is responsible for verifying that xsave_available() returns true
|
|
|
|
* before calling this.
|
|
|
|
*/
|
|
|
|
static inline bool
|
|
|
|
zmm_regs_available(void)
|
|
|
|
{
|
Optimize pg_popcount() with AVX-512 instructions.
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers. This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions. If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
2024-04-07 04:56:23 +02:00
|
|
|
#ifdef HAVE_XSAVE_INTRINSICS
|
2024-04-23 17:54:04 +02:00
|
|
|
return (_xgetbv(0) & 0xe6) == 0xe6;
|
Optimize pg_popcount() with AVX-512 instructions.
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers. This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions. If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
2024-04-07 04:56:23 +02:00
|
|
|
#else
|
|
|
|
return false;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2024-04-23 17:54:04 +02:00
|
|
|
/*
|
|
|
|
* Does CPUID say there's support for AVX-512 popcount and byte-and-word
|
|
|
|
* instructions?
|
|
|
|
*/
|
|
|
|
static inline bool
|
|
|
|
avx512_popcnt_available(void)
|
|
|
|
{
|
|
|
|
unsigned int exx[4] = {0, 0, 0, 0};
|
|
|
|
|
|
|
|
#if defined(HAVE__GET_CPUID_COUNT)
|
|
|
|
__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
|
|
|
|
#elif defined(HAVE__CPUIDEX)
|
|
|
|
__cpuidex(exx, 7, 0);
|
|
|
|
#else
|
|
|
|
#error cpuid instruction not available
|
|
|
|
#endif
|
|
|
|
return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
|
|
|
|
(exx[1] & (1 << 30)) != 0; /* avx512-bw */
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns true if the CPU supports the instructions required for the AVX-512
|
|
|
|
* pg_popcount() implementation.
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
pg_popcount_avx512_available(void)
|
|
|
|
{
|
|
|
|
return xsave_available() &&
|
|
|
|
zmm_regs_available() &&
|
|
|
|
avx512_popcnt_available();
|
|
|
|
}
|
|
|
|
|
Optimize pg_popcount() with AVX-512 instructions.
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers. This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions. If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
2024-04-07 04:56:23 +02:00
|
|
|
#endif /* TRY_POPCNT_FAST */
|