postgresql/src/port/pg_crc32c_sse42.c

/*-------------------------------------------------------------------------
 *
 * pg_crc32c_sse42.c
 *	  Compute CRC-32C checksum using Intel SSE 4.2 instructions.
 *
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/port/pg_crc32c_sse42.c
 *
 *-------------------------------------------------------------------------
 */
#include "c.h"

#include "port/pg_crc32c.h"

#include <nmmintrin.h>

pg_crc32c
pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
{
	const unsigned char *p = data;
	const unsigned char *pend = p + len;

	/*
	 * Process eight bytes of data at a time.
	 *
	 * NB: We do unaligned accesses here. The Intel architecture allows that,
	 * and performance testing didn't show any performance gain from aligning
	 * the begin address.
	 */
#ifdef __x86_64__
	while (p + 8 <= pend)
	{
		crc = (uint32) _mm_crc32_u64(crc, *((const uint64 *) p));
		p += 8;
	}

	/* Process remaining full four bytes if any */
	if (p + 4 <= pend)
	{
		crc = _mm_crc32_u32(crc, *((const unsigned int *) p));
		p += 4;
	}
#else

	/*
	 * Process four bytes at a time. (The eight byte instruction is not
	 * available on the 32-bit x86 architecture).
	 */
	while (p + 4 <= pend)
	{
		crc = _mm_crc32_u32(crc, *((const unsigned int *) p));
		p += 4;
	}
#endif							/* __x86_64__ */

	/* Process any remaining bytes one at a time. */
	while (p < pend)
	{
		crc = _mm_crc32_u8(crc, *p);
		p++;
	}

	return crc;
}
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00			`/*-------------------------------------------------------------------------`
			`*`
			`* pg_crc32c_sse42.c`
			`* Compute CRC-32C checksum using Intel SSE 4.2 instructions.`
			`*`
Update copyright for 2019 Backpatch-through: certain files through 9.4 2019-01-02 18:44:25 +01:00			`* Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group`
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00			`* Portions Copyright (c) 1994, Regents of the University of California`
			`*`
			`*`
			`* IDENTIFICATION`
			`* src/port/pg_crc32c_sse42.c`
			`*`
			`*-------------------------------------------------------------------------`
			`*/`
			`#include "c.h"`

			`#include "port/pg_crc32c.h"`

			`#include <nmmintrin.h>`

			`pg_crc32c`
			`pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)`
			`{`
			`const unsigned char *p = data;`
Optimize pg_comp_crc32c_sse42 routine slightly, and also use it on x86. Eliminate the separate 'len' variable from the loops, and also use the 4 byte instruction. This shaves off a few more cycles. Even though this routine that uses the special SSE 4.2 instructions is much faster than a generic routine, it's still a hot spot, so let's make it as fast as possible. Change the configure test to not test _mm_crc32_u64. That variant is only available in the 64-bit x86-64 architecture, not in 32-bit x86. Modify pg_comp_crc32c_sse42 so that it only uses _mm_crc32_u64 on x86-64. With these changes, the SSE accelerated CRC-32C implementation can also be used on 32-bit x86 systems. This also fixes the 32-bit MSVC build. 2015-04-14 22:58:16 +02:00			`const unsigned char *pend = p + len;`
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00
			`/*`
			`* Process eight bytes of data at a time.`
			`*`
Optimize pg_comp_crc32c_sse42 routine slightly, and also use it on x86. Eliminate the separate 'len' variable from the loops, and also use the 4 byte instruction. This shaves off a few more cycles. Even though this routine that uses the special SSE 4.2 instructions is much faster than a generic routine, it's still a hot spot, so let's make it as fast as possible. Change the configure test to not test _mm_crc32_u64. That variant is only available in the 64-bit x86-64 architecture, not in 32-bit x86. Modify pg_comp_crc32c_sse42 so that it only uses _mm_crc32_u64 on x86-64. With these changes, the SSE accelerated CRC-32C implementation can also be used on 32-bit x86 systems. This also fixes the 32-bit MSVC build. 2015-04-14 22:58:16 +02:00			`* NB: We do unaligned accesses here. The Intel architecture allows that,`
			`* and performance testing didn't show any performance gain from aligning`
			`* the begin address.`
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00			`*/`
Optimize pg_comp_crc32c_sse42 routine slightly, and also use it on x86. Eliminate the separate 'len' variable from the loops, and also use the 4 byte instruction. This shaves off a few more cycles. Even though this routine that uses the special SSE 4.2 instructions is much faster than a generic routine, it's still a hot spot, so let's make it as fast as possible. Change the configure test to not test _mm_crc32_u64. That variant is only available in the 64-bit x86-64 architecture, not in 32-bit x86. Modify pg_comp_crc32c_sse42 so that it only uses _mm_crc32_u64 on x86-64. With these changes, the SSE accelerated CRC-32C implementation can also be used on 32-bit x86 systems. This also fixes the 32-bit MSVC build. 2015-04-14 22:58:16 +02:00			`#ifdef __x86_64__`
			`while (p + 8 <= pend)`
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00			`{`
Optimize pg_comp_crc32c_sse42 routine slightly, and also use it on x86. Eliminate the separate 'len' variable from the loops, and also use the 4 byte instruction. This shaves off a few more cycles. Even though this routine that uses the special SSE 4.2 instructions is much faster than a generic routine, it's still a hot spot, so let's make it as fast as possible. Change the configure test to not test _mm_crc32_u64. That variant is only available in the 64-bit x86-64 architecture, not in 32-bit x86. Modify pg_comp_crc32c_sse42 so that it only uses _mm_crc32_u64 on x86-64. With these changes, the SSE accelerated CRC-32C implementation can also be used on 32-bit x86 systems. This also fixes the 32-bit MSVC build. 2015-04-14 22:58:16 +02:00			`crc = (uint32) _mm_crc32_u64(crc, ((const uint64 ) p));`
			`p += 8;`
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00			`}`

Optimize pg_comp_crc32c_sse42 routine slightly, and also use it on x86. Eliminate the separate 'len' variable from the loops, and also use the 4 byte instruction. This shaves off a few more cycles. Even though this routine that uses the special SSE 4.2 instructions is much faster than a generic routine, it's still a hot spot, so let's make it as fast as possible. Change the configure test to not test _mm_crc32_u64. That variant is only available in the 64-bit x86-64 architecture, not in 32-bit x86. Modify pg_comp_crc32c_sse42 so that it only uses _mm_crc32_u64 on x86-64. With these changes, the SSE accelerated CRC-32C implementation can also be used on 32-bit x86 systems. This also fixes the 32-bit MSVC build. 2015-04-14 22:58:16 +02:00			`/* Process remaining full four bytes if any */`
			`if (p + 4 <= pend)`
			`{`
			`crc = _mm_crc32_u32(crc, ((const unsigned int ) p));`
			`p += 4;`
			`}`
			`#else`
pgindent run for 9.5 2015-05-24 03:35:49 +02:00
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00			`/*`
Optimize pg_comp_crc32c_sse42 routine slightly, and also use it on x86. Eliminate the separate 'len' variable from the loops, and also use the 4 byte instruction. This shaves off a few more cycles. Even though this routine that uses the special SSE 4.2 instructions is much faster than a generic routine, it's still a hot spot, so let's make it as fast as possible. Change the configure test to not test _mm_crc32_u64. That variant is only available in the 64-bit x86-64 architecture, not in 32-bit x86. Modify pg_comp_crc32c_sse42 so that it only uses _mm_crc32_u64 on x86-64. With these changes, the SSE accelerated CRC-32C implementation can also be used on 32-bit x86 systems. This also fixes the 32-bit MSVC build. 2015-04-14 22:58:16 +02:00			`* Process four bytes at a time. (The eight byte instruction is not`
			`* available on the 32-bit x86 architecture).`
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00			`*/`
Optimize pg_comp_crc32c_sse42 routine slightly, and also use it on x86. Eliminate the separate 'len' variable from the loops, and also use the 4 byte instruction. This shaves off a few more cycles. Even though this routine that uses the special SSE 4.2 instructions is much faster than a generic routine, it's still a hot spot, so let's make it as fast as possible. Change the configure test to not test _mm_crc32_u64. That variant is only available in the 64-bit x86-64 architecture, not in 32-bit x86. Modify pg_comp_crc32c_sse42 so that it only uses _mm_crc32_u64 on x86-64. With these changes, the SSE accelerated CRC-32C implementation can also be used on 32-bit x86 systems. This also fixes the 32-bit MSVC build. 2015-04-14 22:58:16 +02:00			`while (p + 4 <= pend)`
			`{`
			`crc = _mm_crc32_u32(crc, ((const unsigned int ) p));`
			`p += 4;`
			`}`
Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us 2017-06-21 21:18:54 +02:00			`#endif /* __x86_64__ */`
Optimize pg_comp_crc32c_sse42 routine slightly, and also use it on x86. Eliminate the separate 'len' variable from the loops, and also use the 4 byte instruction. This shaves off a few more cycles. Even though this routine that uses the special SSE 4.2 instructions is much faster than a generic routine, it's still a hot spot, so let's make it as fast as possible. Change the configure test to not test _mm_crc32_u64. That variant is only available in the 64-bit x86-64 architecture, not in 32-bit x86. Modify pg_comp_crc32c_sse42 so that it only uses _mm_crc32_u64 on x86-64. With these changes, the SSE accelerated CRC-32C implementation can also be used on 32-bit x86 systems. This also fixes the 32-bit MSVC build. 2015-04-14 22:58:16 +02:00
			`/* Process any remaining bytes one at a time. */`
			`while (p < pend)`
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00			`{`
Optimize pg_comp_crc32c_sse42 routine slightly, and also use it on x86. Eliminate the separate 'len' variable from the loops, and also use the 4 byte instruction. This shaves off a few more cycles. Even though this routine that uses the special SSE 4.2 instructions is much faster than a generic routine, it's still a hot spot, so let's make it as fast as possible. Change the configure test to not test _mm_crc32_u64. That variant is only available in the 64-bit x86-64 architecture, not in 32-bit x86. Modify pg_comp_crc32c_sse42 so that it only uses _mm_crc32_u64 on x86-64. With these changes, the SSE accelerated CRC-32C implementation can also be used on 32-bit x86 systems. This also fixes the 32-bit MSVC build. 2015-04-14 22:58:16 +02:00			`crc = _mm_crc32_u8(crc, *p);`
			`p++;`
Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. 2015-04-14 16:05:03 +02:00			`}`

			`return crc;`
			`}`