postgresql/src/include/lib/simplehash.h

1079 lines
28 KiB
C
Raw Normal View History

Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/*
* simplehash.h
*
* Hash table implementation which will be specialized to user-defined
* types, by including this file to generate the required code. It's
* probably not worthwhile to do so for hash tables that aren't performance
* or space sensitive.
*
* Usage notes:
*
* To generate a hash-table and associated functions for a use case several
* macros have to be #define'ed before this file is included. Including
* the file #undef's all those, so a new hash table can be generated
* afterwards.
* The relevant parameters are:
* - SH_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
* will result in hash table type 'foo_hash' and functions like
* 'foo_insert'/'foo_lookup' and so forth.
* - SH_ELEMENT_TYPE - type of the contained elements
* - SH_KEY_TYPE - type of the hashtable's key
* - SH_DECLARE - if defined function prototypes and type declarations are
* generated
* - SH_DEFINE - if defined function definitions are generated
* - SH_SCOPE - in which scope (e.g. extern, static inline) do function
* declarations reside
* - SH_RAW_ALLOCATOR - if defined, memory contexts are not used; instead,
* use this to allocate bytes
* - SH_USE_NONDEFAULT_ALLOCATOR - if defined no element allocator functions
* are defined, so you can supply your own
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
* The following parameters are only relevant when SH_DEFINE is defined:
* - SH_KEY - name of the element in SH_ELEMENT_TYPE containing the hash key
* - SH_EQUAL(table, a, b) - compare two table keys
* - SH_HASH_KEY(table, key) - generate hash for the key
* - SH_STORE_HASH - if defined the hash is stored in the elements
* - SH_GET_HASH(tb, a) - return the field to store the hash in
*
* For examples of usage look at tidbitmap.c (file local definition) and
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
* execnodes.h/execGrouping.c (exposed declaration, file local
* implementation).
*
* Hash table design:
*
* The hash table design chosen is a variant of linear open-addressing. The
* reason for doing so is that linear addressing is CPU cache & pipeline
* friendly. The biggest disadvantage of simple linear addressing schemes
* are highly variable lookup times due to clustering, and deletions
* leaving a lot of tombstones around. To address these issues a variant
* of "robin hood" hashing is employed. Robin hood hashing optimizes
* chaining lengths by moving elements close to their optimal bucket
* ("rich" elements), out of the way if a to-be-inserted element is further
* away from its optimal position (i.e. it's "poor"). While that can make
* insertions slower, the average lookup performance is a lot better, and
* higher fill factors can be used in a still performant manner. To avoid
* tombstones - which normally solve the issue that a deleted node's
* presence is relevant to determine whether a lookup needs to continue
* looking or is done - buckets following a deleted element are shifted
* backwards, unless they're empty or already at their optimal position.
*/
/* helpers */
#define SH_MAKE_PREFIX(a) CppConcat(a,_)
#define SH_MAKE_NAME(name) SH_MAKE_NAME_(SH_MAKE_PREFIX(SH_PREFIX),name)
#define SH_MAKE_NAME_(a,b) CppConcat(a,b)
/* name macros for: */
/* type declarations */
#define SH_TYPE SH_MAKE_NAME(hash)
#define SH_STATUS SH_MAKE_NAME(status)
#define SH_STATUS_EMPTY SH_MAKE_NAME(SH_EMPTY)
#define SH_STATUS_IN_USE SH_MAKE_NAME(SH_IN_USE)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#define SH_ITERATOR SH_MAKE_NAME(iterator)
/* function declarations */
#define SH_CREATE SH_MAKE_NAME(create)
#define SH_DESTROY SH_MAKE_NAME(destroy)
#define SH_RESET SH_MAKE_NAME(reset)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#define SH_INSERT SH_MAKE_NAME(insert)
#define SH_INSERT_HASH SH_MAKE_NAME(insert_hash)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#define SH_DELETE SH_MAKE_NAME(delete)
#define SH_LOOKUP SH_MAKE_NAME(lookup)
#define SH_LOOKUP_HASH SH_MAKE_NAME(lookup_hash)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#define SH_GROW SH_MAKE_NAME(grow)
#define SH_START_ITERATE SH_MAKE_NAME(start_iterate)
#define SH_START_ITERATE_AT SH_MAKE_NAME(start_iterate_at)
#define SH_ITERATE SH_MAKE_NAME(iterate)
#define SH_ALLOCATE SH_MAKE_NAME(allocate)
#define SH_FREE SH_MAKE_NAME(free)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#define SH_STAT SH_MAKE_NAME(stat)
/* internal helper functions (no externally visible prototypes) */
#define SH_COMPUTE_PARAMETERS SH_MAKE_NAME(compute_parameters)
#define SH_NEXT SH_MAKE_NAME(next)
#define SH_PREV SH_MAKE_NAME(prev)
#define SH_DISTANCE_FROM_OPTIMAL SH_MAKE_NAME(distance)
#define SH_INITIAL_BUCKET SH_MAKE_NAME(initial_bucket)
#define SH_ENTRY_HASH SH_MAKE_NAME(entry_hash)
#define SH_INSERT_HASH_INTERNAL SH_MAKE_NAME(insert_hash_internal)
#define SH_LOOKUP_HASH_INTERNAL SH_MAKE_NAME(lookup_hash_internal)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* generate forward declarations necessary to use the hash table */
#ifdef SH_DECLARE
/* type definitions */
typedef struct SH_TYPE
{
/*
* Size of data / bucket array, 64 bits to handle UINT32_MAX sized hash
* tables. Note that the maximum number of elements is lower
* (SH_MAX_FILLFACTOR)
*/
uint64 size;
/* how many elements have valid contents */
uint32 members;
/* mask for bucket and size calculations, based on size */
uint32 sizemask;
/* boundary after which to grow hashtable */
uint32 grow_threshold;
/* hash buckets */
SH_ELEMENT_TYPE *data;
#ifndef SH_RAW_ALLOCATOR
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* memory context to use for allocations */
MemoryContext ctx;
#endif
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* user defined data, useful for callbacks */
void *private_data;
2017-06-21 20:39:04 +02:00
} SH_TYPE;
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
typedef enum SH_STATUS
{
SH_STATUS_EMPTY = 0x00,
SH_STATUS_IN_USE = 0x01
} SH_STATUS;
typedef struct SH_ITERATOR
{
uint32 cur; /* current element */
uint32 end;
bool done; /* iterator exhausted? */
2017-06-21 20:39:04 +02:00
} SH_ITERATOR;
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* externally visible function prototypes */
#ifdef SH_RAW_ALLOCATOR
SH_SCOPE SH_TYPE *SH_CREATE(uint32 nelements, void *private_data);
#else
2017-06-21 20:39:04 +02:00
SH_SCOPE SH_TYPE *SH_CREATE(MemoryContext ctx, uint32 nelements,
void *private_data);
#endif
SH_SCOPE void SH_DESTROY(SH_TYPE * tb);
SH_SCOPE void SH_RESET(SH_TYPE * tb);
SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize);
2017-06-21 20:39:04 +02:00
SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found);
SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
uint32 hash, bool *found);
2017-06-21 20:39:04 +02:00
SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key);
SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
uint32 hash);
SH_SCOPE bool SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key);
SH_SCOPE void SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter);
SH_SCOPE void SH_START_ITERATE_AT(SH_TYPE * tb, SH_ITERATOR * iter, uint32 at);
2017-06-21 20:39:04 +02:00
SH_SCOPE SH_ELEMENT_TYPE *SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter);
SH_SCOPE void SH_STAT(SH_TYPE * tb);
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
#endif /* SH_DECLARE */
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* generate implementation of the hash table */
#ifdef SH_DEFINE
#ifndef SH_RAW_ALLOCATOR
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#include "utils/memutils.h"
#endif
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* max data array size,we allow up to PG_UINT32_MAX buckets, including 0 */
#define SH_MAX_SIZE (((uint64) PG_UINT32_MAX) + 1)
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
/* normal fillfactor, unless already close to maximum */
#ifndef SH_FILLFACTOR
#define SH_FILLFACTOR (0.9)
#endif
/* increase fillfactor if we otherwise would error out */
#define SH_MAX_FILLFACTOR (0.98)
/* grow if actual and optimal location bigger than */
#ifndef SH_GROW_MAX_DIB
#define SH_GROW_MAX_DIB 25
#endif
/* grow if more than elements to move when inserting */
#ifndef SH_GROW_MAX_MOVE
#define SH_GROW_MAX_MOVE 150
#endif
#ifndef SH_GROW_MIN_FILLFACTOR
/* but do not grow due to SH_GROW_MAX_* if below */
#define SH_GROW_MIN_FILLFACTOR 0.1
#endif
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#ifdef SH_STORE_HASH
#define SH_COMPARE_KEYS(tb, ahash, akey, b) (ahash == SH_GET_HASH(tb, b) && SH_EQUAL(tb, b->SH_KEY, akey))
#else
#define SH_COMPARE_KEYS(tb, ahash, akey, b) (SH_EQUAL(tb, b->SH_KEY, akey))
#endif
/*
* Wrap the following definitions in include guards, to avoid multiple
* definition errors if this header is included more than once. The rest of
* the file deliberately has no include guards, because it can be included
* with different parameters to define functions and types with non-colliding
* names.
*/
#ifndef SIMPLEHASH_H
#define SIMPLEHASH_H
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* FIXME: can we move these to a central location? */
/* calculate ceil(log base 2) of num */
static inline uint64
sh_log2(uint64 num)
{
int i;
uint64 limit;
for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
;
return i;
}
/* calculate first power of 2 >= num */
static inline uint64
sh_pow2(uint64 num)
{
return ((uint64) 1) << sh_log2(num);
}
#ifdef FRONTEND
#define sh_error(...) pg_log_error(__VA_ARGS__)
#define sh_log(...) pg_log_info(__VA_ARGS__)
#else
#define sh_error(...) elog(ERROR, __VA_ARGS__)
#define sh_log(...) elog(LOG, __VA_ARGS__)
#endif
#endif
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/*
* Compute sizing parameters for hashtable. Called when creating and growing
* the hashtable.
*/
static inline void
SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
uint64 size;
/* supporting zero sized hashes would complicate matters */
size = Max(newsize, 2);
/* round up size to the next power of 2, that's how bucketing works */
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
size = sh_pow2(size);
Assert(size <= SH_MAX_SIZE);
/*
* Verify that allocation of ->data is possible on this platform, without
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
* overflowing Size.
*/
if ((((uint64) sizeof(SH_ELEMENT_TYPE)) * size) >= SIZE_MAX / 2)
sh_error("hash table too large");
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* now set size */
tb->size = size;
if (tb->size == SH_MAX_SIZE)
tb->sizemask = 0;
else
tb->sizemask = tb->size - 1;
/*
* Compute the next threshold at which we need to grow the hash table
* again.
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
*/
if (tb->size == SH_MAX_SIZE)
tb->grow_threshold = ((double) tb->size) * SH_MAX_FILLFACTOR;
else
tb->grow_threshold = ((double) tb->size) * SH_FILLFACTOR;
}
/* return the optimal bucket for the hash */
static inline uint32
SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
return hash & tb->sizemask;
}
/* return next bucket after the current, handling wraparound */
static inline uint32
SH_NEXT(SH_TYPE * tb, uint32 curelem, uint32 startelem)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
curelem = (curelem + 1) & tb->sizemask;
Assert(curelem != startelem);
return curelem;
}
/* return bucket before the current, handling wraparound */
static inline uint32
SH_PREV(SH_TYPE * tb, uint32 curelem, uint32 startelem)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
curelem = (curelem - 1) & tb->sizemask;
Assert(curelem != startelem);
return curelem;
}
/* return distance between bucket and its optimal position */
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
static inline uint32
SH_DISTANCE_FROM_OPTIMAL(SH_TYPE * tb, uint32 optimal, uint32 bucket)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
if (optimal <= bucket)
return bucket - optimal;
else
return (tb->size + bucket) - optimal;
}
static inline uint32
SH_ENTRY_HASH(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
#ifdef SH_STORE_HASH
return SH_GET_HASH(tb, entry);
#else
return SH_HASH_KEY(tb, entry->SH_KEY);
#endif
}
/* default memory allocator function */
static inline void *SH_ALLOCATE(SH_TYPE * type, Size size);
static inline void SH_FREE(SH_TYPE * type, void *pointer);
#ifndef SH_USE_NONDEFAULT_ALLOCATOR
/* default memory allocator function */
static inline void *
SH_ALLOCATE(SH_TYPE * type, Size size)
{
#ifdef SH_RAW_ALLOCATOR
return SH_RAW_ALLOCATOR(size);
#else
return MemoryContextAllocExtended(type->ctx, size,
MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
#endif
}
/* default memory free function */
static inline void
SH_FREE(SH_TYPE * type, void *pointer)
{
pfree(pointer);
}
#endif
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/*
* Create a hash table with enough space for `nelements` distinct members.
* Memory for the hash table is allocated from the passed-in context. If
* desired, the array of elements can be allocated using a passed-in allocator;
* this could be useful in order to place the array of elements in a shared
* memory, or in a context that will outlive the rest of the hash table.
* Memory other than for the array of elements will still be allocated from
* the passed-in context.
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
*/
#ifdef SH_RAW_ALLOCATOR
SH_SCOPE SH_TYPE *
SH_CREATE(uint32 nelements, void *private_data)
#else
2017-06-21 20:39:04 +02:00
SH_SCOPE SH_TYPE *
SH_CREATE(MemoryContext ctx, uint32 nelements, void *private_data)
#endif
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
SH_TYPE *tb;
uint64 size;
#ifdef SH_RAW_ALLOCATOR
tb = SH_RAW_ALLOCATOR(sizeof(SH_TYPE));
#else
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
tb = MemoryContextAllocZero(ctx, sizeof(SH_TYPE));
tb->ctx = ctx;
#endif
tb->private_data = private_data;
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* increase nelements by fillfactor, want to store nelements elements */
size = Min((double) SH_MAX_SIZE, ((double) nelements) / SH_FILLFACTOR);
SH_COMPUTE_PARAMETERS(tb, size);
tb->data = SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size);
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
return tb;
}
/* destroy a previously created hash table */
SH_SCOPE void
SH_DESTROY(SH_TYPE * tb)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
SH_FREE(tb, tb->data);
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
pfree(tb);
}
/* reset the contents of a previously created hash table */
SH_SCOPE void
SH_RESET(SH_TYPE * tb)
{
memset(tb->data, 0, sizeof(SH_ELEMENT_TYPE) * tb->size);
tb->members = 0;
}
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/*
* Grow a hash table to at least `newsize` buckets.
*
* Usually this will automatically be called by insertions/deletions, when
* necessary. But resizing to the exact input size can be advantageous
* performance-wise, when known at some point.
*/
SH_SCOPE void
SH_GROW(SH_TYPE * tb, uint32 newsize)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
uint64 oldsize = tb->size;
SH_ELEMENT_TYPE *olddata = tb->data;
SH_ELEMENT_TYPE *newdata;
uint32 i;
uint32 startelem = 0;
uint32 copyelem;
Assert(oldsize == sh_pow2(oldsize));
Assert(oldsize != SH_MAX_SIZE);
Assert(oldsize < newsize);
/* compute parameters for new table */
SH_COMPUTE_PARAMETERS(tb, newsize);
tb->data = SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size);
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
newdata = tb->data;
/*
* Copy entries from the old data to newdata. We theoretically could use
* SH_INSERT here, to avoid code duplication, but that's more general than
* we need. We neither want tb->members increased, nor do we need to do
* deal with deleted elements, nor do we need to compare keys. So a
* special-cased implementation is lot faster. As resizing can be time
* consuming and frequent, that's worthwhile to optimize.
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
*
* To be able to simply move entries over, we have to start not at the
* first bucket (i.e olddata[0]), but find the first bucket that's either
* empty, or is occupied by an entry at its optimal position. Such a
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
* bucket has to exist in any table with a load factor under 1, as not all
* buckets are occupied, i.e. there always has to be an empty bucket. By
* starting at such a bucket we can move the entries to the larger table,
* without having to deal with conflicts.
*/
/* search for the first element in the hash that's not wrapped around */
for (i = 0; i < oldsize; i++)
{
SH_ELEMENT_TYPE *oldentry = &olddata[i];
uint32 hash;
uint32 optimal;
if (oldentry->status != SH_STATUS_IN_USE)
{
startelem = i;
break;
}
hash = SH_ENTRY_HASH(tb, oldentry);
optimal = SH_INITIAL_BUCKET(tb, hash);
if (optimal == i)
{
startelem = i;
break;
}
}
/* and copy all elements in the old table */
copyelem = startelem;
for (i = 0; i < oldsize; i++)
{
SH_ELEMENT_TYPE *oldentry = &olddata[copyelem];
if (oldentry->status == SH_STATUS_IN_USE)
{
uint32 hash;
uint32 startelem;
uint32 curelem;
SH_ELEMENT_TYPE *newentry;
hash = SH_ENTRY_HASH(tb, oldentry);
startelem = SH_INITIAL_BUCKET(tb, hash);
curelem = startelem;
/* find empty element to put data into */
while (true)
{
newentry = &newdata[curelem];
if (newentry->status == SH_STATUS_EMPTY)
{
break;
}
curelem = SH_NEXT(tb, curelem, startelem);
}
/* copy entry to new slot */
memcpy(newentry, oldentry, sizeof(SH_ELEMENT_TYPE));
}
/* can't use SH_NEXT here, would use new size */
copyelem++;
if (copyelem >= oldsize)
{
copyelem = 0;
}
}
SH_FREE(tb, olddata);
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
}
/*
* This is a separate static inline function, so it can be reliably be inlined
* into its wrapper functions even if SH_SCOPE is extern.
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
*/
static inline SH_ELEMENT_TYPE *
SH_INSERT_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash, bool *found)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
uint32 startelem;
uint32 curelem;
SH_ELEMENT_TYPE *data;
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
uint32 insertdist;
restart:
insertdist = 0;
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/*
* We do the grow check even if the key is actually present, to avoid
* doing the check inside the loop. This also lets us avoid having to
* re-find our position in the hashtable after resizing.
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
*
* Note that this also reached when resizing the table due to
* SH_GROW_MAX_DIB / SH_GROW_MAX_MOVE.
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
*/
if (unlikely(tb->members >= tb->grow_threshold))
{
if (tb->size == SH_MAX_SIZE)
{
sh_error("hash table size exceeded");
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
}
/*
* When optimizing, it can be very useful to print these out.
*/
/* SH_STAT(tb); */
SH_GROW(tb, tb->size * 2);
/* SH_STAT(tb); */
}
/* perform insert, start bucket search at optimal location */
data = tb->data;
startelem = SH_INITIAL_BUCKET(tb, hash);
curelem = startelem;
while (true)
{
uint32 curdist;
uint32 curhash;
uint32 curoptimal;
SH_ELEMENT_TYPE *entry = &data[curelem];
/* any empty bucket can directly be used */
if (entry->status == SH_STATUS_EMPTY)
{
tb->members++;
entry->SH_KEY = key;
#ifdef SH_STORE_HASH
SH_GET_HASH(tb, entry) = hash;
#endif
entry->status = SH_STATUS_IN_USE;
*found = false;
return entry;
}
/*
* If the bucket is not empty, we either found a match (in which case
* we're done), or we have to decide whether to skip over or move the
* colliding entry. When the colliding element's distance to its
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
* optimal position is smaller than the to-be-inserted entry's, we
* shift the colliding entry (and its followers) forward by one.
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
*/
if (SH_COMPARE_KEYS(tb, hash, key, entry))
{
Assert(entry->status == SH_STATUS_IN_USE);
*found = true;
return entry;
}
curhash = SH_ENTRY_HASH(tb, entry);
curoptimal = SH_INITIAL_BUCKET(tb, curhash);
curdist = SH_DISTANCE_FROM_OPTIMAL(tb, curoptimal, curelem);
if (insertdist > curdist)
{
SH_ELEMENT_TYPE *lastentry = entry;
uint32 emptyelem = curelem;
uint32 moveelem;
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
int32 emptydist = 0;
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* find next empty bucket */
while (true)
{
SH_ELEMENT_TYPE *emptyentry;
emptyelem = SH_NEXT(tb, emptyelem, startelem);
emptyentry = &data[emptyelem];
if (emptyentry->status == SH_STATUS_EMPTY)
{
lastentry = emptyentry;
break;
}
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
/*
* To avoid negative consequences from overly imbalanced
* hashtables, grow the hashtable if collisions would require
* us to move a lot of entries. The most likely cause of such
* imbalance is filling a (currently) small table, from a
* currently big one, in hash-table order. Don't grow if the
* hashtable would be too empty, to prevent quick space
* explosion for some weird edge cases.
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
*/
if (unlikely(++emptydist > SH_GROW_MAX_MOVE) &&
((double) tb->members / tb->size) >= SH_GROW_MIN_FILLFACTOR)
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
{
tb->grow_threshold = 0;
goto restart;
}
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
}
/* shift forward, starting at last occupied element */
/*
* TODO: This could be optimized to be one memcpy in may cases,
* excepting wrapping around at the end of ->data. Hasn't shown up
* in profiles so far though.
*/
moveelem = emptyelem;
while (moveelem != curelem)
{
SH_ELEMENT_TYPE *moveentry;
moveelem = SH_PREV(tb, moveelem, startelem);
moveentry = &data[moveelem];
memcpy(lastentry, moveentry, sizeof(SH_ELEMENT_TYPE));
lastentry = moveentry;
}
/* and fill the now empty spot */
tb->members++;
entry->SH_KEY = key;
#ifdef SH_STORE_HASH
SH_GET_HASH(tb, entry) = hash;
#endif
entry->status = SH_STATUS_IN_USE;
*found = false;
return entry;
}
curelem = SH_NEXT(tb, curelem, startelem);
insertdist++;
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
/*
* To avoid negative consequences from overly imbalanced hashtables,
* grow the hashtable if collisions lead to large runs. The most
* likely cause of such imbalance is filling a (currently) small
* table, from a currently big one, in hash-table order. Don't grow
* if the hashtable would be too empty, to prevent quick space
* explosion for some weird edge cases.
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
*/
if (unlikely(insertdist > SH_GROW_MAX_DIB) &&
((double) tb->members / tb->size) >= SH_GROW_MIN_FILLFACTOR)
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
{
tb->grow_threshold = 0;
goto restart;
}
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
}
}
/*
* Insert the key key into the hash-table, set *found to true if the key
* already exists, false otherwise. Returns the hash-table entry in either
* case.
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
*/
2017-06-21 20:39:04 +02:00
SH_SCOPE SH_ELEMENT_TYPE *
SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found)
{
uint32 hash = SH_HASH_KEY(tb, key);
return SH_INSERT_HASH_INTERNAL(tb, key, hash, found);
}
/*
* Insert the key key into the hash-table using an already-calculated
* hash. Set *found to true if the key already exists, false
* otherwise. Returns the hash-table entry in either case.
*/
SH_SCOPE SH_ELEMENT_TYPE *
SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash, bool *found)
{
return SH_INSERT_HASH_INTERNAL(tb, key, hash, found);
}
/*
* This is a separate static inline function, so it can be reliably be inlined
* into its wrapper functions even if SH_SCOPE is extern.
*/
static inline SH_ELEMENT_TYPE *
SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
const uint32 startelem = SH_INITIAL_BUCKET(tb, hash);
uint32 curelem = startelem;
while (true)
{
SH_ELEMENT_TYPE *entry = &tb->data[curelem];
if (entry->status == SH_STATUS_EMPTY)
{
return NULL;
}
Assert(entry->status == SH_STATUS_IN_USE);
if (SH_COMPARE_KEYS(tb, hash, key, entry))
return entry;
/*
* TODO: we could stop search based on distance. If the current
* buckets's distance-from-optimal is smaller than what we've skipped
* already, the entry doesn't exist. Probably only do so if
* SH_STORE_HASH is defined, to avoid re-computing hashes?
*/
curelem = SH_NEXT(tb, curelem, startelem);
}
}
/*
* Lookup up entry in hash table. Returns NULL if key not present.
*/
SH_SCOPE SH_ELEMENT_TYPE *
SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key)
{
uint32 hash = SH_HASH_KEY(tb, key);
return SH_LOOKUP_HASH_INTERNAL(tb, key, hash);
}
/*
* Lookup up entry in hash table using an already-calculated hash.
*
* Returns NULL if key not present.
*/
SH_SCOPE SH_ELEMENT_TYPE *
SH_LOOKUP_HASH(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
{
return SH_LOOKUP_HASH_INTERNAL(tb, key, hash);
}
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/*
* Delete entry from hash table. Returns whether to-be-deleted key was
* present.
*/
SH_SCOPE bool
SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
uint32 hash = SH_HASH_KEY(tb, key);
uint32 startelem = SH_INITIAL_BUCKET(tb, hash);
uint32 curelem = startelem;
while (true)
{
SH_ELEMENT_TYPE *entry = &tb->data[curelem];
if (entry->status == SH_STATUS_EMPTY)
return false;
if (entry->status == SH_STATUS_IN_USE &&
SH_COMPARE_KEYS(tb, hash, key, entry))
{
SH_ELEMENT_TYPE *lastentry = entry;
tb->members--;
/*
* Backward shift following elements till either an empty element
* or an element at its optimal position is encountered.
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
*
* While that sounds expensive, the average chain length is short,
* and deletions would otherwise require tombstones.
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
*/
while (true)
{
SH_ELEMENT_TYPE *curentry;
uint32 curhash;
uint32 curoptimal;
curelem = SH_NEXT(tb, curelem, startelem);
curentry = &tb->data[curelem];
if (curentry->status != SH_STATUS_IN_USE)
{
lastentry->status = SH_STATUS_EMPTY;
break;
}
curhash = SH_ENTRY_HASH(tb, curentry);
curoptimal = SH_INITIAL_BUCKET(tb, curhash);
/* current is at optimal position, done */
if (curoptimal == curelem)
{
lastentry->status = SH_STATUS_EMPTY;
break;
}
/* shift */
memcpy(lastentry, curentry, sizeof(SH_ELEMENT_TYPE));
lastentry = curentry;
}
return true;
}
/* TODO: return false; if distance too big */
curelem = SH_NEXT(tb, curelem, startelem);
}
}
/*
* Initialize iterator.
*/
SH_SCOPE void
SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
int i;
uint64 startelem = PG_UINT64_MAX;
/*
* Search for the first empty element. As deletions during iterations are
* supported, we want to start/end at an element that cannot be affected
* by elements being shifted.
*/
for (i = 0; i < tb->size; i++)
{
SH_ELEMENT_TYPE *entry = &tb->data[i];
if (entry->status != SH_STATUS_IN_USE)
{
startelem = i;
break;
}
}
Assert(startelem < SH_MAX_SIZE);
/*
* Iterate backwards, that allows the current element to be deleted, even
* if there are backward shifts
*/
iter->cur = startelem;
iter->end = iter->cur;
iter->done = false;
}
/*
* Initialize iterator to a specific bucket. That's really only useful for
* cases where callers are partially iterating over the hashspace, and that
* iteration deletes and inserts elements based on visited entries. Doing that
* repeatedly could lead to an unbalanced keyspace when always starting at the
* same position.
*/
SH_SCOPE void
SH_START_ITERATE_AT(SH_TYPE * tb, SH_ITERATOR * iter, uint32 at)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
/*
* Iterate backwards, that allows the current element to be deleted, even
* if there are backward shifts.
*/
Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
iter->cur = at & tb->sizemask; /* ensure at is within a valid range */
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
iter->end = iter->cur;
iter->done = false;
}
/*
* Iterate over all entries in the hash-table. Return the next occupied entry,
* or NULL if done.
*
* During iteration the current entry in the hash table may be deleted,
* without leading to elements being skipped or returned twice. Additionally
* the rest of the table may be modified (i.e. there can be insertions or
* deletions), but if so, there's neither a guarantee that all nodes are
* visited at least once, nor a guarantee that a node is visited at most once.
*/
2017-06-21 20:39:04 +02:00
SH_SCOPE SH_ELEMENT_TYPE *
SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
while (!iter->done)
{
SH_ELEMENT_TYPE *elem;
elem = &tb->data[iter->cur];
/* next element in backward direction */
iter->cur = (iter->cur - 1) & tb->sizemask;
if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask))
iter->done = true;
if (elem->status == SH_STATUS_IN_USE)
{
return elem;
}
}
return NULL;
}
/*
* Report some statistics about the state of the hashtable. For
* debugging/profiling purposes only.
*/
SH_SCOPE void
SH_STAT(SH_TYPE * tb)
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
{
uint32 max_chain_length = 0;
uint32 total_chain_length = 0;
double avg_chain_length;
double fillfactor;
uint32 i;
uint32 *collisions = palloc0(tb->size * sizeof(uint32));
uint32 total_collisions = 0;
uint32 max_collisions = 0;
double avg_collisions;
for (i = 0; i < tb->size; i++)
{
uint32 hash;
uint32 optimal;
uint32 dist;
SH_ELEMENT_TYPE *elem;
elem = &tb->data[i];
if (elem->status != SH_STATUS_IN_USE)
continue;
hash = SH_ENTRY_HASH(tb, elem);
optimal = SH_INITIAL_BUCKET(tb, hash);
dist = SH_DISTANCE_FROM_OPTIMAL(tb, optimal, i);
if (dist > max_chain_length)
max_chain_length = dist;
total_chain_length += dist;
collisions[optimal]++;
}
for (i = 0; i < tb->size; i++)
{
uint32 curcoll = collisions[i];
if (curcoll == 0)
continue;
/* single contained element is not a collision */
curcoll--;
total_collisions += curcoll;
if (curcoll > max_collisions)
max_collisions = curcoll;
}
if (tb->members > 0)
{
fillfactor = tb->members / ((double) tb->size);
avg_chain_length = ((double) total_chain_length) / tb->members;
avg_collisions = ((double) total_collisions) / tb->members;
}
else
{
fillfactor = 0;
avg_chain_length = 0;
avg_collisions = 0;
}
sh_log("size: " UINT64_FORMAT ", members: %u, filled: %f, total chain: %u, max chain: %u, avg chain: %f, total_collisions: %u, max_collisions: %i, avg_collisions: %f",
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
tb->size, tb->members, fillfactor, total_chain_length, max_chain_length, avg_chain_length,
total_collisions, max_collisions, avg_collisions);
}
Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
#endif /* SH_DEFINE */
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* undefine external parameters, so next hash table can be defined */
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#undef SH_PREFIX
#undef SH_KEY_TYPE
#undef SH_KEY
#undef SH_ELEMENT_TYPE
#undef SH_HASH_KEY
#undef SH_SCOPE
#undef SH_DECLARE
#undef SH_DEFINE
#undef SH_GET_HASH
#undef SH_STORE_HASH
#undef SH_USE_NONDEFAULT_ALLOCATOR
#undef SH_EQUAL
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* undefine locally declared macros */
#undef SH_MAKE_PREFIX
#undef SH_MAKE_NAME
#undef SH_MAKE_NAME_
#undef SH_FILLFACTOR
#undef SH_MAX_FILLFACTOR
Make simplehash.h grow hashtable in additional cases. Increase the size when either the distance between actual and optimal slot grows too large, or when too many subsequent entries would have to be moved. This addresses reports that the simplehash performed, sometimes considerably, worse than dynahash. The reason turned out to be that insertions into the hashtable where, due to the use of parallel query, in effect done from another hashtable, in hash-value order. If the target hashtable, due to mis-estimation, was sized a lot smaller than the source table(s) that lead to very imbalanced tables; a lot of entries in many close-by buckets from the source tables were inserted into a single, wider, bucket on the target table. As the growth factor was solely computed based on the fillfactor, the performance of the table decreased further and further. b81b5a96f424531b was an attempt to address this problem for hash aggregates (but not for bitmap scans), but it turns out that the current method of mixing hash values often actually leaves neighboring hash-values close to each other, just in different value range. It might be worth revisiting that independently of the performance issues addressed in this patch.. To address that problem resize tables in two additional cases: Firstly when the optimal position for an entry would be far from the actual position, secondly when many entries would have to be moved to make space for the new entry (while satisfying the robin hood property). Due to the additional resizing threshold it seems possible, and testing confirms that so far, that a higher fillfactor doesn't hurt performance and saves a bit of memory. It seems better to increase it now, before a release containing any of this code, rather than wonder in some later release. The various boundaries aren't determined in a particularly scientific manner, they might need some fine-tuning. In all my tests the new code now, even with parallelism, performs at least as good as the old code, in several scenarios significantly better. Reported-By: Dilip Kumar, Robert Haas, Kuntal Ghosh Discussion: https://postgr.es/m/CAFiTN-vagvuAydKG9VnWcoK=ADAhxmOa4ZTrmNsViBBooTnriQ@mail.gmail.com https://postgr.es/m/CAGz5QC+=fNTYgzMLTBUNeKt6uaWZFXJbkB5+7oWm-n9DwVxcLA@mail.gmail.com
2016-11-23 09:23:42 +01:00
#undef SH_GROW_MAX_DIB
#undef SH_GROW_MAX_MOVE
#undef SH_GROW_MIN_FILLFACTOR
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#undef SH_MAX_SIZE
/* types */
#undef SH_TYPE
#undef SH_STATUS
#undef SH_STATUS_EMPTY
#undef SH_STATUS_IN_USE
#undef SH_ITERATOR
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
/* external function names */
#undef SH_CREATE
#undef SH_DESTROY
#undef SH_RESET
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#undef SH_INSERT
#undef SH_INSERT_HASH
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#undef SH_DELETE
#undef SH_LOOKUP
#undef SH_LOOKUP_HASH
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#undef SH_GROW
#undef SH_START_ITERATE
#undef SH_START_ITERATE_AT
#undef SH_ITERATE
#undef SH_ALLOCATE
#undef SH_FREE
Add a macro templatized hashtable. dynahash.c hash tables aren't quite fast enough for some use-cases. There are several reasons for lacking performance: - the use of chaining for collision handling makes them cache inefficient, that's especially an issue when the tables get bigger. - as the element sizes for dynahash are only determined at runtime, offset computations are somewhat expensive - hash and element comparisons are indirect function calls, causing unnecessary pipeline stalls - it's two level structure has some benefits (somewhat natural partitioning), but increases the number of indirections to fix several of these the hash tables have to be adjusted to the individual use-case at compile-time. C unfortunately doesn't provide a good way to do compile code generation (like e.g. c++'s templates for all their weaknesses do). Thus the somewhat ugly approach taken here is to allow for code generation using a macro-templatized header file, which generates functions and types based on a prefix and other parameters. Later patches use this infrastructure to use such hash tables for tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In queries where these use up a large fraction of the time, this has been measured to lead to performance improvements of over 100%. There are other cases where this could be useful (e.g. catcache.c). The hash table design chosen is a variant of linear open-addressing. The biggest disadvantage of simple linear addressing schemes are highly variable lookup times due to clustering, and deletions leaving a lot of tombstones around. To address these issues a variant of "robin hood" hashing is employed. Robin hood hashing optimizes chaining lengths by moving elements close to their optimal bucket ("rich" elements), out of the way if a to-be-inserted element is further away from its optimal position (i.e. it's "poor"). While that can make insertions slower, the average lookup performance is a lot better, and higher fill factors can be used in a still performant manner. To avoid tombstones - which normally solve the issue that a deleted node's presence is relevant to determine whether a lookup needs to continue looking or is done - buckets following a deleted element are shifted backwards, unless they're empty or already at their optimal position. There's further possible improvements that can be made to this implementation. Amongst others: - Use distance as a termination criteria during searches. This is generally a good idea, but I've been able to see the overhead of distance calculations in some cases. - Consider combining the 'empty' status into the hashvalue, and enforce storing the hashvalue. That could, in some cases, increase memory density and remove a few instructions. - Experiment further with the, very conservatively choosen, fillfactor. - Make maximum size of hashtable configurable, to allow storing very very large tables. That'd require 64bit hash values to be more common than now, though. - some smaller memcpy calls could be optimized to copy larger chunks But since the new implementation is already considerably faster than dynahash it seem sensible to start using it. Reviewed-By: Tomas Vondra Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-15 01:05:30 +02:00
#undef SH_STAT
/* internal function names */
#undef SH_COMPUTE_PARAMETERS
#undef SH_COMPARE_KEYS
#undef SH_INITIAL_BUCKET
#undef SH_NEXT
#undef SH_PREV
#undef SH_DISTANCE_FROM_OPTIMAL
#undef SH_ENTRY_HASH
#undef SH_INSERT_HASH_INTERNAL
#undef SH_LOOKUP_HASH_INTERNAL