Improve speed of hash index build.

In the initial data sort, if the bucket numbers are the same, then
sort next on the hash value.  Because index pages are kept in
hash value order, this gains a little speed by allowing the
eventual tuple insertions to be done sequentially, avoiding repeated
data movement within PageAddItem.  This seems to be good for an
overall speedup of 5%-9%, depending on the incoming data.
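
The two-level ordering described above can be sketched outside the
backend as a plain qsort comparator. This is only an illustration, not
the tuplesort code: DemoMasks, DemoEntry, demo_bucket, demo_compare and
demo_sort are hypothetical names, the mask values are made up, and
demo_bucket is a simplified stand-in for _hash_hashkey2bucket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the masks kept in the spool; values are illustrative. */
typedef struct DemoMasks
{
    uint32_t high_mask;
    uint32_t low_mask;
    uint32_t max_bucket;
} DemoMasks;

static DemoMasks demo_masks = {7, 3, 5};

/* Hypothetical entry: a hash key plus some payload. */
typedef struct DemoEntry
{
    uint32_t hash;
    int      payload;
} DemoEntry;

/* Simplified stand-in for _hash_hashkey2bucket (assumed behaviour). */
static uint32_t
demo_bucket(uint32_t hash)
{
    uint32_t bucket = hash & demo_masks.high_mask;

    if (bucket > demo_masks.max_bucket)
        bucket &= demo_masks.low_mask;
    return bucket;
}

/* Compare by bucket number first, then by the full hash value. */
static int
demo_compare(const void *pa, const void *pb)
{
    const DemoEntry *a = pa;
    const DemoEntry *b = pb;
    uint32_t bucket1 = demo_bucket(a->hash);
    uint32_t bucket2 = demo_bucket(b->hash);

    if (bucket1 != bucket2)
        return (bucket1 > bucket2) ? 1 : -1;
    if (a->hash != b->hash)
        return (a->hash > b->hash) ? 1 : -1;
    return 0;               /* the real comparator goes on to compare ItemPointers */
}

/* Sort entries so that each bucket's entries arrive in ascending hash order. */
static void
demo_sort(DemoEntry *entries, size_t n)
{
    assert(entries != NULL || n == 0);
    qsort(entries, n, sizeof(DemoEntry), demo_compare);
}
```

With these illustrative masks, the hash values 14, 9, 6, 1, 17 sort to
1, 9, 17 (bucket 1) followed by 6, 14 (bucket 2), so each bucket would
receive its entries already in hash order.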

Simon Riggs, reviewed by Amit Kapila

Discussion: https://postgr.es/m/CANbhV-FG-1ZNMBuwhUF7AxxJz3u5137dYL-o6hchK1V_dMw86g@mail.gmail.com
Tom Lane 2022-07-28 14:34:32 -04:00
parent 70a437aa45
commit e09d7a1262
2 changed files with 21 additions and 5 deletions

@@ -42,9 +42,10 @@ struct HSpool
Relation index;
/*
- * We sort the hash keys based on the buckets they belong to. Below masks
- * are used in _hash_hashkey2bucket to determine the bucket of given hash
- * key.
+ * We sort the hash keys based on the buckets they belong to, then by the
+ * hash values themselves, to optimize insertions onto hash pages. The
+ * masks below are used in _hash_hashkey2bucket to determine the bucket of
+ * a given hash key.
*/
uint32 high_mask;
uint32 low_mask;

@@ -1387,14 +1387,17 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
{
Bucket bucket1;
Bucket bucket2;
+ uint32 hash1;
+ uint32 hash2;
IndexTuple tuple1;
IndexTuple tuple2;
TuplesortPublic *base = TuplesortstateGetPublic(state);
TuplesortIndexHashArg *arg = (TuplesortIndexHashArg *) base->arg;
/*
- * Fetch hash keys and mask off bits we don't want to sort by. We know
- * that the first column of the index tuple is the hash key.
+ * Fetch hash keys and mask off bits we don't want to sort by, so that the
+ * initial sort is just on the bucket number. We know that the first
+ * column of the index tuple is the hash key.
*/
Assert(!a->isnull1);
bucket1 = _hash_hashkey2bucket(DatumGetUInt32(a->datum1),
@@ -1409,6 +1412,18 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
else if (bucket1 < bucket2)
return -1;
+ /*
+ * If bucket values are equal, sort by hash values. This allows us to
+ * insert directly onto bucket/overflow pages, where the index tuples are
+ * stored in hash order to allow fast binary search within each page.
+ */
+ hash1 = DatumGetUInt32(a->datum1);
+ hash2 = DatumGetUInt32(b->datum1);
+ if (hash1 > hash2)
+ return 1;
+ else if (hash1 < hash2)
+ return -1;
/*
* If hash values are equal, we sort on ItemPointer. This does not affect
* validity of the finished index, but it may be useful to have index