Improve hash index bucket split behavior.

Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency.  Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which they would have to acquire anyway, and which is also much
cheaper.
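
For illustration only, here is the shape of the change in terms of the generic
lmgr and buffer-manager primitives (this is not code from the patch itself,
and the variable names are made up):

    /* Before: every scan and every split took a heavyweight page lock. */
    LockPage(rel, bucket_blkno, ShareLock);        /* scan  */
    LockPage(rel, bucket_blkno, ExclusiveLock);    /* split */

    /* After: a scan just keeps the pin it already needs on the primary
     * bucket page, plus a short-term content lock while reading it. */
    Buffer      buf = ReadBuffer(rel, bucket_blkno);
    LockBuffer(buf, BUFFER_LOCK_SHARE);

    /* A split instead needs a cleanup lock: an exclusive content lock
     * taken while we hold the only pin on the page. */
    if (ConditionalLockBufferForCleanup(buf))
    {
        /* we now have the right to begin a split of this bucket */
    }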

In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel.  There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.

This patch also removes the unworldly assumption that a split will
never be interrupted.  With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion.  While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.

Amit Kapila.  I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's.  Also
reviewed by Jesper Pedersen, Jeff Janes, and others.

Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
Robert Haas, 2016-11-30 15:39:21 -05:00
commit 6d46f4783e (parent 213c0f2d78)
12 changed files with 1361 additions and 622 deletions

src/backend/access/hash/Makefile

@ -12,7 +12,7 @@ subdir = src/backend/access/hash
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
hashsearch.o hashsort.o hashutil.o hashvalidate.o
OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
hashsort.o hashutil.o hashvalidate.o
include $(top_srcdir)/src/backend/common.mk

src/backend/access/hash/README

@ -126,53 +126,54 @@ the initially created buckets.
Lock Definitions
----------------
We use both lmgr locks ("heavyweight" locks) and buffer context locks
(LWLocks) to control access to a hash index. lmgr locks are needed for
long-term locking since there is a (small) risk of deadlock, which we must
be able to detect. Buffer context locks are used for short-term access
control to individual pages of the index.
Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks. Here as elsewhere in PostgreSQL,
cleanup lock means that we hold an exclusive lock on the buffer and have
observed at some point after acquiring the lock that we hold the only pin
on that buffer. For hash indexes, a cleanup lock on a primary bucket page
represents the right to perform an arbitrary reorganization of the entire
bucket. Therefore, scans retain a pin on the primary bucket page for the
bucket they are currently scanning. Splitting a bucket requires a cleanup
lock on both the old and new primary bucket pages. VACUUM therefore takes
a cleanup lock on every bucket page in order to remove tuples. It can also
remove tuples copied to a new bucket by any previous split operation, because
the cleanup lock taken on the primary bucket page guarantees that no scans
which started prior to the most recent split can still be in progress. After
cleaning each page individually, it attempts to take a cleanup lock on the
primary bucket page in order to "squeeze" the bucket down to the minimum
possible number of pages.
LockPage(rel, page), where page is the page number of a hash bucket page,
represents the right to split or compact an individual bucket. A process
splitting a bucket must exclusive-lock both old and new halves of the
bucket until it is done. A process doing VACUUM must exclusive-lock the
bucket it is currently purging tuples from. Processes doing scans or
insertions must share-lock the bucket they are scanning or inserting into.
(It is okay to allow concurrent scans and insertions.)
The lmgr lock IDs corresponding to overflow pages are currently unused.
These are available for possible future refinements. LockPage(rel, 0)
is also currently undefined (it was previously used to represent the right
to modify the hash-code-to-bucket mapping, but it is no longer needed for
that purpose).
Note that these lock definitions are conceptually distinct from any sort
of lock on the pages whose numbers they share. A process must also obtain
read or write buffer lock on the metapage or bucket page before accessing
said page.
Processes performing hash index scans must hold share lock on the bucket
they are scanning throughout the scan. This seems to be essential, since
there is no reasonable way for a scan to cope with its bucket being split
underneath it. This creates a possibility of deadlock external to the
hash index code, since a process holding one of these locks could block
waiting for an unrelated lock held by another process. If that process
then does something that requires exclusive lock on the bucket, we have
deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
can be detected and recovered from.
Processes must obtain read (share) buffer context lock on any hash index
page while reading it, and write (exclusive) lock while modifying it.
To prevent deadlock we enforce these coding rules: no buffer lock may be
held long term (across index AM calls), nor may any buffer lock be held
while waiting for an lmgr lock, nor may more than one buffer lock
be held at a time by any one process. (The third restriction is probably
stronger than necessary, but it makes the proof of no deadlock obvious.)
To avoid deadlocks, we must be consistent about the lock order in which we
lock the buckets for operations that require locks on two different buckets.
We choose to always lock the lower-numbered bucket first. The metapage is
only ever locked after all bucket locks have been taken.
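For example, code that must lock two bucket primary pages might look like this
(illustrative only; the buffers and bucket numbers are assumed to be already
available):

    /* lock bucket primary pages in increasing bucket-number order ... */
    Buffer      first = (bucket_a < bucket_b) ? bufa : bufb;
    Buffer      second = (bucket_a < bucket_b) ? bufb : bufa;

    LockBuffer(first, BUFFER_LOCK_EXCLUSIVE);
    LockBuffer(second, BUFFER_LOCK_EXCLUSIVE);

    /* ... and only lock the metapage after all bucket locks are taken */
    LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);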
Pseudocode Algorithms
---------------------
The various flags that are used in hash index operations are described below:
The bucket-being-split and bucket-being-populated flags indicate that a split
operation is in progress for a bucket. During a split, the bucket-being-split
flag is set on the old bucket and the bucket-being-populated flag is set on
the new bucket. These flags are cleared once the split operation is finished.
The split-cleanup flag indicates that a bucket which has been recently split
still contains tuples that were also copied to the new bucket; it essentially
marks the split as incomplete. Once we're certain that no scans which
started before the new bucket was fully populated are still in progress, we
can remove the copies from the old bucket and clear the flag. We insist that
this flag must be clear before splitting a bucket; thus, a bucket can't be
split again until the previous split is totally complete.
The moved-by-split flag on a tuple indicates that the tuple was moved from the
old to the new bucket. Concurrent scans skip such tuples until the split
operation is finished. Once a tuple is marked as moved-by-split, it remains so
forever, but that does no harm. We intentionally do not clear the flag, since
doing so would generate additional I/O that is not necessary.
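As a rough sketch, the page-level flags are kept in the bucket page's special
space and tested with macros that appear later in this patch (the snippet
below is illustrative, not taken verbatim from the code):

    HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

    if (H_BUCKET_BEING_SPLIT(opaque))
    {
        /* this is the old bucket of an in-progress split */
    }
    if (H_NEEDS_SPLIT_CLEANUP(opaque))
    {
        /* copies left behind by a finished split still need removal */
    }

    /* clearing the split-cleanup flag, as hashbucketcleanup does */
    opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;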
The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
@ -193,38 +194,48 @@ The reader algorithm is:
release meta page buffer content lock
if (correct bucket page is already locked)
break
release any existing bucket page lock (if a concurrent split happened)
take heavyweight bucket lock
release any existing bucket page buffer content lock (if a concurrent
split happened)
take the buffer content lock on bucket page in shared mode
retake meta page buffer content lock in shared mode
-- then, per read request:
release pin on metapage
read current page of bucket and take shared buffer content lock
step to next page if necessary (no chaining of locks)
if the target bucket is still being populated by a split:
release the buffer content lock on current bucket page
pin and acquire the buffer content lock on old bucket in shared mode
release the buffer content lock on old bucket, but not pin
retake the buffer content lock on new bucket
arrange to scan the old bucket normally and the new bucket for
tuples which are not moved-by-split
-- then, per read request:
reacquire content lock on current page
step to next page if necessary (no chaining of content locks, but keep
the pin on the primary bucket throughout the scan; we also maintain
a pin on the page currently being scanned)
get tuple
release buffer content lock and pin on current page
release content lock
-- at scan shutdown:
release bucket share-lock
release all pins still held
We can't hold the metapage lock while acquiring a lock on the target bucket,
because that might result in an undetected deadlock (lwlocks do not participate
in deadlock detection). Instead, we relock the metapage after acquiring the
bucket page lock and check whether the bucket has been split. If not, we're
done. If so, we release our previously-acquired lock and repeat the process
using the new bucket number. Holding the bucket sharelock for
the remainder of the scan prevents the reader's current-tuple pointer from
being invalidated by splits or compactions. Notice that the reader's lock
does not prevent other buckets from being split or compacted.
Holding the buffer pin on the primary bucket page for the whole scan prevents
the reader's current-tuple pointer from being invalidated by splits or
compactions. (Of course, other buckets can still be split or compacted.)
To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
their current scan position after re-acquiring the page sharelock. Since
deletion is not possible while a reader holds the bucket sharelock, and
we assume that heap tuple TIDs are unique, this can be implemented by
their current scan position after re-acquiring the buffer content lock on the
page. Since deletion is not possible while a reader holds a pin on the bucket,
and we assume that heap tuple TIDs are unique, this can be implemented by
searching for the same heap tuple TID previously returned. Insertion does
not move index entries across pages, so the previously-returned index entry
should always be on the same page, at the same or higher offset number,
as it was before.
To allow for scans during a bucket split, if at the start of the scan the
bucket is marked as bucket-being-populated, the scan visits all the tuples in
that bucket except those marked as moved-by-split. Once it finishes scanning
all the tuples in the current bucket, it scans the old bucket from which this
bucket was formed by the split.
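In code form, the scan tracks this with the scan-state fields added by this
patch (hashso_buc_populated and hashso_buc_split); the sketch below is
illustrative, and the moved-by-split test is written as a hypothetical helper
rather than the actual macro:

    /* inside the scan's per-tuple loop */
    if (so->hashso_buc_populated && !so->hashso_buc_split &&
        tuple_is_moved_by_split(itup))          /* hypothetical helper */
        continue;                               /* skip copied tuples */

    /* once the new bucket is exhausted, switch to the old bucket and
     * return only tuples whose hash value maps to the new bucket */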
The insertion algorithm is rather similar:
pin meta page and take buffer content lock in shared mode
@ -233,18 +244,29 @@ The insertion algorithm is rather similar:
release meta page buffer content lock
if (correct bucket page is already locked)
break
release any existing bucket page lock (if a concurrent split happened)
take heavyweight bucket lock in shared mode
release any existing bucket page buffer content lock (if a concurrent
split happened)
take the buffer content lock on bucket page in exclusive mode
retake meta page buffer content lock in shared mode
-- (so far same as reader)
release pin on metapage
pin current page of bucket and take exclusive buffer content lock
if full, release, read/exclusive-lock next page; repeat as needed
-- (so far same as reader, except for acquisition of buffer content lock in
exclusive mode on primary bucket page)
if the bucket-being-split flag is set for a bucket and pin count on it is
one, then finish the split
release the buffer content lock on current bucket
get the "new" bucket which was being populated by the split
scan the new bucket and form the hash table of TIDs
conditionally get the cleanup lock on old and new buckets
if we get the lock on both buckets
finish the split using the split algorithm described below
release the pin on the old bucket and restart the insert from the beginning.
if current page is full, release lock but not pin, read/exclusive-lock
next page; repeat as needed
>> see below if no space in any page of bucket
insert tuple at appropriate place in page
mark current page dirty and release buffer content lock and pin
release heavyweight share-lock
pin meta page and take buffer content lock in shared mode
if the current page is not a bucket page, release the pin on bucket page
pin meta page and take buffer content lock in exclusive mode
increment tuple count, decide if split needed
mark meta page dirty and release buffer content lock and pin
done if no split needed, else enter Split algorithm below
@ -256,11 +278,13 @@ bucket that is being actively scanned, because readers can cope with this
as explained above. We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.
It is clearly impossible for readers and inserters to deadlock, and in
fact this algorithm allows them a very high degree of concurrency.
(The exclusive metapage lock taken to update the tuple count is stronger
than necessary, since readers do not care about the tuple count, but the
lock is held for such a short time that this is probably not an issue.)
To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take the locks in the order suggested in
Lock Definitions above. This algorithm allows them a very high degree of
concurrency. (The exclusive metapage lock taken to update the tuple count
is stronger than necessary, since readers do not care about the tuple count,
but the lock is held for such a short time that this is probably not an
issue.)
When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
@ -271,46 +295,45 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:
pin meta page and take buffer content lock in exclusive mode
check split still needed
if split not needed anymore, drop buffer content lock and pin and exit
decide which bucket to split
Attempt to X-lock old bucket number (definitely could fail)
Attempt to X-lock new bucket number (shouldn't fail, but...)
if above fail, drop locks and pin and exit
update meta page to reflect new number of buckets
mark meta page dirty and release buffer content lock and pin
-- now, accesses to all other buckets can proceed.
Perform actual split of bucket, moving tuples as needed
>> see below about acquiring needed extra space
Release X-locks of old and new buckets
pin meta page and take buffer content lock in exclusive mode
check split still needed
if split not needed anymore, drop buffer content lock and pin and exit
decide which bucket to split
try to take a cleanup lock on that bucket; if fail, give up
if that bucket is still being split or has split-cleanup work:
try to finish the split and the cleanup work
if that succeeds, start over; if it fails, give up
mark the old and new buckets indicating split is in progress
copy the tuples that belong to the new bucket from the old bucket, marking
them as moved-by-split
release lock but not pin for primary bucket page of old bucket,
read/shared-lock next page; repeat as needed
clear the bucket-being-split and bucket-being-populated flags
mark the old bucket indicating split-cleanup
Note the metapage lock is not held while the actual tuple rearrangement is
performed, so accesses to other buckets can proceed in parallel; in fact,
it's possible for multiple bucket splits to proceed in parallel.
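In the code, the conditional cleanup-lock attempt on the old bucket is made
with a helper added later in this patch; abridged from the _hash_expandtable
changes in this commit:

    buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno,
                                                    LH_BUCKET_PAGE);
    if (!buf_oblkno)
        goto fail;          /* someone else holds a lock or pin; give up */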
Split's attempt to X-lock the old bucket number could fail if another
process holds S-lock on it. We do not want to wait if that happens, first
because we don't want to wait while holding the metapage exclusive-lock,
and second because it could very easily result in deadlock. (The other
process might be out of the hash AM altogether, and could do something
that blocks on another lock this process holds; so even if the hash
algorithm itself is deadlock-free, a user-induced deadlock could occur.)
So, this is a conditional LockAcquire operation, and if it fails we just
abandon the attempt to split. This is all right since the index is
overfull but perfectly functional. Every subsequent inserter will try to
split, and eventually one will succeed. If multiple inserters failed to
split, the index might still be overfull, but eventually, the index will
The split operation's attempt to acquire a cleanup lock on the old bucket's
primary page could fail if another process holds any lock or pin on it. We do
not want to wait if that happens, because we don't want to wait while holding
the metapage exclusive-lock. So, this is a conditional lock acquisition, and if
it fails we just abandon the attempt to split. This is all right since the
index is overfull but perfectly functional. Every subsequent inserter will
try to split, and eventually one will succeed. If multiple inserters failed
to split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop. (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)
A problem is that if a split fails partway through (eg due to insufficient
disk space) the index is left corrupt. The probability of that could be
made quite low if we grab a free page or two before we update the meta
page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action. I'm not planning to teach hash about WAL in this
go-round.
If a split fails partway through (e.g. due to insufficient disk space or an
interrupt), the index will not be corrupted. Instead, we'll retry the split
every time a tuple is inserted into the old bucket prior to inserting the new
tuple; eventually, we should succeed. The fact that a split is left
unfinished doesn't prevent subsequent buckets from being split, but we won't
try to split the bucket again until the prior split is finished. In other
words, a bucket can be in the middle of being split for some time, but it can't
be in the middle of two splits at the same time.
Although we can survive a failure to split a bucket, a crash is likely to
corrupt the index, since hash indexes are not yet WAL-logged.
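The retry lives in the insertion path; its shape, lightly abridged from the
_hash_doinsert changes later in this commit, is:

    if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
    {
        /* release the content lock on the bucket, finish the split, retry */
        _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
        _hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
                           maxbucket, highmask, lowmask);
        _hash_dropbuf(rel, buf);
        _hash_dropbuf(rel, metabuf);
        goto restart_insert;
    }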
The fourth operation is garbage collection (bulk deletion):
@ -319,9 +342,17 @@ The fourth operation is garbage collection (bulk deletion):
fetch current max bucket number
release meta page buffer content lock and pin
while next bucket <= max bucket do
Acquire X lock on target bucket
Scan and remove tuples, compact free space as needed
Release X lock
acquire cleanup lock on primary bucket page
loop:
scan and remove tuples
if this is the last bucket page, break out of loop
pin and x-lock next page
release prior lock and pin (except keep pin on primary bucket page)
if the page we have locked is not the primary bucket page:
release lock and take exclusive lock on primary bucket page
if there are no other pins on the primary bucket page:
squeeze the bucket to remove free space
release the pin on primary bucket page
next bucket ++
end loop
pin metapage and take buffer content lock in exclusive mode
@ -330,20 +361,24 @@ The fourth operation is garbage collection (bulk deletion):
else update metapage tuple count
mark meta page dirty and release buffer content lock and pin
Note that this is designed to allow concurrent splits. If a split occurs,
tuples relocated into the new bucket will be visited twice by the scan,
but that does no harm. (We must however be careful about the statistics
Note that this is designed to allow concurrent splits and scans. If a split
occurs, tuples relocated into the new bucket will be visited twice by the
scan, but that does no harm. Because we release the lock on the bucket page
during the cleanup scan of a bucket, a concurrent scan may start on that
bucket, but it is guaranteed to stay behind cleanup. It is essential to keep
scans behind cleanup, else vacuum could decrease the TIDs that are required to
complete the scan. Since a scan that returns multiple tuples from the same
bucket page always expects the next valid TID to be greater than or equal to
the current TID, it could then miss tuples. This holds true for backward scans
as well (backward scans first traverse each bucket starting from the first
bucket page to the last overflow page in the chain). We must be careful about
the statistics
reported by the VACUUM operation. What we can do is count the number of
tuples scanned, and believe this in preference to the stored tuple count
if the stored tuple count and number of buckets did *not* change at any
time during the scan. This provides a way of correcting the stored tuple
count if it gets out of sync for some reason. But if a split or insertion
does occur concurrently, the scan count is untrustworthy; instead,
subtract the number of tuples deleted from the stored tuple count and
use that.)
The exclusive lock request could deadlock in some strange scenarios, but
we can just error out without any great harm being done.
tuples scanned, and believe this in preference to the stored tuple count if
the stored tuple count and number of buckets did *not* change at any time
during the scan. This provides a way of correcting the stored tuple count if
it gets out of sync for some reason. But if a split or insertion does occur
concurrently, the scan count is untrustworthy; instead, subtract the number of
tuples deleted from the stored tuple count and use that.
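As a small illustrative sketch (variable names assumed, not necessarily those
used in hashbulkdelete), the correction rule amounts to:

    if (cur_maxbucket == orig_maxbucket && cur_ntuples == orig_ntuples)
        new_ntuples = num_index_tuples;             /* trust the scan's count */
    else
        new_ntuples = cur_ntuples - tuples_removed; /* fall back to stored count */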
Free Space Management
@ -417,13 +452,11 @@ free page; there can be no other process holding lock on it.
Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
exclusive lock on the new bucket.
buffer content lock in exclusive mode on the new bucket.
Freeing an overflow page is done by garbage collection and by bucket
splitting (the old bucket may contain no-longer-needed overflow pages).
In both cases, the process holds exclusive lock on the containing bucket,
so need not worry about other accessors of pages in the bucket. The
algorithm is:
Freeing an overflow page requires the process to hold a buffer content lock in
exclusive mode on the containing bucket, so it need not worry about other
accessors of pages in the bucket. The algorithm is:
delink overflow page from bucket chain
(this requires read/update/write/release of fore and aft siblings)
@ -454,14 +487,6 @@ locks. Since they need no lmgr locks, deadlock is not possible.
Other Notes
-----------
All the shenanigans with locking prevent a split occurring while *another*
process is stopped in a given bucket. They do not ensure that one of
our *own* backend's scans is not stopped in the bucket, because lmgr
doesn't consider a process's own locks to conflict. So the Split
algorithm must check for that case separately before deciding it can go
ahead with the split. VACUUM does not have this problem since nothing
else can be happening within the vacuuming backend.
Should we instead try to fix the state of any conflicting local scan?
Seems mighty ugly --- got to move the held bucket S-lock as well as lots
of other messiness. For now, just punt and don't split.
Cleanup locks prevent a split from occurring while *another* process is
stopped in a given bucket. They also ensure that one of our *own* backend's
scans is not stopped in the bucket.

src/backend/access/hash/hash.c

@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
/*
* An insertion into the current index page could have happened while
* we didn't have read lock on it. Re-find our position by looking
* for the TID we previously returned. (Because we hold share lock on
* the bucket, no deletions or splits could have occurred; therefore
* we can expect that the TID still exists in the current index page,
* at an offset >= where we were.)
* for the TID we previously returned. (Because we hold a pin on the
* primary bucket page, no deletions or splits could have occurred;
* therefore we can expect that the TID still exists in the current
* index page, at an offset >= where we were.)
*/
OffsetNumber maxoffnum;
@ -424,17 +424,17 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
scan = RelationGetIndexScan(rel, nkeys, norderbys);
so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
so->hashso_bucket_valid = false;
so->hashso_bucket_blkno = 0;
so->hashso_curbuf = InvalidBuffer;
so->hashso_bucket_buf = InvalidBuffer;
so->hashso_split_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
ItemPointerSetInvalid(&(so->hashso_heappos));
scan->opaque = so;
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
/* register scan in case we change pages it's using */
_hash_regscan(scan);
scan->opaque = so;
return scan;
}
@ -449,15 +449,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
/* release any pin we still hold */
if (BufferIsValid(so->hashso_curbuf))
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
/* release lock on bucket, too */
if (so->hashso_bucket_blkno)
_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
so->hashso_bucket_blkno = 0;
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
@ -469,8 +461,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
memmove(scan->keyData,
scankey,
scan->numberOfKeys * sizeof(ScanKeyData));
so->hashso_bucket_valid = false;
}
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
}
/*
@ -482,18 +476,7 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
/* don't need scan registered anymore */
_hash_dropscan(scan);
/* release any pin we still hold */
if (BufferIsValid(so->hashso_curbuf))
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
/* release lock on bucket, too */
if (so->hashso_bucket_blkno)
_hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
so->hashso_bucket_blkno = 0;
_hash_dropscanbuf(rel, so);
pfree(so);
scan->opaque = NULL;
@ -504,6 +487,9 @@ hashendscan(IndexScanDesc scan)
* The set of target tuples is specified via a callback routine that tells
* whether any given heap tuple (identified by ItemPointer) is being deleted.
*
* This function also deletes the tuples that were moved by a split to
* another bucket.
*
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
@ -548,83 +534,47 @@ loop_top:
{
BlockNumber bucket_blkno;
BlockNumber blkno;
bool bucket_dirty = false;
Buffer bucket_buf;
Buffer buf;
HashPageOpaque bucket_opaque;
Page page;
bool split_cleanup = false;
/* Get address of bucket's start page */
bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
/* Exclusive-lock the bucket so we can shrink it */
_hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
/* Shouldn't have any active scans locally, either */
if (_hash_has_active_scan(rel, cur_bucket))
elog(ERROR, "hash index has active scan during VACUUM");
/* Scan each page in bucket */
blkno = bucket_blkno;
while (BlockNumberIsValid(blkno))
{
Buffer buf;
Page page;
HashPageOpaque opaque;
OffsetNumber offno;
OffsetNumber maxoffno;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
vacuum_delay_point();
/*
* We need to acquire a cleanup lock on the primary bucket page to wait
* out concurrent scans before deleting the dead tuples.
*/
buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
info->strategy);
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == cur_bucket);
page = BufferGetPage(buf);
bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
/* Scan each tuple in page */
maxoffno = PageGetMaxOffsetNumber(page);
for (offno = FirstOffsetNumber;
offno <= maxoffno;
offno = OffsetNumberNext(offno))
{
IndexTuple itup;
ItemPointer htup;
/*
* If the bucket contains tuples that were moved by a split, then we need
* to delete such tuples. We can't delete them if the split operation on
* the bucket is not finished, as they are still needed by scans.
*/
if (!H_BUCKET_BEING_SPLIT(bucket_opaque) &&
H_NEEDS_SPLIT_CLEANUP(bucket_opaque))
split_cleanup = true;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offno));
htup = &(itup->t_tid);
if (callback(htup, callback_state))
{
/* mark the item for deletion */
deletable[ndeletable++] = offno;
tuples_removed += 1;
}
else
num_index_tuples += 1;
}
bucket_buf = buf;
/*
* Apply deletions and write page if needed, advance to next page.
*/
blkno = opaque->hasho_nextblkno;
hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, info->strategy,
local_metapage.hashm_maxbucket,
local_metapage.hashm_highmask,
local_metapage.hashm_lowmask, &tuples_removed,
&num_index_tuples, split_cleanup,
callback, callback_state);
if (ndeletable > 0)
{
PageIndexMultiDelete(page, deletable, ndeletable);
_hash_wrtbuf(rel, buf);
bucket_dirty = true;
}
else
_hash_relbuf(rel, buf);
}
/* If we deleted anything, try to compact free space */
if (bucket_dirty)
_hash_squeezebucket(rel, cur_bucket, bucket_blkno,
info->strategy);
/* Release bucket lock */
_hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
_hash_dropbuf(rel, bucket_buf);
/* Advance to next bucket */
cur_bucket++;
@ -705,6 +655,210 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
return stats;
}
/*
* Helper function to perform deletion of index entries from a bucket.
*
* This function expects that the caller has acquired a cleanup lock on the
* primary bucket page, and will return with a write lock again held on the
* primary bucket page. The lock won't necessarily be held continuously,
* though, because we'll release it when visiting overflow pages.
*
* It would be very bad if this function cleaned a page while some other
* backend was in the midst of scanning it, because hashgettuple assumes
* that the next valid TID will be greater than or equal to the current
* valid TID. There can't be any concurrent scans in progress when we first
* enter this function because of the cleanup lock we hold on the primary
* bucket page, but as soon as we release that lock, there might be. We
* handle that by conspiring to prevent those scans from passing our cleanup
* scan. To do that, we lock the next page in the bucket chain before
* releasing the lock on the previous page. (This type of lock chaining is
* not ideal, so we might want to look for a better solution at some point.)
*
* We need to retain a pin on the primary bucket to ensure that no concurrent
* split can start.
*/
void
hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
uint32 maxbucket, uint32 highmask, uint32 lowmask,
double *tuples_removed, double *num_index_tuples,
bool split_cleanup,
IndexBulkDeleteCallback callback, void *callback_state)
{
BlockNumber blkno;
Buffer buf;
Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
bool bucket_dirty = false;
blkno = bucket_blkno;
buf = bucket_buf;
if (split_cleanup)
new_bucket = _hash_get_newbucket_from_oldbucket(rel, cur_bucket,
lowmask, maxbucket);
/* Scan each page in bucket */
for (;;)
{
HashPageOpaque opaque;
OffsetNumber offno;
OffsetNumber maxoffno;
Buffer next_buf;
Page page;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
bool retain_pin = false;
bool curr_page_dirty = false;
vacuum_delay_point();
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
/* Scan each tuple in page */
maxoffno = PageGetMaxOffsetNumber(page);
for (offno = FirstOffsetNumber;
offno <= maxoffno;
offno = OffsetNumberNext(offno))
{
ItemPointer htup;
IndexTuple itup;
Bucket bucket;
bool kill_tuple = false;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offno));
htup = &(itup->t_tid);
/*
* To remove the dead tuples, we strictly rely on the results of the
* callback function; see btvacuumpage for the detailed reason.
*/
if (callback && callback(htup, callback_state))
{
kill_tuple = true;
if (tuples_removed)
*tuples_removed += 1;
}
else if (split_cleanup)
{
/* delete the tuples that are moved by split. */
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket,
highmask,
lowmask);
/* mark the item for deletion */
if (bucket != cur_bucket)
{
/*
* We expect tuples to belong to either the current bucket or
* new_bucket. This is ensured because we don't allow
* further splits from a bucket that contains garbage. See
* comments in _hash_expandtable.
*/
Assert(bucket == new_bucket);
kill_tuple = true;
}
}
if (kill_tuple)
{
/* mark the item for deletion */
deletable[ndeletable++] = offno;
}
else
{
/* we're keeping it, so count it */
if (num_index_tuples)
*num_index_tuples += 1;
}
}
/* retain the pin on primary bucket page till end of bucket scan */
if (blkno == bucket_blkno)
retain_pin = true;
else
retain_pin = false;
blkno = opaque->hasho_nextblkno;
/*
* Apply deletions, advance to next page and write page if needed.
*/
if (ndeletable > 0)
{
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
curr_page_dirty = true;
}
/* bail out if there are no more pages to scan. */
if (!BlockNumberIsValid(blkno))
break;
next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
LH_OVERFLOW_PAGE,
bstrategy);
/*
* release the lock on previous page after acquiring the lock on next
* page
*/
if (curr_page_dirty)
{
if (retain_pin)
_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
else
_hash_wrtbuf(rel, buf);
curr_page_dirty = false;
}
else if (retain_pin)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, buf);
buf = next_buf;
}
/*
* lock the bucket page to clear the garbage flag and squeeze the bucket.
* if the current buffer is the same as the bucket buffer, then we
* already have a lock on the bucket page.
*/
if (buf != bucket_buf)
{
_hash_relbuf(rel, buf);
_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
}
/*
* Clear the garbage flag from the bucket after deleting the tuples that
* were moved by a split. We purposely clear the flag before squeezing the
* bucket, so that after a restart, vacuum won't again try to delete the
* moved-by-split tuples.
*/
if (split_cleanup)
{
HashPageOpaque bucket_opaque;
Page page;
page = BufferGetPage(bucket_buf);
bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
}
/*
* If we have deleted anything, try to compact free space. For squeezing
* the bucket, we must have a cleanup lock, else the squeeze could disturb
* the ordering of tuples expected by a scan that started before it.
*/
if (bucket_dirty && IsBufferCleanupOK(bucket_buf))
_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
bstrategy);
else
_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
}
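
For reference, a hedged sketch of how a caller uses this function for
split-cleanup work, mirroring the call added to _hash_expandtable later in
this diff (the caller already holds a cleanup lock on the old bucket's
primary page):

    if (H_NEEDS_SPLIT_CLEANUP(oopaque))
        hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL,
                          metap->hashm_maxbucket, metap->hashm_highmask,
                          metap->hashm_lowmask, NULL, NULL,
                          true,             /* split_cleanup */
                          NULL, NULL);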
void
hash_redo(XLogReaderState *record)

src/backend/access/hash/hashinsert.c

@ -28,18 +28,22 @@
void
_hash_doinsert(Relation rel, IndexTuple itup)
{
Buffer buf;
Buffer buf = InvalidBuffer;
Buffer bucket_buf;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
BlockNumber oldblkno = InvalidBlockNumber;
bool retry = false;
BlockNumber oldblkno;
bool retry;
Page page;
HashPageOpaque pageopaque;
Size itemsz;
bool do_expand;
uint32 hashkey;
Bucket bucket;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
/*
* Get the hash key for the item (it's stored in the index tuple itself).
@ -51,6 +55,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
restart_insert:
/* Read the metapage */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
metap = HashPageGetMeta(BufferGetPage(metabuf));
@ -69,6 +74,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
itemsz, HashMaxItemSize((Page) metap)),
errhint("Values larger than a buffer page cannot be indexed.")));
oldblkno = InvalidBlockNumber;
retry = false;
/*
* Loop until we get a lock on the correct target bucket.
*/
@ -84,21 +92,32 @@ _hash_doinsert(Relation rel, IndexTuple itup)
blkno = BUCKET_TO_BLKNO(metap, bucket);
/*
* Copy bucket mapping info now; refer the comment in
* _hash_expandtable where we copy this information before calling
* _hash_splitbucket to see why this is okay.
*/
maxbucket = metap->hashm_maxbucket;
highmask = metap->hashm_highmask;
lowmask = metap->hashm_lowmask;
/* Release metapage lock, but keep pin. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
/*
* If the previous iteration of this loop locked what is still the
* correct target bucket, we are done. Otherwise, drop any old lock
* and lock what now appears to be the correct bucket.
* If the previous iteration of this loop locked the primary page of
* what is still the correct target bucket, we are done. Otherwise,
* drop any old lock before acquiring the new one.
*/
if (retry)
{
if (oldblkno == blkno)
break;
_hash_droplock(rel, oldblkno, HASH_SHARE);
_hash_relbuf(rel, buf);
}
_hash_getlock(rel, blkno, HASH_SHARE);
/* Fetch and lock the primary bucket page for the target bucket */
buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
/*
* Reacquire metapage lock and check that no bucket split has taken
@ -109,12 +128,36 @@ _hash_doinsert(Relation rel, IndexTuple itup)
retry = true;
}
/* Fetch the primary bucket page for the bucket */
buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
/* remember the primary bucket buffer to release the pin on it at end. */
bucket_buf = buf;
page = BufferGetPage(buf);
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(pageopaque->hasho_bucket == bucket);
/*
* If this bucket is in the process of being split, try to finish the
* split before inserting, because that might create room for the
* insertion to proceed without allocating an additional overflow page.
* It's only interesting to finish the split if we're trying to insert
* into the bucket from which we're removing tuples (the "old" bucket),
* not if we're trying to insert into the bucket into which tuples are
* being moved (the "new" bucket).
*/
if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
{
/* release the lock on bucket buffer, before completing the split. */
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
_hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
maxbucket, highmask, lowmask);
/* release the pin on old and meta buffer. retry for insert. */
_hash_dropbuf(rel, buf);
_hash_dropbuf(rel, metabuf);
goto restart_insert;
}
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
@ -127,9 +170,15 @@ _hash_doinsert(Relation rel, IndexTuple itup)
{
/*
* ovfl page exists; go get it. if it doesn't have room, we'll
* find out next pass through the loop test above.
* find out next pass through the loop test above. we always
* release both the lock and pin if this is an overflow page, but
* only the lock if this is the primary bucket page, since the pin
* on the primary bucket must be retained throughout the scan.
*/
_hash_relbuf(rel, buf);
if (buf != bucket_buf)
_hash_relbuf(rel, buf);
else
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
page = BufferGetPage(buf);
}
@ -144,7 +193,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
/* chain to a new overflow page */
buf = _hash_addovflpage(rel, metabuf, buf);
buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf) ? true : false);
page = BufferGetPage(buf);
/* should fit now, given test above */
@ -158,11 +207,14 @@ _hash_doinsert(Relation rel, IndexTuple itup)
/* found page with enough space, so add the item here */
(void) _hash_pgaddtup(rel, buf, itemsz, itup);
/* write and release the modified page */
/*
* write and release the modified page. if the page we modified was an
* overflow page, we also need to separately drop the pin we retained on
* the primary bucket page.
*/
_hash_wrtbuf(rel, buf);
/* We can drop the bucket lock now */
_hash_droplock(rel, blkno, HASH_SHARE);
if (buf != bucket_buf)
_hash_dropbuf(rel, bucket_buf);
/*
* Write-lock the metapage so we can increment the tuple count. After

src/backend/access/hash/hashovfl.c

@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
*
* On entry, the caller must hold a pin but no lock on 'buf'. The pin is
* dropped before exiting (we assume the caller is not interested in 'buf'
* anymore). The returned overflow page will be pinned and write-locked;
* it is guaranteed to be empty.
* anymore) if not asked to retain. The pin will be retained only for the
* primary bucket. The returned overflow page will be pinned and
* write-locked; it is guaranteed to be empty.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* That buffer is returned in the same state.
*
* The caller must hold at least share lock on the bucket, to ensure that
* no one else tries to compact the bucket meanwhile. This guarantees that
* 'buf' won't stop being part of the bucket while it's unlocked.
*
* NB: since this could be executed concurrently by multiple processes,
* one should not assume that the returned overflow page will be the
* immediate successor of the originally passed 'buf'. Additional overflow
* pages might have been added to the bucket chain in between.
*/
Buffer
_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
{
Buffer ovflbuf;
Page page;
@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
break;
/* we assume we do not need to write the unmodified page */
_hash_relbuf(rel, buf);
if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
}
@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
/* logically chain overflow page to previous page */
pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
_hash_wrtbuf(rel, buf);
if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
else
_hash_wrtbuf(rel, buf);
return ovflbuf;
}
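
A hedged usage sketch of the new retain_pin argument, following the call made
from _hash_doinsert earlier in this diff:

    /* chain a new overflow page onto the bucket; keep our pin only if 'buf'
     * is the primary bucket page, whose pin must be held throughout */
    buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf));
    page = BufferGetPage(buf);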
@ -369,21 +372,25 @@ _hash_firstfreebit(uint32 map)
* Returns the block number of the page that followed the given page
* in the bucket, or InvalidBlockNumber if no following page.
*
* NB: caller must not hold lock on metapage, nor on either page that's
* adjacent in the bucket chain. The caller had better hold exclusive lock
* on the bucket, too.
* NB: caller must not hold a lock on the metapage, nor on the page that is
* next to ovflbuf in the bucket chain. We don't acquire a lock on the page
* prior to ovflbuf in the chain if it is the same as wbuf, because the caller
* already has a lock on it. This function releases the lock on wbuf, and the
* caller is responsible for releasing the pin on it.
*/
BlockNumber
_hash_freeovflpage(Relation rel, Buffer ovflbuf,
BufferAccessStrategy bstrategy)
_hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
bool wbuf_dirty, BufferAccessStrategy bstrategy)
{
HashMetaPage metap;
Buffer metabuf;
Buffer mapbuf;
Buffer prevbuf = InvalidBuffer;
BlockNumber ovflblkno;
BlockNumber prevblkno;
BlockNumber blkno;
BlockNumber nextblkno;
BlockNumber writeblkno;
HashPageOpaque ovflopaque;
Page ovflpage;
Page mappage;
@ -400,6 +407,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
nextblkno = ovflopaque->hasho_nextblkno;
prevblkno = ovflopaque->hasho_prevblkno;
writeblkno = BufferGetBlockNumber(wbuf);
bucket = ovflopaque->hasho_bucket;
/*
@ -413,23 +421,39 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
/*
* Fix up the bucket chain. this is a doubly-linked list, so we must fix
* up the bucket chain members behind and ahead of the overflow page being
* deleted. No concurrency issues since we hold exclusive lock on the
* entire bucket.
* deleted. Concurrency issues are avoided by using lock chaining as
* described atop hashbucketcleanup.
*/
if (BlockNumberIsValid(prevblkno))
{
Buffer prevbuf = _hash_getbuf_with_strategy(rel,
prevblkno,
HASH_WRITE,
Page prevpage;
HashPageOpaque prevopaque;
if (prevblkno == writeblkno)
prevbuf = wbuf;
else
prevbuf = _hash_getbuf_with_strategy(rel,
prevblkno,
HASH_WRITE,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
bstrategy);
Page prevpage = BufferGetPage(prevbuf);
HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
bstrategy);
prevpage = BufferGetPage(prevbuf);
prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
Assert(prevopaque->hasho_bucket == bucket);
prevopaque->hasho_nextblkno = nextblkno;
_hash_wrtbuf(rel, prevbuf);
if (prevblkno != writeblkno)
_hash_wrtbuf(rel, prevbuf);
}
/* write and unlock the write buffer */
if (wbuf_dirty)
_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
else
_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
if (BlockNumberIsValid(nextblkno))
{
Buffer nextbuf = _hash_getbuf_with_strategy(rel,
@ -570,8 +594,15 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
* required that to be true on entry as well, but it's a lot easier for
* callers to leave empty overflow pages and let this guy clean it up.
*
* Caller must hold exclusive lock on the target bucket. This allows
* us to safely lock multiple pages in the bucket.
* Caller must acquire cleanup lock on the primary page of the target
* bucket to exclude any scans that are in progress, which could easily
* be confused into returning the same tuple more than once or some tuples
* not at all by the rearrangement we are performing here. To prevent any
* concurrent scan from crossing the squeeze scan, we use lock chaining
* similar to hashbucketcleanup; see the comments atop hashbucketcleanup.
*
* We need to retain a pin on the primary bucket to ensure that no concurrent
* split can start.
*
* Since this function is invoked in VACUUM, we provide an access strategy
* parameter that controls fetches of the bucket pages.
@ -580,6 +611,7 @@ void
_hash_squeezebucket(Relation rel,
Bucket bucket,
BlockNumber bucket_blkno,
Buffer bucket_buf,
BufferAccessStrategy bstrategy)
{
BlockNumber wblkno;
@ -593,23 +625,20 @@ _hash_squeezebucket(Relation rel,
bool wbuf_dirty;
/*
* start squeezing into the base bucket page.
* start squeezing into the primary bucket page.
*/
wblkno = bucket_blkno;
wbuf = _hash_getbuf_with_strategy(rel,
wblkno,
HASH_WRITE,
LH_BUCKET_PAGE,
bstrategy);
wbuf = bucket_buf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
/*
* if there aren't any overflow pages, there's nothing to squeeze.
* if there aren't any overflow pages, there's nothing to squeeze. caller
* is responsible for releasing the pin on primary bucket page.
*/
if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
{
_hash_relbuf(rel, wbuf);
_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
return;
}
@ -646,6 +675,7 @@ _hash_squeezebucket(Relation rel,
OffsetNumber maxroffnum;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
bool retain_pin = false;
/* Scan each tuple in "read" page */
maxroffnum = PageGetMaxOffsetNumber(rpage);
@ -671,13 +701,37 @@ _hash_squeezebucket(Relation rel,
*/
while (PageGetFreeSpace(wpage) < itemsz)
{
Buffer next_wbuf = InvalidBuffer;
Assert(!PageIsEmpty(wpage));
if (wblkno == bucket_blkno)
retain_pin = true;
wblkno = wopaque->hasho_nextblkno;
Assert(BlockNumberIsValid(wblkno));
/* don't need to move to next page if we reached the read page */
if (wblkno != rblkno)
next_wbuf = _hash_getbuf_with_strategy(rel,
wblkno,
HASH_WRITE,
LH_OVERFLOW_PAGE,
bstrategy);
/*
* release the lock on previous page after acquiring the lock
* on next page
*/
if (wbuf_dirty)
_hash_wrtbuf(rel, wbuf);
{
if (retain_pin)
_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
else
_hash_wrtbuf(rel, wbuf);
}
else if (retain_pin)
_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, wbuf);
@ -695,15 +749,12 @@ _hash_squeezebucket(Relation rel,
return;
}
wbuf = _hash_getbuf_with_strategy(rel,
wblkno,
HASH_WRITE,
LH_OVERFLOW_PAGE,
bstrategy);
wbuf = next_wbuf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
Assert(wopaque->hasho_bucket == bucket);
wbuf_dirty = false;
retain_pin = false;
}
/*
@ -728,28 +779,29 @@ _hash_squeezebucket(Relation rel,
* Tricky point here: if our read and write pages are adjacent in the
* bucket chain, our write lock on wbuf will conflict with
* _hash_freeovflpage's attempt to update the sibling links of the
* removed page. However, in that case we are done anyway, so we can
* simply drop the write lock before calling _hash_freeovflpage.
* removed page. In that case, we don't need to lock it again and we
* always release the lock on wbuf in _hash_freeovflpage and then
* retake it here. This not only simplifies the code, but is also
* required so that the changes can be logged atomically, which will be
* helpful when we write WAL for hash indexes.
*/
rblkno = ropaque->hasho_prevblkno;
Assert(BlockNumberIsValid(rblkno));
/* free this overflow page (releases rbuf) */
_hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
/* are we freeing the page adjacent to wbuf? */
if (rblkno == wblkno)
{
/* yes, so release wbuf lock first */
if (wbuf_dirty)
_hash_wrtbuf(rel, wbuf);
else
_hash_relbuf(rel, wbuf);
/* free this overflow page (releases rbuf) */
_hash_freeovflpage(rel, rbuf, bstrategy);
/* done */
/* retain the pin on primary bucket page till end of bucket scan */
if (wblkno != bucket_blkno)
_hash_dropbuf(rel, wbuf);
return;
}
/* free this overflow page, then get the previous one */
_hash_freeovflpage(rel, rbuf, bstrategy);
/* lock the overflow page being written, then get the previous one */
_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
rbuf = _hash_getbuf_with_strategy(rel,
rblkno,

src/backend/access/hash/hashpage.c

@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
uint32 nblocks);
static void _hash_splitbucket(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket,
BlockNumber start_oblkno,
Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask, uint32 lowmask);
static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket, Buffer obuf,
Buffer nbuf, HTAB *htab, uint32 maxbucket,
uint32 highmask, uint32 lowmask);
/*
@ -54,46 +58,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
#define USELOCKING(rel) (!RELATION_IS_LOCAL(rel))
/*
* _hash_getlock() -- Acquire an lmgr lock.
*
* 'whichlock' should the block number of a bucket's primary bucket page to
* acquire the per-bucket lock. (See README for details of the use of these
* locks.)
*
* 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
*/
void
_hash_getlock(Relation rel, BlockNumber whichlock, int access)
{
if (USELOCKING(rel))
LockPage(rel, whichlock, access);
}
/*
* _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
*
* Same as above except we return FALSE without blocking if lock isn't free.
*/
bool
_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
{
if (USELOCKING(rel))
return ConditionalLockPage(rel, whichlock, access);
else
return true;
}
/*
* _hash_droplock() -- Release an lmgr lock.
*/
void
_hash_droplock(Relation rel, BlockNumber whichlock, int access)
{
if (USELOCKING(rel))
UnlockPage(rel, whichlock, access);
}
/*
* _hash_getbuf() -- Get a buffer by block number for read or write.
*
@ -131,6 +95,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
return buf;
}
/*
* _hash_getbuf_with_condlock_cleanup() -- Try to get a buffer for cleanup.
*
* We read the page and try to acquire a cleanup lock. If we get it,
* we return the buffer; otherwise, we return InvalidBuffer.
*/
Buffer
_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
{
Buffer buf;
if (blkno == P_NEW)
elog(ERROR, "hash AM does not use P_NEW");
buf = ReadBuffer(rel, blkno);
if (!ConditionalLockBufferForCleanup(buf))
{
ReleaseBuffer(buf);
return InvalidBuffer;
}
/* ref count and lock type are correct */
_hash_checkpage(rel, buf, flags);
return buf;
}
/*
* _hash_getinitbuf() -- Get and initialize a buffer by block number.
*
@ -265,6 +258,37 @@ _hash_dropbuf(Relation rel, Buffer buf)
ReleaseBuffer(buf);
}
/*
* _hash_dropscanbuf() -- release buffers used in scan.
*
* This routine unpins the buffers used during the scan on which we
* hold no lock.
*/
void
_hash_dropscanbuf(Relation rel, HashScanOpaque so)
{
/* release pin we hold on primary bucket page */
if (BufferIsValid(so->hashso_bucket_buf) &&
so->hashso_bucket_buf != so->hashso_curbuf)
_hash_dropbuf(rel, so->hashso_bucket_buf);
so->hashso_bucket_buf = InvalidBuffer;
/* release pin we hold on primary bucket page of bucket being split */
if (BufferIsValid(so->hashso_split_bucket_buf) &&
so->hashso_split_bucket_buf != so->hashso_curbuf)
_hash_dropbuf(rel, so->hashso_split_bucket_buf);
so->hashso_split_bucket_buf = InvalidBuffer;
/* release any pin we still hold */
if (BufferIsValid(so->hashso_curbuf))
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
/* reset split scan */
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
}
/*
* _hash_wrtbuf() -- write a hash page to disk.
*
@ -489,9 +513,11 @@ _hash_pageinit(Page page, Size size)
/*
* Attempt to expand the hash table by creating one new bucket.
*
* This will silently do nothing if it cannot get the needed locks.
* This will silently do nothing if we can't get a cleanup lock on the old
* or new bucket.
*
* Complete any pending split and remove tuples left over in the old bucket
* from a previous split, if there are any.
* if there are any left over from the previous split.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state.
@ -506,10 +532,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
BlockNumber start_oblkno;
BlockNumber start_nblkno;
Buffer buf_nblkno;
Buffer buf_oblkno;
Page opage;
HashPageOpaque oopaque;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
restart_expand:
/*
* Write-lock the meta page. It used to be necessary to acquire a
* heavyweight lock to begin a split, but that is no longer required.
@ -548,11 +579,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
goto fail;
/*
* Determine which bucket is to be split, and attempt to lock the old
* bucket. If we can't get the lock, give up.
* Determine which bucket is to be split, and attempt to take cleanup lock
* on the old bucket. If we can't get the lock, give up.
*
* The lock protects us against other backends, but not against our own
* backend. Must check for active scans separately.
* The cleanup lock protects us not only against other backends, but
* against our own backend as well.
*
* The cleanup lock is mainly to protect the split from concurrent
* inserts. See src/backend/access/hash/README, Lock Definitions for
* further details. Due to this locking restriction, if there is any
* pending scan, the split will give up, which is not good, but harmless.
*/
new_bucket = metap->hashm_maxbucket + 1;
@ -560,14 +596,78 @@ _hash_expandtable(Relation rel, Buffer metabuf)
start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
if (_hash_has_active_scan(rel, old_bucket))
buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
if (!buf_oblkno)
goto fail;
if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
goto fail;
opage = BufferGetPage(buf_oblkno);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
/*
* Likewise lock the new bucket (should never fail).
* We want to finish the pending split from a bucket, as there is no
* apparent benefit in not doing so, and it would complicate the code to
* finish splits involving multiple buckets if a new split also failed.
* We don't need to consider the new bucket for completing the split here,
* because a re-split of the new bucket cannot start while there is still
* a pending split from the old bucket.
*/
if (H_BUCKET_BEING_SPLIT(oopaque))
{
/*
* Copy bucket mapping info now; refer to the comment in the code below where
* we copy this information before calling _hash_splitbucket to see
* why this is okay.
*/
maxbucket = metap->hashm_maxbucket;
highmask = metap->hashm_highmask;
lowmask = metap->hashm_lowmask;
/*
* Release the lock on the metapage and old_bucket before completing the
* split.
*/
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
_hash_chgbufaccess(rel, buf_oblkno, HASH_READ, HASH_NOLOCK);
_hash_finish_split(rel, metabuf, buf_oblkno, old_bucket, maxbucket,
highmask, lowmask);
/* release the pin on old buffer and retry for expand. */
_hash_dropbuf(rel, buf_oblkno);
goto restart_expand;
}
/*
* Clean up the tuples remaining from the previous split. This operation
* requires a cleanup lock, and we already have one on the old bucket, so
* let's do it. We also don't want to allow further splits from the bucket
* till the garbage of the previous split is cleaned. This has two
* advantages: first, it helps avoid bloat due to garbage; second, during
* cleanup of the bucket, we are always sure that the garbage tuples belong
* to the most recently split bucket. By contrast, if we allowed cleanup of
* the bucket after the meta page is updated to indicate the new split but
* before the actual split, the cleanup operation wouldn't be able to decide
* whether a tuple had been moved to the newly created bucket, and might end
* up deleting such tuples.
*/
if (H_NEEDS_SPLIT_CLEANUP(oopaque))
{
/* Release the metapage lock. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL,
metap->hashm_maxbucket, metap->hashm_highmask,
metap->hashm_lowmask, NULL,
NULL, true, NULL, NULL);
_hash_dropbuf(rel, buf_oblkno);
goto restart_expand;
}
/*
* There shouldn't be any active scan on new bucket.
*
* Note: it is safe to compute the new bucket's blkno here, even though we
* may still need to update the BUCKET_TO_BLKNO mapping. This is because
@ -576,12 +676,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
*/
start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
if (_hash_has_active_scan(rel, new_bucket))
elog(ERROR, "scan in progress on supposedly new bucket");
if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
elog(ERROR, "could not get lock on supposedly new bucket");
/*
* If the split point is increasing (hashm_maxbucket's log base 2
* increases), we need to allocate a new batch of bucket pages.
@ -600,8 +694,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
{
/* can't split due to BlockNumber overflow */
_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
_hash_relbuf(rel, buf_oblkno);
goto fail;
}
}
@ -609,9 +702,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
* disk space.
* disk space. Ideally, we wouldn't need to check for a cleanup lock on the
* new bucket, as no other backend can find this bucket until the meta page
* is updated. However, it is good to be consistent with the old bucket's
* locking.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
if (!IsBufferCleanupOK(buf_nblkno))
{
_hash_relbuf(rel, buf_oblkno);
_hash_relbuf(rel, buf_nblkno);
goto fail;
}
/*
* Okay to proceed with split. Update the metapage bucket mapping info.
@ -665,13 +767,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/* Relocate records to the new bucket */
_hash_splitbucket(rel, metabuf,
old_bucket, new_bucket,
start_oblkno, buf_nblkno,
buf_oblkno, buf_nblkno,
maxbucket, highmask, lowmask);
/* Release bucket locks, allowing others to access them */
_hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
_hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
return;
/* Here if decide not to split or fail to acquire old bucket lock */
@ -738,13 +836,17 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
* belong in the new bucket, and compress out any free space in the old
* bucket.
*
* The caller must hold exclusive locks on both buckets to ensure that
* The caller must hold cleanup locks on both buckets to ensure that
* no one else is trying to access them (see README).
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state. (The metapage is only
* touched if it becomes necessary to add or remove overflow pages.)
*
* The split needs to retain pins on the primary bucket pages of both the old
* and new buckets till the end of the operation. This is to prevent vacuum
* from starting while a split is in progress.
*
* In addition, the caller must have created the new bucket's base page,
* which is passed in buffer nbuf, pinned and write-locked. That lock and
* pin are released here. (The API is set up this way because we must do
@ -756,37 +858,86 @@ _hash_splitbucket(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
BlockNumber start_oblkno,
Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
Buffer obuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
/*
* It should be okay to simultaneously write-lock pages from each bucket,
* since no one else can be trying to acquire buffer lock on pages of
* either bucket.
*/
obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
/*
* Mark the old bucket to indicate that a split is in progress. At the end
* of the operation, we clear the split-in-progress flag.
*/
oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
npage = BufferGetPage(nbuf);
/* initialize the new bucket's primary page */
/*
* Initialize the new bucket's primary page and mark it to indicate that a
* split is in progress.
*/
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nopaque->hasho_prevblkno = InvalidBlockNumber;
nopaque->hasho_nextblkno = InvalidBlockNumber;
nopaque->hasho_bucket = nbucket;
nopaque->hasho_flag = LH_BUCKET_PAGE;
nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
nopaque->hasho_page_id = HASHO_PAGE_ID;
_hash_splitbucket_guts(rel, metabuf, obucket,
nbucket, obuf, nbuf, NULL,
maxbucket, highmask, lowmask);
/* all done, now release the locks and pins on primary buckets. */
_hash_relbuf(rel, obuf);
_hash_relbuf(rel, nbuf);
}
/*
* _hash_splitbucket_guts -- Helper function to perform the split operation
*
* This routine is used to partition the tuples between the old and new
* buckets and to finish incomplete split operations. To finish a previously
* interrupted split, the caller needs to fill htab. If htab is set, we skip
* moving the tuples that are present in htab; otherwise a NULL htab means
* that all tuples belonging to the new bucket are moved.
*
* Caller needs to lock and unlock the old and new primary buckets.
*/
static void
_hash_splitbucket_guts(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
Buffer obuf,
Buffer nbuf,
HTAB *htab,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
Buffer bucket_obuf;
Buffer bucket_nbuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
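/* remember the primary bucket pages, since obuf/nbuf will advance to overflow pages below */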
bucket_obuf = obuf;
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
bucket_nbuf = nbuf;
npage = BufferGetPage(nbuf);
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
/*
* Partition the tuples in the old bucket between the old bucket and the
* new bucket, advancing along the old bucket's overflow bucket chain and
@ -798,8 +949,6 @@ _hash_splitbucket(Relation rel,
BlockNumber oblkno;
OffsetNumber ooffnum;
OffsetNumber omaxoffnum;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
/* Scan each tuple in old page */
omaxoffnum = PageGetMaxOffsetNumber(opage);
@ -810,33 +959,52 @@ _hash_splitbucket(Relation rel,
IndexTuple itup;
Size itemsz;
Bucket bucket;
bool found = false;
/* skip dead tuples */
if (ItemIdIsDead(PageGetItemId(opage, ooffnum)))
continue;
/*
* Fetch the item's hash key (conveniently stored in the item) and
* determine which bucket it now belongs in.
* Before inserting a tuple, probe the hash table containing TIDs
* of tuples belonging to the new bucket; if we find a match,
* skip that tuple, else fetch the item's hash key (conveniently
* stored in the item) and determine which bucket it now belongs
* in.
*/
itup = (IndexTuple) PageGetItem(opage,
PageGetItemId(opage, ooffnum));
if (htab)
(void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
if (found)
continue;
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
{
IndexTuple new_itup;
/*
* make a copy of index tuple as we have to scribble on it.
*/
new_itup = CopyIndexTuple(itup);
/*
* mark the index tuple as moved by split; such tuples are
* skipped by scans if there is a split in progress for the bucket.
*/
new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
/*
* insert the tuple into the new bucket. if it doesn't fit on
* the current page in the new bucket, we must allocate a new
* overflow page and place the tuple on that page instead.
*
* XXX we have a problem here if we fail to get space for a
* new overflow page: we'll error out leaving the bucket split
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.
*/
itemsz = IndexTupleDSize(*itup);
itemsz = IndexTupleDSize(*new_itup);
itemsz = MAXALIGN(itemsz);
if (PageGetFreeSpace(npage) < itemsz)
@ -844,9 +1012,9 @@ _hash_splitbucket(Relation rel,
/* write out nbuf and drop lock, but keep pin */
_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
/* chain to a new overflow page */
nbuf = _hash_addovflpage(rel, metabuf, nbuf);
nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
npage = BufferGetPage(nbuf);
/* we don't need nopaque within the loop */
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
}
/*
@ -856,12 +1024,10 @@ _hash_splitbucket(Relation rel,
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
/*
* Mark tuple for deletion from old page.
*/
deletable[ndeletable++] = ooffnum;
/* be tidy */
pfree(new_itup);
}
else
{
@ -874,15 +1040,9 @@ _hash_splitbucket(Relation rel,
oblkno = oopaque->hasho_nextblkno;
/*
* Done scanning this old page. If we moved any tuples, delete them
* from the old page.
*/
if (ndeletable > 0)
{
PageIndexMultiDelete(opage, deletable, ndeletable);
_hash_wrtbuf(rel, obuf);
}
/* retain the pin on the old primary bucket */
if (obuf == bucket_obuf)
_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, obuf);
@ -891,18 +1051,169 @@ _hash_splitbucket(Relation rel,
break;
/* Else, advance to next old page */
obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
}
/*
* We're at the end of the old bucket chain, so we're done partitioning
* the tuples. Before quitting, call _hash_squeezebucket to ensure the
* tuples remaining in the old bucket (including the overflow pages) are
* packed as tightly as possible. The new bucket is already tight.
* the tuples. Mark the old and new buckets to indicate split is
* finished.
*
* To avoid deadlocks due to locking order of buckets, first lock the old
* bucket and then the new bucket.
*/
_hash_wrtbuf(rel, nbuf);
if (nbuf == bucket_nbuf)
_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
else
_hash_wrtbuf(rel, nbuf);
_hash_squeezebucket(rel, obucket, start_oblkno, NULL);
_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
opage = BufferGetPage(bucket_obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
npage = BufferGetPage(bucket_nbuf);
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
/*
* After the split is finished, mark the old bucket to indicate that it
* contains deletable tuples. Vacuum will clear the split-cleanup flag after
* deleting such tuples.
*/
oopaque->hasho_flag |= LH_BUCKET_NEEDS_SPLIT_CLEANUP;
/*
* Now mark the buffers dirty; we don't release the locks here, as the
* caller is responsible for releasing them.
*/
MarkBufferDirty(bucket_obuf);
MarkBufferDirty(bucket_nbuf);
}
/*
* _hash_finish_split() -- Finish the previously interrupted split operation
*
* To complete the split operation, we build a hash table of the TIDs already
* present in the new bucket, which the split operation then uses to skip
* tuples that were moved before it was interrupted.
*
* The caller must hold a pin, but no lock, on the metapage and the old
* bucket's primary page buffer. The buffers are returned in the same state. (The
* metapage is only touched if it becomes necessary to add or remove overflow
* pages.)
*/
void
_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
uint32 maxbucket, uint32 highmask, uint32 lowmask)
{
HASHCTL hash_ctl;
HTAB *tidhtab;
Buffer bucket_nbuf = InvalidBuffer;
Buffer nbuf;
Page npage;
BlockNumber nblkno;
BlockNumber bucket_nblkno;
HashPageOpaque npageopaque;
Bucket nbucket;
bool found;
/* Initialize hash tables used to track TIDs */
memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(ItemPointerData);
hash_ctl.entrysize = sizeof(ItemPointerData);
hash_ctl.hcxt = CurrentMemoryContext;
tidhtab =
hash_create("bucket ctids",
256, /* arbitrary initial size */
&hash_ctl,
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
bucket_nblkno = nblkno = _hash_get_newblock_from_oldbucket(rel, obucket);
/*
* Scan the new bucket and build hash table of TIDs
*/
for (;;)
{
OffsetNumber noffnum;
OffsetNumber nmaxoffnum;
nbuf = _hash_getbuf(rel, nblkno, HASH_READ,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
/* remember the primary bucket buffer so we can acquire a cleanup lock on it later. */
if (nblkno == bucket_nblkno)
bucket_nbuf = nbuf;
npage = BufferGetPage(nbuf);
npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
/* Scan each tuple in new page */
nmaxoffnum = PageGetMaxOffsetNumber(npage);
for (noffnum = FirstOffsetNumber;
noffnum <= nmaxoffnum;
noffnum = OffsetNumberNext(noffnum))
{
IndexTuple itup;
/* Fetch the item's TID and insert it in hash table. */
itup = (IndexTuple) PageGetItem(npage,
PageGetItemId(npage, noffnum));
(void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
Assert(!found);
}
nblkno = npageopaque->hasho_nextblkno;
/*
* release our lock without modifying the buffer, and make sure to
* retain the pin on the primary bucket.
*/
if (nbuf == bucket_nbuf)
_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, nbuf);
/* Exit loop if no more overflow pages in new bucket */
if (!BlockNumberIsValid(nblkno))
break;
}
/*
* Conditionally get the cleanup lock on the old and new buckets to perform
* the split operation. If we can't get the cleanup locks, silently give
* up; the next insertion on the old bucket will try again to complete the
* split.
*/
if (!ConditionalLockBufferForCleanup(obuf))
{
hash_destroy(tidhtab);
return;
}
if (!ConditionalLockBufferForCleanup(bucket_nbuf))
{
_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
hash_destroy(tidhtab);
return;
}
npage = BufferGetPage(bucket_nbuf);
npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nbucket = npageopaque->hasho_bucket;
_hash_splitbucket_guts(rel, metabuf, obucket,
nbucket, obuf, bucket_nbuf, tidhtab,
maxbucket, highmask, lowmask);
_hash_relbuf(rel, bucket_nbuf);
_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
hash_destroy(tidhtab);
}
@ -1,153 +0,0 @@
/*-------------------------------------------------------------------------
*
* hashscan.c
* manage scans on hash tables
*
* Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*
* IDENTIFICATION
* src/backend/access/hash/hashscan.c
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include "access/hash.h"
#include "access/relscan.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/resowner.h"
/*
* We track all of a backend's active scans on hash indexes using a list
* of HashScanListData structs, which are allocated in TopMemoryContext.
* It's okay to use a long-lived context because we rely on the ResourceOwner
* mechanism to clean up unused entries after transaction or subtransaction
* abort. We can't safely keep the entries in the executor's per-query
* context, because that might be already freed before we get a chance to
* clean up the list. (XXX seems like there should be a better way to
* manage this...)
*/
typedef struct HashScanListData
{
IndexScanDesc hashsl_scan;
ResourceOwner hashsl_owner;
struct HashScanListData *hashsl_next;
} HashScanListData;
typedef HashScanListData *HashScanList;
static HashScanList HashScans = NULL;
/*
* ReleaseResources_hash() --- clean up hash subsystem resources.
*
* This is here because it needs to touch this module's static var HashScans.
*/
void
ReleaseResources_hash(void)
{
HashScanList l;
HashScanList prev;
HashScanList next;
/*
* Release all HashScanList items belonging to the current ResourceOwner.
* Note that we do not release the underlying IndexScanDesc; that's in
* executor memory and will go away on its own (in fact quite possibly has
* gone away already, so we mustn't try to touch it here).
*
* Note: this should be a no-op during normal query shutdown. However, in
* an abort situation ExecutorEnd is not called and so there may be open
* index scans to clean up.
*/
prev = NULL;
for (l = HashScans; l != NULL; l = next)
{
next = l->hashsl_next;
if (l->hashsl_owner == CurrentResourceOwner)
{
if (prev == NULL)
HashScans = next;
else
prev->hashsl_next = next;
pfree(l);
/* prev does not change */
}
else
prev = l;
}
}
/*
* _hash_regscan() -- register a new scan.
*/
void
_hash_regscan(IndexScanDesc scan)
{
HashScanList new_el;
new_el = (HashScanList) MemoryContextAlloc(TopMemoryContext,
sizeof(HashScanListData));
new_el->hashsl_scan = scan;
new_el->hashsl_owner = CurrentResourceOwner;
new_el->hashsl_next = HashScans;
HashScans = new_el;
}
/*
* _hash_dropscan() -- drop a scan from the scan list
*/
void
_hash_dropscan(IndexScanDesc scan)
{
HashScanList chk,
last;
last = NULL;
for (chk = HashScans;
chk != NULL && chk->hashsl_scan != scan;
chk = chk->hashsl_next)
last = chk;
if (chk == NULL)
elog(ERROR, "hash scan list trashed; cannot find 0x%p", (void *) scan);
if (last == NULL)
HashScans = chk->hashsl_next;
else
last->hashsl_next = chk->hashsl_next;
pfree(chk);
}
/*
* Is there an active scan in this bucket?
*/
bool
_hash_has_active_scan(Relation rel, Bucket bucket)
{
Oid relid = RelationGetRelid(rel);
HashScanList l;
for (l = HashScans; l != NULL; l = l->hashsl_next)
{
if (relid == l->hashsl_scan->indexRelation->rd_id)
{
HashScanOpaque so = (HashScanOpaque) l->hashsl_scan->opaque;
if (so->hashso_bucket_valid &&
so->hashso_bucket == bucket)
return true;
}
}
return false;
}
@ -63,38 +63,94 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
}
/*
* Advance to next page in a bucket, if any.
* Advance to the next page in a bucket, if any. If we are scanning the bucket
* being populated during a split operation, then after the last page of that
* bucket this function advances to the bucket being split.
*/
static void
_hash_readnext(Relation rel,
_hash_readnext(IndexScanDesc scan,
Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
{
BlockNumber blkno;
Relation rel = scan->indexRelation;
HashScanOpaque so = (HashScanOpaque) scan->opaque;
bool block_found = false;
blkno = (*opaquep)->hasho_nextblkno;
_hash_relbuf(rel, *bufp);
/*
* Retain the pin on the primary bucket page till the end of the scan. Refer
* to the comments in _hash_first for the reason we retain the pin.
*/
if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, *bufp);
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
if (BlockNumberIsValid(blkno))
{
*bufp = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
block_found = true;
}
else if (so->hashso_buc_populated && !so->hashso_buc_split)
{
/*
* end of bucket; scan the bucket being split if there was a split in
* progress at the start of the scan.
*/
*bufp = so->hashso_split_bucket_buf;
/*
* buffer for bucket being split must be valid as we acquire the pin
* on it before the start of scan and retain it till end of scan.
*/
Assert(BufferIsValid(*bufp));
_hash_chgbufaccess(rel, *bufp, HASH_NOLOCK, HASH_READ);
/*
* setting hashso_buc_split to true indicates that we are scanning
* bucket being split.
*/
so->hashso_buc_split = true;
block_found = true;
}
if (block_found)
{
*pagep = BufferGetPage(*bufp);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
}
}
/*
* Advance to previous page in a bucket, if any.
* Advance to the previous page in a bucket, if any. If the current scan
* started during a split operation, then after the first page of the bucket
* being split this function advances to the bucket being populated.
*/
static void
_hash_readprev(Relation rel,
_hash_readprev(IndexScanDesc scan,
Buffer *bufp, Page *pagep, HashPageOpaque *opaquep)
{
BlockNumber blkno;
Relation rel = scan->indexRelation;
HashScanOpaque so = (HashScanOpaque) scan->opaque;
blkno = (*opaquep)->hasho_prevblkno;
_hash_relbuf(rel, *bufp);
/*
* Retain the pin on the primary bucket page till the end of the scan. Refer
* to the comments in _hash_first for the reason we retain the pin.
*/
if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, *bufp);
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@ -104,6 +160,41 @@ _hash_readprev(Relation rel,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
*pagep = BufferGetPage(*bufp);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
/*
* We always maintain the pin on the bucket page for the whole scan
* operation, so release the additional pin we have acquired here.
*/
if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
_hash_dropbuf(rel, *bufp);
}
else if (so->hashso_buc_populated && so->hashso_buc_split)
{
/*
* end of bucket, scan bucket being populated if there was a split in
* progress at the start of scan.
*/
*bufp = so->hashso_bucket_buf;
/*
* buffer for bucket being populated must be valid as we acquire the
* pin on it before the start of scan and retain it till end of scan.
*/
Assert(BufferIsValid(*bufp));
_hash_chgbufaccess(rel, *bufp, HASH_NOLOCK, HASH_READ);
*pagep = BufferGetPage(*bufp);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
/* move to the end of bucket chain */
while (BlockNumberIsValid((*opaquep)->hasho_nextblkno))
_hash_readnext(scan, bufp, pagep, opaquep);
/*
* setting hashso_buc_split to false indicates that we are scanning
* bucket being populated.
*/
so->hashso_buc_split = false;
}
}
@ -218,9 +309,11 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
{
if (oldblkno == blkno)
break;
_hash_droplock(rel, oldblkno, HASH_SHARE);
_hash_relbuf(rel, buf);
}
_hash_getlock(rel, blkno, HASH_SHARE);
/* Fetch the primary bucket page for the bucket */
buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
/*
* Reacquire metapage lock and check that no bucket split has taken
@ -234,22 +327,73 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
/* done with the metapage */
_hash_dropbuf(rel, metabuf);
/* Update scan opaque state to show we have lock on the bucket */
so->hashso_bucket = bucket;
so->hashso_bucket_valid = true;
so->hashso_bucket_blkno = blkno;
/* Fetch the primary bucket page for the bucket */
buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
so->hashso_bucket_buf = buf;
/*
* If the bucket split is in progress, then while scanning the bucket
* being populated, we need to skip tuples that are moved from bucket
* being split. We need to maintain the pin on bucket being split to
* ensure that split-cleanup work done by vacuum doesn't remove tuples
* from it till this scan is done. We also need to maintain the pin on the
* bucket being populated to ensure that vacuum doesn't squeeze that
* bucket till this scan is complete; otherwise, the ordering of tuples
* can't be maintained during forward and backward scans. Here, we have
* to be cautious about locking order: first acquire the lock on the
* bucket being split, release that lock (but not the pin), then acquire
* the lock on the bucket being populated and re-verify whether the bucket
* split is still in progress. Acquiring the lock on the bucket being
* split first ensures that vacuum waits for this scan to finish.
*/
if (H_BUCKET_BEING_POPULATED(opaque))
{
BlockNumber old_blkno;
Buffer old_buf;
old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket);
/*
* release the lock on new bucket and re-acquire it after acquiring
* the lock on old bucket.
*/
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
/*
* remember the split bucket buffer so as to use it later for
* scanning.
*/
so->hashso_split_bucket_buf = old_buf;
_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
if (H_BUCKET_BEING_POPULATED(opaque))
so->hashso_buc_populated = true;
else
{
_hash_dropbuf(rel, so->hashso_split_bucket_buf);
so->hashso_split_bucket_buf = InvalidBuffer;
}
}
/* If a backwards scan is requested, move to the end of the chain */
if (ScanDirectionIsBackward(dir))
{
while (BlockNumberIsValid(opaque->hasho_nextblkno))
_hash_readnext(rel, &buf, &page, &opaque);
/*
* Backward scans that start during a split need to start from the end of
* the bucket being split.
*/
while (BlockNumberIsValid(opaque->hasho_nextblkno) ||
(so->hashso_buc_populated && !so->hashso_buc_split))
_hash_readnext(scan, &buf, &page, &opaque);
}
/* Now find the first tuple satisfying the qualification */
@ -273,6 +417,12 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
* false. Else, return true and set the hashso_curpos for the
* scan to the right thing.
*
* If the scan started while a split was in progress, we must skip the
* tuples that were moved by the split while scanning the bucket being
* populated, and then scan the bucket being split to cover all such
* tuples. This ensures that scans started during a split don't miss any
* tuples.
*
* 'bufP' points to the current buffer, which is pinned and read-locked.
* On success exit, we have pin and read-lock on whichever page
* contains the right item; on failure, we have released all buffers.
@ -338,6 +488,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum >= FirstOffsetNumber);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
/*
* skip tuples that were moved by a split operation if
* this scan started while the split was in
* progress
*/
if (so->hashso_buc_populated && !so->hashso_buc_split &&
(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
{
offnum = OffsetNumberNext(offnum); /* move forward */
continue;
}
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@ -345,7 +508,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
/*
* ran off the end of this page, try the next
*/
_hash_readnext(rel, &buf, &page, &opaque);
_hash_readnext(scan, &buf, &page, &opaque);
if (BufferIsValid(buf))
{
maxoff = PageGetMaxOffsetNumber(page);
@ -353,7 +516,6 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
/* end of bucket */
itup = NULL;
break; /* exit for-loop */
}
@ -379,6 +541,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum <= maxoff);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
/*
* skip tuples that were moved by a split operation if
* this scan started while the split was in
* progress
*/
if (so->hashso_buc_populated && !so->hashso_buc_split &&
(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
{
offnum = OffsetNumberPrev(offnum); /* move back */
continue;
}
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@ -386,7 +561,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
/*
* ran off the end of this page, try the next
*/
_hash_readprev(rel, &buf, &page, &opaque);
_hash_readprev(scan, &buf, &page, &opaque);
if (BufferIsValid(buf))
{
maxoff = PageGetMaxOffsetNumber(page);
@ -394,7 +569,6 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
/* end of bucket */
itup = NULL;
break; /* exit for-loop */
}
@ -410,9 +584,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
if (itup == NULL)
{
/* we ran off the end of the bucket without finding a match */
/*
* We ran off the end of the bucket without finding a match.
* Release the pins on the bucket buffers. Normally, such pins are
* released at the end of the scan; however, scrolling cursors can
* reacquire the bucket lock and pin in the same scan multiple
* times.
*/
*bufP = so->hashso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
_hash_dropscanbuf(rel, so);
return false;
}
@ -20,6 +20,8 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
/*
* _hash_checkqual -- does the index tuple satisfy the scan conditions?
@ -352,3 +354,95 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
/*
* _hash_get_oldblock_from_newbucket() -- get the block number of a bucket
* from which the current (new) bucket is being split.
*/
BlockNumber
_hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket)
{
Bucket old_bucket;
uint32 mask;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
/*
* To get the old bucket from the current bucket, we need a mask to modulo
* into the lower half of the table. This mask is stored in the meta page as
* hashm_lowmask, but we can't rely on that here, because we need the value
* of lowmask that was in effect when the bucket split started. Masking off
* the most significant bit of the new bucket gives us the old bucket.
*/
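/*
 * For example, if new_bucket is 5 (binary 101), mask is 3 (binary 011) and
 * old_bucket is 1.
 */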
mask = (((uint32) 1) << (fls(new_bucket) - 1)) - 1;
old_bucket = new_bucket & mask;
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
metap = HashPageGetMeta(BufferGetPage(metabuf));
blkno = BUCKET_TO_BLKNO(metap, old_bucket);
_hash_relbuf(rel, metabuf);
return blkno;
}
/*
* _hash_get_newblock_from_oldbucket() -- get the block number of a bucket
* that will be generated after split from old bucket.
*
* This is used to find the new bucket from the old bucket based on the
* current table half. It is mainly required to finish incomplete splits,
* where we are sure that no more than one split can be in progress from the
* old bucket.
*/
BlockNumber
_hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket)
{
Bucket new_bucket;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
metap = HashPageGetMeta(BufferGetPage(metabuf));
new_bucket = _hash_get_newbucket_from_oldbucket(rel, old_bucket,
metap->hashm_lowmask,
metap->hashm_maxbucket);
blkno = BUCKET_TO_BLKNO(metap, new_bucket);
_hash_relbuf(rel, metabuf);
return blkno;
}
/*
* _hash_get_newbucket_from_oldbucket() -- get the new bucket that will be
* generated after split from current (old) bucket.
*
* This is used to find the new bucket from the old bucket. The new bucket
* can be obtained by OR'ing the old bucket with the most significant bit of
* the current table half (the lowmask passed to this function identifies the
* msb of the current table half). There could be multiple buckets that have
* been split from the current bucket; we need the first such bucket that
* exists. The caller must ensure that no more than one split has happened
* from the old bucket.
*/
Bucket
_hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket)
{
Bucket new_bucket;
new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
if (new_bucket > maxbucket)
{
lowmask = lowmask >> 1;
new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
}
return new_bucket;
}
@ -668,9 +668,6 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintFileLeakWarning(res);
FileClose(res);
}
/* Clean up index scans too */
ReleaseResources_hash();
}
/* Let add-on modules get a chance too */
@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
/*
@ -32,6 +33,8 @@
*/
typedef uint32 Bucket;
#define InvalidBucket ((Bucket) 0xFFFFFFFF)
#define BUCKET_TO_BLKNO(metap,B) \
((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
@ -51,6 +54,9 @@ typedef uint32 Bucket;
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
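/*
 * LH_BUCKET_BEING_POPULATED is set on the new bucket while a split is
 * copying tuples into it; LH_BUCKET_BEING_SPLIT is set on the old bucket
 * for the duration of the same split; LH_BUCKET_NEEDS_SPLIT_CLEANUP marks
 * an old bucket whose already-copied tuples still need to be removed by a
 * later cleanup pass.
 */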
typedef struct HashPageOpaqueData
{
@ -63,6 +69,10 @@ typedef struct HashPageOpaqueData
typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEEDS_SPLIT_CLEANUP(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
#define H_BUCKET_BEING_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
#define H_BUCKET_BEING_POPULATED(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
* which otherwise would have a hard time telling pages of different index
@ -79,19 +89,6 @@ typedef struct HashScanOpaqueData
/* Hash value of the scan key, ie, the hash key we seek */
uint32 hashso_sk_hash;
/*
* By definition, a hash scan should be examining only one bucket. We
* record the bucket number here as soon as it is known.
*/
Bucket hashso_bucket;
bool hashso_bucket_valid;
/*
* If we have a share lock on the bucket, we record it here. When
* hashso_bucket_blkno is zero, we have no such lock.
*/
BlockNumber hashso_bucket_blkno;
/*
* We also want to remember which buffer we're currently examining in the
* scan. We keep the buffer pinned (but not locked) across hashgettuple
@ -100,11 +97,30 @@ typedef struct HashScanOpaqueData
*/
Buffer hashso_curbuf;
/* remember the buffer associated with primary bucket */
Buffer hashso_bucket_buf;
/*
* remember the buffer associated with primary bucket page of bucket being
* split. it is required during the scan of the bucket which is being
* populated during split operation.
*/
Buffer hashso_split_bucket_buf;
/* Current position of the scan, as an index TID */
ItemPointerData hashso_curpos;
/* Current position of the scan, as a heap TID */
ItemPointerData hashso_heappos;
/* Whether scan starts on bucket being populated due to split */
bool hashso_buc_populated;
/*
* Whether we are scanning the bucket being split. This field is consulted
* only when hashso_buc_populated is true.
*/
bool hashso_buc_split;
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@ -175,6 +191,8 @@ typedef HashMetaPageData *HashMetaPage;
sizeof(ItemIdData) - \
MAXALIGN(sizeof(HashPageOpaqueData)))
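/*
 * t_info bit set on index tuples that were copied to a new bucket by a
 * split; scans that started while the split was in progress skip such
 * tuples when reading the bucket being populated.
 */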
#define INDEX_MOVED_BY_SPLIT_MASK 0x2000
#define HASH_MIN_FILLFACTOR 10
#define HASH_DEFAULT_FILLFACTOR 75
@ -223,9 +241,6 @@ typedef HashMetaPageData *HashMetaPage;
#define HASH_WRITE BUFFER_LOCK_EXCLUSIVE
#define HASH_NOLOCK (-1)
#define HASH_SHARE ShareLock
#define HASH_EXCLUSIVE ExclusiveLock
/*
* Strategy number. There's only one valid strategy for hashing: equality.
*/
@ -297,21 +312,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
/* hashovfl.c */
extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
BufferAccessStrategy bstrategy);
extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
bool wbuf_dirty, BufferAccessStrategy bstrategy);
extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
BlockNumber blkno, ForkNumber forkNum);
extern void _hash_squeezebucket(Relation rel,
Bucket bucket, BlockNumber bucket_blkno,
Buffer bucket_buf,
BufferAccessStrategy bstrategy);
/* hashpage.c */
extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
int access, int flags);
extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
BlockNumber blkno, int flags);
extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
ForkNumber forkNum);
@ -320,6 +335,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
BufferAccessStrategy bstrategy);
extern void _hash_relbuf(Relation rel, Buffer buf);
extern void _hash_dropbuf(Relation rel, Buffer buf);
extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
extern void _hash_wrtbuf(Relation rel, Buffer buf);
extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
int to_access);
@ -327,12 +343,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
ForkNumber forkNum);
extern void _hash_pageinit(Page page, Size size);
extern void _hash_expandtable(Relation rel, Buffer metabuf);
/* hashscan.c */
extern void _hash_regscan(IndexScanDesc scan);
extern void _hash_dropscan(IndexScanDesc scan);
extern bool _hash_has_active_scan(Relation rel, Bucket bucket);
extern void ReleaseResources_hash(void);
extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
Bucket obucket, uint32 maxbucket, uint32 highmask,
uint32 lowmask);
/* hashsearch.c */
extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
@ -362,5 +375,18 @@ extern bool _hash_convert_tuple(Relation index,
Datum *index_values, bool *index_isnull);
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket);
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
Buffer bucket_buf, BlockNumber bucket_blkno,
BufferAccessStrategy bstrategy,
uint32 maxbucket, uint32 highmask, uint32 lowmask,
double *tuples_removed, double *num_index_tuples,
bool bucket_has_garbage,
IndexBulkDeleteCallback callback, void *callback_state);
#endif /* HASH_H */
@ -63,7 +63,7 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
* t_info manipulation macros
*/
#define INDEX_SIZE_MASK 0x1FFF
/* bit 0x2000 is not used at present */
/* bit 0x2000 is reserved for index-AM specific usage */
#define INDEX_VAR_MASK 0x4000
#define INDEX_NULL_MASK 0x8000