From afb9249d06f47d7a6d4a89fea0c3625fe43c5a5d Mon Sep 17 00:00:00 2001
From: Tom Lane
Date: Tue, 12 May 2015 14:10:10 -0400
Subject: [PATCH] Add support for doing late row locking in FDWs.

Previously, FDWs could only do "early row locking", that is lock a row as
soon as it's fetched, even though local restriction/join conditions might
discard the row later.  This patch adds callbacks that allow FDWs to do
late locking in the same way that it's done for regular tables.

To make use of this feature, an FDW must support the "ctid" column as a
unique row identifier.  Currently, since ctid has to be of type TID, the
feature is of limited use, though in principle it could be used by
postgres_fdw.  We may eventually allow FDWs to specify another data type
for ctid, which would make it possible for more FDWs to use this feature.

This commit does not modify postgres_fdw to use late locking.  We've
tested some prototype code for that, but it's not in committable shape,
and besides it's quite unclear whether it actually makes sense to do late
locking against a remote server.  The extra round trips required are
likely to outweigh any benefit from improved concurrency.
Etsuro Fujita, reviewed by Ashutosh Bapat, and hacked up a lot by me --- doc/src/sgml/fdwhandler.sgml | 232 +++++++++++++++++++++++-- src/backend/executor/execMain.c | 79 +++++++-- src/backend/executor/execUtils.c | 17 +- src/backend/executor/nodeLockRows.c | 133 +++++++++----- src/backend/executor/nodeModifyTable.c | 2 +- src/backend/optimizer/plan/planner.c | 8 +- src/include/executor/executor.h | 2 +- src/include/foreign/fdwapi.h | 12 ++ src/include/nodes/execnodes.h | 12 +- src/include/nodes/plannodes.h | 31 ++-- 10 files changed, 415 insertions(+), 113 deletions(-) diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml index 33863f04f8..236157743a 100644 --- a/doc/src/sgml/fdwhandler.sgml +++ b/doc/src/sgml/fdwhandler.sgml @@ -665,6 +665,108 @@ IsForeignRelUpdatable (Relation rel); + + FDW Routines For Row Locking + + + If an FDW wishes to support late row locking (as described + in ), it must provide the following + callback functions: + + + + +RowMarkType +GetForeignRowMarkType (RangeTblEntry *rte, + LockClauseStrength strength); + + + Report which row-marking option to use for a foreign table. + rte is the RangeTblEntry node for the table + and strength describes the lock strength requested by the + relevant FOR UPDATE/SHARE clause, if any. The result must be + a member of the RowMarkType enum type. + + + + This function is called during query planning for each foreign table that + appears in an UPDATE, DELETE, or SELECT + FOR UPDATE/SHARE query and is not the target of UPDATE + or DELETE. + + + + If the GetForeignRowMarkType pointer is set to + NULL, the ROW_MARK_COPY option is always used. + (This implies that RefetchForeignRow will never be called, + so it need not be provided either.) + + + + See for more information. + + + + +HeapTuple +RefetchForeignRow (EState *estate, + ExecRowMark *erm, + Datum rowid, + bool *updated); + + + Re-fetch one tuple from the foreign table, after locking it if required. 
+ estate is global execution state for the query. + erm is the ExecRowMark struct describing + the target foreign table and the row lock type (if any) to acquire. + rowid identifies the tuple to be fetched. + updated is an output parameter. + + + + This function should return a palloc'ed copy of the fetched tuple, + or NULL if the row lock couldn't be obtained. The row lock + type to acquire is defined by erm->markType, which is the + value previously returned by GetForeignRowMarkType. + (ROW_MARK_REFERENCE means to just re-fetch the tuple without + acquiring any lock, and ROW_MARK_COPY will never be seen by + this routine.) + + + + In addition, *updated should be set to true + if what was fetched was an updated version of the tuple rather than + the same version previously obtained. (If the FDW cannot be sure about + this, always returning true is recommended.) + + + + Note that by default, failure to acquire a row lock should result in + raising an error; a NULL return is only appropriate if + the SKIP LOCKED option is specified + by erm->waitPolicy. + + + + The rowid is the ctid value previously read + for the row to be re-fetched. Although the rowid value is + passed as a Datum, it can currently only be a tid. The + function API is chosen in hopes that it may be possible to allow other + datatypes for row IDs in future. + + + + If the RefetchForeignRow pointer is set to + NULL, attempts to re-fetch rows will fail + with an error message. + + + + See for more information. + + + + FDW Routines for <command>EXPLAIN</> @@ -1092,24 +1194,6 @@ GetForeignServerByName(const char *name, bool missing_ok); structures that copyObject knows how to copy. - - For an UPDATE or DELETE against an external data - source that supports concurrent updates, it is recommended that the - ForeignScan operation lock the rows that it fetches, perhaps - via the equivalent of SELECT FOR UPDATE. 
The FDW may also - choose to lock rows at fetch time when the foreign table is referenced - in a SELECT FOR UPDATE/SHARE; if it does not, the - FOR UPDATE or FOR SHARE option is essentially a - no-op so far as the foreign table is concerned. This behavior may yield - semantics slightly different from operations on local tables, where row - locking is customarily delayed as long as possible: remote rows may get - locked even though they subsequently fail locally-applied restriction or - join conditions. However, matching the local semantics exactly would - require an additional remote access for every row, and might be - impossible anyway depending on what locking semantics the external data - source provides. - - INSERT with an ON CONFLICT clause does not support specifying the conflict target, as remote constraints are not @@ -1117,6 +1201,118 @@ GetForeignServerByName(const char *name, bool missing_ok); UPDATE is not supported, since the specification is mandatory there. + + + + Row Locking in Foreign Data Wrappers + + + If an FDW's underlying storage mechanism has a concept of locking + individual rows to prevent concurrent updates of those rows, it is + usually worthwhile for the FDW to perform row-level locking with as + close an approximation as practical to the semantics used in + ordinary PostgreSQL tables. There are multiple + considerations involved in this. + + + + One key decision to be made is whether to perform early + locking or late locking. In early locking, a row is + locked when it is first retrieved from the underlying store, while in + late locking, the row is locked only when it is known that it needs to + be locked. (The difference arises because some rows may be discarded by + locally-checked restriction or join conditions.) Early locking is much + simpler and avoids extra round trips to a remote store, but it can cause + locking of rows that need not have been locked, resulting in reduced + concurrency or even unexpected deadlocks. 
Also, late locking is only + possible if the row to be locked can be uniquely re-identified later. + Preferably the row identifier should identify a specific version of the + row, as PostgreSQL TIDs do. + + + + By default, PostgreSQL ignores locking considerations + when interfacing to FDWs, but an FDW can perform early locking without + any explicit support from the core code. The API functions described + in , which were added + in PostgreSQL 9.5, allow an FDW to use late locking if + it wishes. + + + + An additional consideration is that in READ COMMITTED + isolation mode, PostgreSQL may need to re-check + restriction and join conditions against an updated version of some + target tuple. Rechecking join conditions requires re-obtaining copies + of the non-target rows that were previously joined to the target tuple. + When working with standard PostgreSQL tables, this is + done by including the TIDs of the non-target tables in the column list + projected through the join, and then re-fetching non-target rows when + required. This approach keeps the join data set compact, but it + requires inexpensive re-fetch capability, as well as a TID that can + uniquely identify the row version to be re-fetched. By default, + therefore, the approach used with foreign tables is to include a copy of + the entire row fetched from a foreign table in the column list projected + through the join. This puts no special demands on the FDW but can + result in reduced performance of merge and hash joins. An FDW that is + capable of meeting the re-fetch requirements can choose to do it the + first way. + + + + For an UPDATE or DELETE on a foreign table, it + is recommended that the ForeignScan operation on the target + table perform early locking on the rows that it fetches, perhaps via the + equivalent of SELECT FOR UPDATE. 
An FDW can detect whether + a table is an UPDATE/DELETE target at plan time + by comparing its relid to root->parse->resultRelation, + or at execution time by using ExecRelationIsTargetRelation(). + An alternative possibility is to perform late locking within the + ExecForeignUpdate or ExecForeignDelete + callback, but no special support is provided for this. + + + + For foreign tables that are specified to be locked by a SELECT + FOR UPDATE/SHARE command, the ForeignScan operation can + again perform early locking by fetching tuples with the equivalent + of SELECT FOR UPDATE/SHARE. To perform late locking + instead, provide the callback functions defined + in . + In GetForeignRowMarkType, select rowmark option + ROW_MARK_EXCLUSIVE, ROW_MARK_NOKEYEXCLUSIVE, + ROW_MARK_SHARE, or ROW_MARK_KEYSHARE depending + on the requested lock strength. (The core code will act the same + regardless of which of these four options you choose.) + Elsewhere, you can detect whether a foreign table was specified to be + locked by this type of command by using get_plan_rowmark at + plan time, or ExecFindRowMark at execution time; you must + check not only whether a non-null rowmark struct is returned, but that + its strength field is not LCS_NONE. + + + + Lastly, for foreign tables that are used in an UPDATE, + DELETE or SELECT FOR UPDATE/SHARE command but + are not specified to be row-locked, you can override the default choice + to copy entire rows by having GetForeignRowMarkType select + option ROW_MARK_REFERENCE when it sees lock strength + LCS_NONE. This will cause RefetchForeignRow to + be called with that value for markType; it should then + re-fetch the row without acquiring any new lock. (If you have + a GetForeignRowMarkType function but don't wish to re-fetch + unlocked rows, select option ROW_MARK_COPY + for LCS_NONE.) 
+ + + + See src/include/nodes/lockoptions.h, the comments + for RowMarkType and PlanRowMark + in src/include/nodes/plannodes.h, and the comments for + ExecRowMark in src/include/nodes/execnodes.h for + additional information. + + diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 0dee949178..43d3c44c82 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -898,8 +898,11 @@ InitPlan(QueryDesc *queryDesc, int eflags) erm->prti = rc->prti; erm->rowmarkId = rc->rowmarkId; erm->markType = rc->markType; + erm->strength = rc->strength; erm->waitPolicy = rc->waitPolicy; + erm->ermActive = false; ItemPointerSetInvalid(&(erm->curCtid)); + erm->ermExtra = NULL; estate->es_rowMarks = lappend(estate->es_rowMarks, erm); } @@ -1143,6 +1146,8 @@ CheckValidResultRel(Relation resultRel, CmdType operation) static void CheckValidRowMarkRel(Relation rel, RowMarkType markType) { + FdwRoutine *fdwroutine; + switch (rel->rd_rel->relkind) { case RELKIND_RELATION: @@ -1178,11 +1183,13 @@ CheckValidRowMarkRel(Relation rel, RowMarkType markType) RelationGetRelationName(rel)))); break; case RELKIND_FOREIGN_TABLE: - /* Should not get here; planner should have used ROW_MARK_COPY */ - ereport(ERROR, - (errcode(ERRCODE_WRONG_OBJECT_TYPE), - errmsg("cannot lock rows in foreign table \"%s\"", - RelationGetRelationName(rel)))); + /* Okay only if the FDW supports it */ + fdwroutine = GetFdwRoutineForRelation(rel, false); + if (fdwroutine->RefetchForeignRow == NULL) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot lock rows in foreign table \"%s\"", + RelationGetRelationName(rel)))); break; default: ereport(ERROR, @@ -2005,9 +2012,11 @@ ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo) /* * ExecFindRowMark -- find the ExecRowMark struct for given rangetable index + * + * If no such struct, either return NULL or throw error depending on missing_ok */ ExecRowMark * -ExecFindRowMark(EState *estate, Index 
rti) +ExecFindRowMark(EState *estate, Index rti, bool missing_ok) { ListCell *lc; @@ -2018,8 +2027,9 @@ ExecFindRowMark(EState *estate, Index rti) if (erm->rti == rti) return erm; } - elog(ERROR, "failed to find ExecRowMark for rangetable index %u", rti); - return NULL; /* keep compiler quiet */ + if (!missing_ok) + elog(ERROR, "failed to find ExecRowMark for rangetable index %u", rti); + return NULL; } /* @@ -2530,7 +2540,7 @@ EvalPlanQualFetchRowMarks(EPQState *epqstate) if (erm->markType == ROW_MARK_REFERENCE) { - Buffer buffer; + HeapTuple copyTuple; Assert(erm->relation != NULL); @@ -2541,17 +2551,50 @@ EvalPlanQualFetchRowMarks(EPQState *epqstate) /* non-locked rels could be on the inside of outer joins */ if (isNull) continue; - tuple.t_self = *((ItemPointer) DatumGetPointer(datum)); - /* okay, fetch the tuple */ - if (!heap_fetch(erm->relation, SnapshotAny, &tuple, &buffer, - false, NULL)) - elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck"); + /* fetch requests on foreign tables must be passed to their FDW */ + if (erm->relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE) + { + FdwRoutine *fdwroutine; + bool updated = false; - /* successful, copy and store tuple */ - EvalPlanQualSetTuple(epqstate, erm->rti, - heap_copytuple(&tuple)); - ReleaseBuffer(buffer); + fdwroutine = GetFdwRoutineForRelation(erm->relation, false); + /* this should have been checked already, but let's be safe */ + if (fdwroutine->RefetchForeignRow == NULL) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot lock rows in foreign table \"%s\"", + RelationGetRelationName(erm->relation)))); + copyTuple = fdwroutine->RefetchForeignRow(epqstate->estate, + erm, + datum, + &updated); + if (copyTuple == NULL) + elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck"); + + /* + * Ideally we'd insist on updated == false here, but that + * assumes that FDWs can track that exactly, which they might + * not be able to. So just ignore the flag. 
+ */ + } + else + { + /* ordinary table, fetch the tuple */ + Buffer buffer; + + tuple.t_self = *((ItemPointer) DatumGetPointer(datum)); + if (!heap_fetch(erm->relation, SnapshotAny, &tuple, &buffer, + false, NULL)) + elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck"); + + /* successful, copy tuple */ + copyTuple = heap_copytuple(&tuple); + ReleaseBuffer(buffer); + } + + /* store tuple */ + EvalPlanQualSetTuple(epqstate, erm->rti, copyTuple); } else { diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c index 88ba16bc6d..0da8e53e81 100644 --- a/src/backend/executor/execUtils.c +++ b/src/backend/executor/execUtils.c @@ -805,20 +805,11 @@ ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags) lockmode = NoLock; else { - ListCell *l; + /* Keep this check in sync with InitPlan! */ + ExecRowMark *erm = ExecFindRowMark(estate, scanrelid, true); - foreach(l, estate->es_rowMarks) - { - ExecRowMark *erm = lfirst(l); - - /* Keep this check in sync with InitPlan! 
*/ - if (erm->rti == scanrelid && - erm->relation != NULL) - { - lockmode = NoLock; - break; - } - } + if (erm != NULL && erm->relation != NULL) + lockmode = NoLock; } /* Open the relation and acquire lock as needed */ diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c index 5ae106c06a..7bcf99f488 100644 --- a/src/backend/executor/nodeLockRows.c +++ b/src/backend/executor/nodeLockRows.c @@ -25,6 +25,7 @@ #include "access/xact.h" #include "executor/executor.h" #include "executor/nodeLockRows.h" +#include "foreign/fdwapi.h" #include "storage/bufmgr.h" #include "utils/rel.h" #include "utils/tqual.h" @@ -40,7 +41,7 @@ ExecLockRows(LockRowsState *node) TupleTableSlot *slot; EState *estate; PlanState *outerPlan; - bool epq_started; + bool epq_needed; ListCell *lc; /* @@ -58,15 +59,18 @@ lnext: if (TupIsNull(slot)) return NULL; + /* We don't need EvalPlanQual unless we get updated tuple version(s) */ + epq_needed = false; + /* * Attempt to lock the source tuple(s). (Note we only have locking * rowmarks in lr_arowMarks.) 
*/ - epq_started = false; foreach(lc, node->lr_arowMarks) { ExecAuxRowMark *aerm = (ExecAuxRowMark *) lfirst(lc); ExecRowMark *erm = aerm->rowmark; + HeapTuple *testTuple; Datum datum; bool isNull; HeapTupleData tuple; @@ -77,8 +81,10 @@ lnext: HeapTuple copyTuple; /* clear any leftover test tuple for this rel */ - if (node->lr_epqstate.estate != NULL) - EvalPlanQualSetTuple(&node->lr_epqstate, erm->rti, NULL); + testTuple = &(node->lr_curtuples[erm->rti - 1]); + if (*testTuple != NULL) + heap_freetuple(*testTuple); + *testTuple = NULL; /* if child rel, must check whether it produced this row */ if (erm->rti != erm->prti) @@ -97,10 +103,12 @@ lnext: if (tableoid != erm->relid) { /* this child is inactive right now */ + erm->ermActive = false; ItemPointerSetInvalid(&(erm->curCtid)); continue; } } + erm->ermActive = true; /* fetch the tuple's ctid */ datum = ExecGetJunkAttribute(slot, @@ -109,9 +117,45 @@ lnext: /* shouldn't ever get a null result... */ if (isNull) elog(ERROR, "ctid is NULL"); - tuple.t_self = *((ItemPointer) DatumGetPointer(datum)); + + /* requests for foreign tables must be passed to their FDW */ + if (erm->relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE) + { + FdwRoutine *fdwroutine; + bool updated = false; + + fdwroutine = GetFdwRoutineForRelation(erm->relation, false); + /* this should have been checked already, but let's be safe */ + if (fdwroutine->RefetchForeignRow == NULL) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot lock rows in foreign table \"%s\"", + RelationGetRelationName(erm->relation)))); + copyTuple = fdwroutine->RefetchForeignRow(estate, + erm, + datum, + &updated); + if (copyTuple == NULL) + { + /* couldn't get the lock, so skip this row */ + goto lnext; + } + + /* save locked tuple for possible EvalPlanQual testing below */ + *testTuple = copyTuple; + + /* + * if FDW says tuple was updated before getting locked, we need to + * perform EPQ testing to see if quals are still satisfied + */ + if 
(updated) + epq_needed = true; + + continue; + } /* okay, try to lock the tuple */ + tuple.t_self = *((ItemPointer) DatumGetPointer(datum)); switch (erm->markType) { case ROW_MARK_EXCLUSIVE: @@ -191,40 +235,11 @@ lnext: /* remember the actually locked tuple's TID */ tuple.t_self = copyTuple->t_self; - /* - * Need to run a recheck subquery. Initialize EPQ state if we - * didn't do so already. - */ - if (!epq_started) - { - ListCell *lc2; + /* Save locked tuple for EvalPlanQual testing below */ + *testTuple = copyTuple; - EvalPlanQualBegin(&node->lr_epqstate, estate); - - /* - * Ensure that rels with already-visited rowmarks are told - * not to return tuples during the first EPQ test. We can - * exit this loop once it reaches the current rowmark; - * rels appearing later in the list will be set up - * correctly by the EvalPlanQualSetTuple call at the top - * of the loop. - */ - foreach(lc2, node->lr_arowMarks) - { - ExecAuxRowMark *aerm2 = (ExecAuxRowMark *) lfirst(lc2); - - if (lc2 == lc) - break; - EvalPlanQualSetTuple(&node->lr_epqstate, - aerm2->rowmark->rti, - NULL); - } - - epq_started = true; - } - - /* Store target tuple for relation's scan node */ - EvalPlanQualSetTuple(&node->lr_epqstate, erm->rti, copyTuple); + /* Remember we need to do EPQ testing */ + epq_needed = true; /* Continue loop until we have all target tuples */ break; @@ -237,17 +252,35 @@ lnext: test); } - /* Remember locked tuple's TID for WHERE CURRENT OF */ + /* Remember locked tuple's TID for EPQ testing and WHERE CURRENT OF */ erm->curCtid = tuple.t_self; } /* * If we need to do EvalPlanQual testing, do so. */ - if (epq_started) + if (epq_needed) { + int i; + + /* Initialize EPQ machinery */ + EvalPlanQualBegin(&node->lr_epqstate, estate); + /* - * First, fetch a copy of any rows that were successfully locked + * Transfer already-fetched tuples into the EPQ state, and make sure + * its test tuples for other tables are reset to NULL. 
+ */ + for (i = 0; i < node->lr_ntables; i++) + { + EvalPlanQualSetTuple(&node->lr_epqstate, + i + 1, + node->lr_curtuples[i]); + /* freeing this tuple is now the responsibility of EPQ */ + node->lr_curtuples[i] = NULL; + } + + /* + * Next, fetch a copy of any rows that were successfully locked * without any update having occurred. (We do this in a separate pass * so as to avoid overhead in the common case where there are no * concurrent updates.) @@ -260,7 +293,7 @@ lnext: Buffer buffer; /* ignore non-active child tables */ - if (!ItemPointerIsValid(&(erm->curCtid))) + if (!erm->ermActive) { Assert(erm->rti != erm->prti); /* check it's child table */ continue; @@ -269,6 +302,10 @@ lnext: if (EvalPlanQualGetTuple(&node->lr_epqstate, erm->rti) != NULL) continue; /* it was updated and fetched above */ + /* foreign tables should have been fetched above */ + Assert(erm->relation->rd_rel->relkind != RELKIND_FOREIGN_TABLE); + Assert(ItemPointerIsValid(&(erm->curCtid))); + /* okay, fetch the tuple */ tuple.t_self = erm->curCtid; if (!heap_fetch(erm->relation, SnapshotAny, &tuple, &buffer, @@ -351,6 +388,13 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags) ExecAssignResultTypeFromTL(&lrstate->ps); lrstate->ps.ps_ProjInfo = NULL; + /* + * Create workspace in which we can remember per-RTE locked tuples + */ + lrstate->lr_ntables = list_length(estate->es_range_table); + lrstate->lr_curtuples = (HeapTuple *) + palloc0(lrstate->lr_ntables * sizeof(HeapTuple)); + /* * Locate the ExecRowMark(s) that this node is responsible for, and * construct ExecAuxRowMarks for them. 
(InitPlan should already have @@ -370,8 +414,11 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags) if (rc->isParent) continue; + /* safety check on size of lr_curtuples array */ + Assert(rc->rti > 0 && rc->rti <= lrstate->lr_ntables); + /* find ExecRowMark and build ExecAuxRowMark */ - erm = ExecFindRowMark(estate, rc->rti); + erm = ExecFindRowMark(estate, rc->rti, false); aerm = ExecBuildAuxRowMark(erm, outerPlan->targetlist); /* diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 34435c7e50..aec4151094 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -1720,7 +1720,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) continue; /* find ExecRowMark (same for all subplans) */ - erm = ExecFindRowMark(estate, rc->rti); + erm = ExecFindRowMark(estate, rc->rti, false); /* build ExecAuxRowMark for each subplan */ for (i = 0; i < nplans; i++) diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index c80d45acaa..8de57c8e6b 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -20,6 +20,7 @@ #include "access/htup_details.h" #include "executor/executor.h" #include "executor/nodeAgg.h" +#include "foreign/fdwapi.h" #include "miscadmin.h" #include "nodes/makefuncs.h" #ifdef OPTIMIZER_DEBUG @@ -2324,7 +2325,12 @@ select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength) } else if (rte->relkind == RELKIND_FOREIGN_TABLE) { - /* For now, we force all foreign tables to use ROW_MARK_COPY */ + /* Let the FDW select the rowmark type, if it wants to */ + FdwRoutine *fdwroutine = GetFdwRoutineByRelId(rte->relid); + + if (fdwroutine->GetForeignRowMarkType != NULL) + return fdwroutine->GetForeignRowMarkType(rte, strength); + /* Otherwise, use ROW_MARK_COPY by default */ return ROW_MARK_COPY; } else diff --git a/src/include/executor/executor.h 
b/src/include/executor/executor.h index 6c64609197..e60ab9fd96 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -196,7 +196,7 @@ extern void ExecConstraints(ResultRelInfo *resultRelInfo, extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo, TupleTableSlot *slot, EState *estate); extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo); -extern ExecRowMark *ExecFindRowMark(EState *estate, Index rti); +extern ExecRowMark *ExecFindRowMark(EState *estate, Index rti, bool missing_ok); extern ExecAuxRowMark *ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist); extern TupleTableSlot *EvalPlanQual(EState *estate, EPQState *epqstate, Relation relation, Index rti, int lockmode, diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h index 511c96b093..69b48b4677 100644 --- a/src/include/foreign/fdwapi.h +++ b/src/include/foreign/fdwapi.h @@ -89,6 +89,14 @@ typedef void (*EndForeignModify_function) (EState *estate, typedef int (*IsForeignRelUpdatable_function) (Relation rel); +typedef RowMarkType (*GetForeignRowMarkType_function) (RangeTblEntry *rte, + LockClauseStrength strength); + +typedef HeapTuple (*RefetchForeignRow_function) (EState *estate, + ExecRowMark *erm, + Datum rowid, + bool *updated); + typedef void (*ExplainForeignScan_function) (ForeignScanState *node, struct ExplainState *es); @@ -151,6 +159,10 @@ typedef struct FdwRoutine EndForeignModify_function EndForeignModify; IsForeignRelUpdatable_function IsForeignRelUpdatable; + /* Functions for SELECT FOR UPDATE/SHARE row locking */ + GetForeignRowMarkType_function GetForeignRowMarkType; + RefetchForeignRow_function RefetchForeignRow; + /* Support functions for EXPLAIN */ ExplainForeignScan_function ExplainForeignScan; ExplainForeignModify_function ExplainForeignModify; diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 9de6d1484e..5ad2cc2358 100644 --- 
a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -429,8 +429,11 @@ typedef struct EState * parent RTEs, which can be ignored at runtime). Virtual relations such as * subqueries-in-FROM will have an ExecRowMark with relation == NULL. See * PlanRowMark for details about most of the fields. In addition to fields - * directly derived from PlanRowMark, we store curCtid, which is used by the - * WHERE CURRENT OF code. + * directly derived from PlanRowMark, we store an activity flag (to denote + * inactive children of inheritance trees), curCtid, which is used by the + * WHERE CURRENT OF code, and ermExtra, which is available for use by the plan + * node that sources the relation (e.g., for a foreign table the FDW can use + * ermExtra to hold information). * * EState->es_rowMarks is a list of these structs. */ @@ -442,8 +445,11 @@ typedef struct ExecRowMark Index prti; /* parent range table index, if child */ Index rowmarkId; /* unique identifier for resjunk columns */ RowMarkType markType; /* see enum in nodes/plannodes.h */ + LockClauseStrength strength; /* LockingClause's strength, or LCS_NONE */ LockWaitPolicy waitPolicy; /* NOWAIT and SKIP LOCKED */ + bool ermActive; /* is this mark relevant for current tuple? 
*/ ItemPointerData curCtid; /* ctid of currently locked tuple, if any */ + void *ermExtra; /* available for use by relation source node */ } ExecRowMark; /* @@ -1921,6 +1927,8 @@ typedef struct LockRowsState PlanState ps; /* its first field is NodeTag */ List *lr_arowMarks; /* List of ExecAuxRowMarks */ EPQState lr_epqstate; /* for evaluating EvalPlanQual rechecks */ + HeapTuple *lr_curtuples; /* locked tuples (one entry per RT entry) */ + int lr_ntables; /* length of lr_curtuples[] array */ } LockRowsState; /* ---------------- diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index 9313292222..1494b336c2 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -822,16 +822,16 @@ typedef struct Limit * * The first four of these values represent different lock strengths that * we can take on tuples according to SELECT FOR [KEY] UPDATE/SHARE requests. - * We only support these on regular tables. For foreign tables, any locking - * that might be done for these requests must happen during the initial row - * fetch; there is no mechanism for going back to lock a row later (and thus - * no need for EvalPlanQual machinery during updates of foreign tables). + * We support these on regular tables, as well as on foreign tables whose FDWs + * report support for late locking. For other foreign tables, any locking + * that might be done for such requests must happen during the initial row + * fetch; their FDWs provide no mechanism for going back to lock a row later. * This means that the semantics will be a bit different than for a local * table; in particular we are likely to lock more rows than would be locked * locally, since remote rows will be locked even if they then fail - * locally-checked restriction or join quals. However, the alternative of - * doing a separate remote query to lock each selected row is extremely - * unappealing, so let's do it like this for now. + * locally-checked restriction or join quals. 
However, the prospect of + * doing a separate remote query to lock each selected row is usually pretty + * unappealing, so early locking remains a credible design choice for FDWs. * * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we have to uniquely * identify all the source rows, not only those from the target relations, so @@ -840,12 +840,11 @@ typedef struct Limit * represented by ROW_MARK_REFERENCE. Otherwise (for example for VALUES or * FUNCTION scans) we have to copy the whole row value. ROW_MARK_COPY is * pretty inefficient, since most of the time we'll never need the data; but - * fortunately the case is not performance-critical in practice. Note that - * we use ROW_MARK_COPY for non-target foreign tables, even if the FDW has a - * concept of rowid and so could theoretically support some form of - * ROW_MARK_REFERENCE. Although copying the whole row value is inefficient, - * it's probably still faster than doing a second remote fetch, so it doesn't - * seem worth the extra complexity to permit ROW_MARK_REFERENCE. + * fortunately the overhead is usually not performance-critical in practice. + * By default we use ROW_MARK_COPY for foreign tables, but if the FDW has + * a concept of rowid it can request to use ROW_MARK_REFERENCE instead. + * (Again, this probably doesn't make sense if a physical remote fetch is + * needed, but for FDWs that map to local storage it might be credible.) */ typedef enum RowMarkType { @@ -866,7 +865,7 @@ typedef enum RowMarkType * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we create a separate * PlanRowMark node for each non-target relation in the query. Relations that * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE (if - * regular tables) or ROW_MARK_COPY (if not). + * regular tables or supported foreign tables) or ROW_MARK_COPY (if not). * * Initially all PlanRowMarks have rti == prti and isParent == false. 
* When the planner discovers that a relation is the root of an inheritance @@ -879,8 +878,8 @@ typedef enum RowMarkType * to use different markTypes). * * The planner also adds resjunk output columns to the plan that carry - * information sufficient to identify the locked or fetched rows. For - * regular tables (markType != ROW_MARK_COPY), these columns are named + * information sufficient to identify the locked or fetched rows. When + * markType != ROW_MARK_COPY, these columns are named * tableoid%u OID of table * ctid%u TID of row * The tableoid column is only present for an inheritance hierarchy.