Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
src/backend/storage/lmgr/README-SSI
|
|
|
|
|
|
|
|
Serializable Snapshot Isolation (SSI) and Predicate Locking
|
|
|
|
===========================================================
|
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
This code is in the lmgr directory because about 90% of it is an
|
|
|
|
implementation of predicate locking, which is required for SSI,
|
|
|
|
rather than being directly related to SSI itself. When another use
|
|
|
|
for predicate locking justifies the effort to tease these two things
|
|
|
|
apart, this README file should probably be split.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
|
|
|
|
|
|
|
Credits
|
|
|
|
-------
|
|
|
|
|
|
|
|
This feature was developed by Kevin Grittner and Dan R. K. Ports,
|
|
|
|
with review and suggestions from Joe Conway, Heikki Linnakangas, and
|
|
|
|
Jeff Davis. It is based on work published in these papers:
|
|
|
|
|
|
|
|
Michael J. Cahill, Uwe Röhm, and Alan D. Fekete. 2008.
|
|
|
|
Serializable isolation for snapshot databases.
|
2011-05-31 22:10:05 +02:00
|
|
|
In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
international conference on Management of data,
|
2011-05-31 22:10:05 +02:00
|
|
|
pages 729-738, New York, NY, USA. ACM.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
http://doi.acm.org/10.1145/1376616.1376690
|
|
|
|
|
|
|
|
Michael James Cahill. 2009.
|
|
|
|
Serializable Isolation for Snapshot Databases.
|
|
|
|
Sydney Digital Theses.
|
|
|
|
University of Sydney, School of Information Technologies.
|
|
|
|
http://hdl.handle.net/2123/5353
|
|
|
|
|
|
|
|
|
|
|
|
Overview
|
|
|
|
--------
|
|
|
|
|
|
|
|
With true serializable transactions, if you can show that your
|
|
|
|
transaction will do the right thing if there are no concurrent
|
|
|
|
transactions, it will do the right thing in any mix of serializable
|
|
|
|
transactions or be rolled back with a serialization failure. This
|
|
|
|
feature has been implemented in PostgreSQL using SSI.
|
|
|
|
|
|
|
|
|
|
|
|
Serializable and Snapshot Transaction Isolation Levels
|
|
|
|
------------------------------------------------------
|
|
|
|
|
|
|
|
Serializable transaction isolation is attractive for shops with
|
|
|
|
active development by many programmers against a complex schema
|
|
|
|
because it guarantees data integrity with very little staff time --
|
|
|
|
if a transaction can be shown to always do the right thing when it is
|
|
|
|
run alone (before or after any other transaction), it will always do
|
|
|
|
the right thing in any mix of concurrent serializable transactions.
|
|
|
|
Where conflicts with other transactions would result in an
|
2011-06-16 20:12:56 +02:00
|
|
|
inconsistent state within the database or an inconsistent view of
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
the data, a serializable transaction will block or roll back to
|
|
|
|
prevent the anomaly. The SQL standard provides a specific SQLSTATE
|
|
|
|
for errors generated when a transaction rolls back for this reason,
|
|
|
|
so that transactions can be retried automatically.
|
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
Before version 9.1, PostgreSQL did not support a full serializable
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
isolation level. A request for serializable transaction isolation
|
|
|
|
actually provided snapshot isolation. This has well known anomalies
|
|
|
|
which can allow data corruption or inconsistent views of the data
|
|
|
|
during concurrent transactions; although these anomalies only occur
|
|
|
|
when certain patterns of read-write dependencies exist within a set
|
|
|
|
of concurrent transactions. Where these patterns exist, the anomalies
|
|
|
|
can be prevented by introducing conflicts through explicitly
|
|
|
|
programmed locks or otherwise unnecessary writes to the database.
|
|
|
|
Snapshot isolation is popular because performance is better than
|
|
|
|
serializable isolation and the integrity guarantees which it does
|
|
|
|
provide allow anomalies to be avoided or managed with reasonable
|
|
|
|
effort in many environments.
|
|
|
|
|
|
|
|
|
|
|
|
Serializable Isolation Implementation Strategies
|
|
|
|
------------------------------------------------
|
|
|
|
|
|
|
|
Techniques for implementing full serializable isolation have been
|
|
|
|
published and in use in many database products for decades. The
|
2011-06-16 20:12:56 +02:00
|
|
|
primary technique which has been used is Strict Two-Phase Locking
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
(S2PL), which operates by blocking writes against data which has been
|
|
|
|
read by concurrent transactions and blocking any access (read or
|
|
|
|
write) against data which has been written by concurrent
|
|
|
|
transactions. A cycle in a graph of blocking indicates a deadlock,
|
|
|
|
requiring a rollback. Blocking and deadlocks under S2PL in high
|
|
|
|
contention workloads can be debilitating, crippling throughput and
|
|
|
|
response time.
|
|
|
|
|
|
|
|
A new technique for implementing full serializable isolation in an
|
|
|
|
MVCC database appears in the literature beginning in 2008. This
|
|
|
|
technique, known as Serializable Snapshot Isolation (SSI) has many of
|
|
|
|
the advantages of snapshot isolation. In particular, reads don't
|
|
|
|
block anything and writes don't block reads. Essentially, it runs
|
|
|
|
snapshot isolation but monitors the read-write conflicts between
|
|
|
|
transactions to identify dangerous structures in the transaction
|
|
|
|
graph which indicate that a set of concurrent transactions might
|
|
|
|
produce an anomaly, and rolls back transactions to ensure that no
|
|
|
|
anomalies occur. It will produce some false positives (where a
|
|
|
|
transaction is rolled back even though there would not have been an
|
|
|
|
anomaly), but will never let an anomaly occur. In the two known
|
|
|
|
prototype implementations, performance for many workloads (even with
|
|
|
|
the need to restart transactions which are rolled back) is very close
|
|
|
|
to snapshot isolation and generally far better than an S2PL
|
|
|
|
implementation.
|
|
|
|
|
|
|
|
|
|
|
|
Apparent Serial Order of Execution
|
|
|
|
----------------------------------
|
|
|
|
|
|
|
|
One way to understand when snapshot anomalies can occur, and to
|
|
|
|
visualize the difference between the serializable implementations
|
|
|
|
described above, is to consider that among transactions executing at
|
|
|
|
the serializable transaction isolation level, the results are
|
|
|
|
required to be consistent with some serial (one-at-a-time) execution
|
2011-06-16 20:12:56 +02:00
|
|
|
of the transactions [1]. How is that order determined in each?
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
In S2PL, each transaction locks any data it accesses. It holds the
|
|
|
|
locks until committing, preventing other transactions from making
|
|
|
|
conflicting accesses to the same data in the interim. Some
|
|
|
|
transactions may have to be rolled back to prevent deadlock. But
|
|
|
|
successful transactions can always be viewed as having occurred
|
|
|
|
sequentially, in the order they committed.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
|
|
|
With snapshot isolation, reads never block writes, nor vice versa, so
|
2011-06-16 20:12:56 +02:00
|
|
|
more concurrency is possible. The order in which transactions appear
|
|
|
|
to have executed is determined by something more subtle than in S2PL:
|
|
|
|
read/write dependencies. If a transaction reads data, it appears to
|
|
|
|
execute after the transaction that wrote the data it is reading.
|
|
|
|
Similarly, if it updates data, it appears to execute after the
|
|
|
|
transaction that wrote the previous version. These dependencies, which
|
|
|
|
we call "wr-dependencies" and "ww-dependencies", are consistent with
|
|
|
|
the commit order, because the first transaction must have committed
|
|
|
|
before the second starts. However, there can also be dependencies
|
|
|
|
between two *concurrent* transactions, i.e. where one was running when
|
|
|
|
the other acquired its snapshot. These "rw-conflicts" occur when one
|
|
|
|
transaction attempts to read data which is not visible to it because
|
|
|
|
the transaction which wrote it (or will later write it) is
|
|
|
|
concurrent. The reading transaction appears to have executed first,
|
|
|
|
regardless of the actual sequence of transaction starts or commits,
|
|
|
|
because it sees a database state prior to that in which the other
|
|
|
|
transaction leaves it.
|
|
|
|
|
|
|
|
Anomalies occur when a cycle is created in the graph of dependencies:
|
|
|
|
when a dependency or series of dependencies causes transaction A to
|
|
|
|
appear to have executed before transaction B, but another series of
|
|
|
|
dependencies causes B to appear before A. If that's the case, then
|
|
|
|
the results can't be consistent with any serial execution of the
|
|
|
|
transactions.
|
|
|
|
|
|
|
|
|
|
|
|
SSI Algorithm
|
|
|
|
-------------
|
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
As of 9.1, serializable transactions in PostgreSQL are implemented using
|
2011-06-16 20:12:56 +02:00
|
|
|
Serializable Snapshot Isolation (SSI), based on the work of Cahill
|
|
|
|
et al. Fundamentally, this allows snapshot isolation to run as it
|
2011-06-22 00:00:09 +02:00
|
|
|
previously did, while monitoring for conditions which could create a
|
|
|
|
serialization anomaly.
|
2011-06-16 20:12:56 +02:00
|
|
|
|
|
|
|
SSI is based on the observation [2] that each snapshot isolation
|
|
|
|
anomaly corresponds to a cycle that contains a "dangerous structure"
|
|
|
|
of two adjacent rw-conflict edges:
|
|
|
|
|
|
|
|
Tin ------> Tpivot ------> Tout
|
|
|
|
rw rw
|
|
|
|
|
|
|
|
SSI works by watching for this dangerous structure, and rolling
|
|
|
|
back a transaction when needed to prevent any anomaly. This means it
|
|
|
|
only needs to track rw-conflicts between concurrent transactions, not
|
|
|
|
wr- and ww-dependencies. It also means there is a risk of false
|
2011-06-22 00:00:09 +02:00
|
|
|
positives, because not every dangerous structure is embedded in an
|
|
|
|
actual cycle. The number of false positives is low in practice, so
|
|
|
|
this represents an acceptable tradeoff for keeping the detection
|
|
|
|
overhead low.
|
2011-06-16 20:12:56 +02:00
|
|
|
|
|
|
|
The PostgreSQL implementation uses two additional optimizations:
|
|
|
|
|
|
|
|
* Tout must commit before any other transaction in the cycle
|
|
|
|
(see proof of Theorem 2.1 of [2]). We only roll back a transaction
|
|
|
|
if Tout commits before Tpivot and Tin.
|
|
|
|
|
|
|
|
* if Tin is read-only, there can only be an anomaly if Tout committed
|
|
|
|
before Tin takes its snapshot. This optimization is an original
|
|
|
|
one. Proof:
|
|
|
|
|
|
|
|
- Because there is a cycle, there must be some transaction T0 that
|
2011-06-22 00:00:09 +02:00
|
|
|
precedes Tin in the cycle. (T0 might be the same as Tout.)
|
2011-06-16 20:12:56 +02:00
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
- The edge between T0 and Tin can't be a rw-conflict or ww-dependency,
|
2011-06-16 20:12:56 +02:00
|
|
|
because Tin was read-only, so it must be a wr-dependency.
|
2011-06-22 00:00:09 +02:00
|
|
|
Those can only occur if T0 committed before Tin took its snapshot,
|
|
|
|
else Tin would have ignored T0's output.
|
2011-06-16 20:12:56 +02:00
|
|
|
|
|
|
|
- Because Tout must commit before any other transaction in the
|
|
|
|
cycle, it must commit before T0 commits -- and thus before Tin
|
|
|
|
starts.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
|
|
|
|
|
|
|
PostgreSQL Implementation
|
|
|
|
-------------------------
|
|
|
|
|
|
|
|
* Since this technique is based on Snapshot Isolation (SI), those
|
|
|
|
areas in PostgreSQL which don't use SI can't be brought under SSI.
|
|
|
|
This includes system tables, temporary tables, sequences, hint bit
|
|
|
|
rewrites, etc. SSI can not eliminate existing anomalies in these
|
|
|
|
areas.
|
|
|
|
|
|
|
|
* Any transaction which is run at a transaction isolation level
|
|
|
|
other than SERIALIZABLE will not be affected by SSI. If you want to
|
|
|
|
enforce business rules through SSI, all transactions should be run at
|
|
|
|
the SERIALIZABLE transaction isolation level, and that should
|
|
|
|
probably be set as the default.
|
|
|
|
|
|
|
|
* If all transactions are run at the SERIALIZABLE transaction
|
|
|
|
isolation level, business rules can be enforced in triggers or
|
|
|
|
application code without ever having a need to acquire an explicit
|
|
|
|
lock or to use SELECT FOR SHARE or SELECT FOR UPDATE.
|
|
|
|
|
|
|
|
* Those who want to continue to use snapshot isolation without
|
|
|
|
the additional protections of SSI (and the associated costs of
|
|
|
|
enforcing those protections), can use the REPEATABLE READ transaction
|
2011-06-16 20:12:56 +02:00
|
|
|
isolation level. This level retains its legacy behavior, which
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
is identical to the old SERIALIZABLE implementation and fully
|
|
|
|
consistent with the standard's requirements for the REPEATABLE READ
|
|
|
|
transaction isolation level.
|
|
|
|
|
|
|
|
* Performance under this SSI implementation will be significantly
|
|
|
|
improved if transactions which don't modify permanent tables are
|
|
|
|
declared to be READ ONLY before they begin reading data.
|
|
|
|
|
|
|
|
* Performance under SSI will tend to degrade more rapidly with a
|
|
|
|
large number of active database transactions than under less strict
|
|
|
|
isolation levels. Limiting the number of active transactions through
|
|
|
|
use of a connection pool or similar techniques may be necessary to
|
|
|
|
maintain good performance.
|
|
|
|
|
|
|
|
* Any transaction which must be rolled back to prevent
|
|
|
|
serialization anomalies will fail with SQLSTATE 40001, which has a
|
|
|
|
standard meaning of "serialization failure".
|
|
|
|
|
|
|
|
* This SSI implementation makes an effort to choose the
|
2011-06-29 08:26:14 +02:00
|
|
|
transaction to be canceled such that an immediate retry of the
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
transaction will not fail due to conflicts with exactly the same
|
2011-06-29 08:26:14 +02:00
|
|
|
transactions. Pursuant to this goal, no transaction is canceled
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
until one of the other transactions in the set of conflicts which
|
|
|
|
could generate an anomaly has successfully committed. This is
|
|
|
|
conceptually similar to how write conflicts are handled. To fully
|
|
|
|
implement this guarantee there needs to be a way to roll back the
|
|
|
|
active transaction for another process with a serialization failure
|
|
|
|
SQLSTATE, even if it is "idle in transaction".
|
|
|
|
|
|
|
|
|
|
|
|
Predicate Locking
|
|
|
|
-----------------
|
|
|
|
|
|
|
|
Both S2PL and SSI require some form of predicate locking to handle
|
|
|
|
situations where reads conflict with later inserts or with later
|
|
|
|
updates which move data into the selected range. PostgreSQL didn't
|
|
|
|
already have predicate locking, so it needed to be added to support
|
|
|
|
full serializable transactions under either strategy. Practical
|
|
|
|
implementations of predicate locking generally involve acquiring
|
|
|
|
locks against data as it is accessed, using multiple granularities
|
|
|
|
(tuple, page, table, etc.) with escalation as needed to keep the lock
|
2011-06-22 00:00:09 +02:00
|
|
|
count to a number which can be tracked within RAM structures. This
|
|
|
|
approach was used in PostgreSQL. Coarse granularities can cause some
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
false positive indications of conflict. The number of false positives
|
|
|
|
can be influenced by plan choice.
|
|
|
|
|
|
|
|
|
|
|
|
Implementation overview
|
|
|
|
-----------------------
|
|
|
|
|
|
|
|
New RAM structures, inspired by those used to track traditional locks
|
|
|
|
in PostgreSQL, but tailored to the needs of SIREAD predicate locking,
|
|
|
|
are used. These refer to physical objects actually accessed in the
|
|
|
|
course of executing the query, to model the predicates through
|
|
|
|
inference. Anyone interested in this subject should review the
|
2011-06-16 20:12:56 +02:00
|
|
|
Hellerstein, Stonebraker and Hamilton paper [3], along with the
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
locking papers referenced from that and the Cahill papers.
|
|
|
|
|
|
|
|
Because the SIREAD locks don't block, traditional locking techniques
|
2011-06-22 00:00:09 +02:00
|
|
|
have to be modified. Intent locking (locking higher level objects
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
before locking lower level objects) doesn't work with non-blocking
|
|
|
|
"locks" (which are, in some respects, more like flags than locks).
|
|
|
|
|
|
|
|
A configurable amount of shared memory is reserved at postmaster
|
|
|
|
start-up to track predicate locks. This size cannot be changed
|
|
|
|
without a restart.
|
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
To prevent resource exhaustion, multiple fine-grained locks may
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
be promoted to a single coarser-grained lock as needed.
|
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
An attempt to acquire an SIREAD lock on a tuple when the same
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
transaction already holds an SIREAD lock on the page or the relation
|
|
|
|
will be ignored. Likewise, an attempt to lock a page when the
|
|
|
|
relation is locked will be ignored, and the acquisition of a coarser
|
|
|
|
lock will result in the automatic release of all finer-grained locks
|
|
|
|
it covers.
|
|
|
|
|
|
|
|
|
|
|
|
Heap locking
|
|
|
|
------------
|
|
|
|
|
|
|
|
Predicate locks will be acquired for the heap based on the following:
|
|
|
|
|
|
|
|
* For a table scan, the entire relation will be locked.
|
|
|
|
|
|
|
|
* Each tuple read which is visible to the reading transaction
|
|
|
|
will be locked, whether or not it meets selection criteria; except
|
|
|
|
that there is no need to acquire an SIREAD lock on a tuple when the
|
|
|
|
transaction already holds a write lock on any tuple representing the
|
2011-06-22 00:00:09 +02:00
|
|
|
row, since a rw-conflict would also create a ww-dependency which
|
|
|
|
has more aggressive enforcement and thus will prevent any anomaly.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
* Modifying a heap tuple creates a rw-conflict with any transaction
|
|
|
|
that holds a SIREAD lock on that tuple, or on the page or relation
|
|
|
|
that contains it.
|
|
|
|
|
|
|
|
* Inserting a new tuple creates a rw-conflict with any transaction
|
|
|
|
holding a SIREAD lock on the entire relation. It doesn't conflict with
|
|
|
|
page-level locks, because page-level locks are only used to aggregate
|
|
|
|
tuple locks. Unlike index page locks, they don't lock "gaps" on the page.
|
|
|
|
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
|
|
|
Index AM implementations
|
|
|
|
------------------------
|
|
|
|
|
|
|
|
Since predicate locks only exist to detect writes which conflict with
|
|
|
|
earlier reads, and heap tuple locks are acquired to cover all heap
|
|
|
|
tuples actually read, including those read through indexes, the index
|
|
|
|
tuples which were actually scanned are not of interest in themselves;
|
|
|
|
we only care about their "new neighbors" -- later inserts into the
|
|
|
|
index which would have been included in the scan had they existed at
|
|
|
|
the time. Conceptually, we want to lock the gaps between and
|
|
|
|
surrounding index entries within the scanned range.
|
|
|
|
|
|
|
|
Correctness requires that any insert into an index generates a
|
|
|
|
rw-conflict with a concurrent serializable transaction if, after that
|
|
|
|
insert, re-execution of any index scan of the other transaction would
|
|
|
|
access the heap for a row not accessed during the previous execution.
|
|
|
|
Note that a non-HOT update which expires an old index entry covered
|
|
|
|
by the scan and adds a new entry for the modified row's new tuple
|
|
|
|
need not generate a conflict, although an update which "moves" a row
|
|
|
|
into the scan must generate a conflict. While correctness allows
|
|
|
|
false positives, they should be minimized for performance reasons.
|
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
Several optimizations are possible, though not all are implemented yet:
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
|
|
|
* An index scan which is just finding the right position for an
|
2011-06-22 00:00:09 +02:00
|
|
|
index insertion or deletion need not acquire a predicate lock.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
|
|
|
* An index scan which is comparing for equality on the entire key
|
2011-06-22 00:00:09 +02:00
|
|
|
for a unique index need not acquire a predicate lock as long as a key
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
is found corresponding to a visible tuple which has not been modified
|
|
|
|
by another transaction -- there are no "between or around" gaps to
|
|
|
|
cover.
|
|
|
|
|
|
|
|
* As long as built-in foreign key enforcement continues to use
|
|
|
|
its current "special tricks" to deal with MVCC issues, predicate
|
|
|
|
locks should not be needed for scans done by enforcement code.
|
|
|
|
|
|
|
|
* If a search determines that no rows can be found regardless of
|
|
|
|
index contents because the search conditions are contradictory (e.g.,
|
|
|
|
x = 1 AND x = 2), then no predicate lock is needed.
|
|
|
|
|
|
|
|
Other index AM implementation considerations:
|
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
* For an index AM that doesn't have support for predicate locking,
|
|
|
|
we just acquire a predicate lock on the whole index for any search.
|
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
* B-tree index searches acquire predicate locks only on the
|
|
|
|
index *leaf* pages needed to lock the appropriate index range. If,
|
|
|
|
however, a search discovers that no root page has yet been created, a
|
|
|
|
predicate lock on the index relation is required.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
Re-think predicate locking on GIN indexes.
The principle behind the locking was not very well thought-out, and not
documented. Add a section in the README to explain how it's supposed to
work, and change the code so that it actually works that way.
This fixes two bugs:
1. If fast update was turned on concurrently, subsequent inserts to the
pending list would not conflict with predicate locks that were acquired
earlier, on entry pages. The included 'predicate-gin-fastupdate' test
demonstrates that. To fix, make all scans acquire a predicate lock on
the metapage. That lock represents a scan of the pending list, whether
or not there is a pending list at the moment. Forget about the
optimization to skip locking/checking for locks, when fastupdate=off.
2. If a scan finds no match, it still needs to lock the entry page. The
point of predicate locks is to lock the gabs between values, whether
or not there is a match. The included 'predicate-gin-nomatch' test
tests that case.
In addition to those two bug fixes, this removes some unnecessary locking,
following the principle laid out in the README. Because all items in
a posting tree have the same key value, a lock on the posting tree root is
enough to cover all the items. (With a very large posting tree, it would
possibly be better to lock the posting tree leaf pages instead, so that a
"skip scan" with a query like "A & B", you could avoid unnecessary conflict
if a new tuple is inserted with A but !B. But let's keep this simple.)
Also, some spelling fixes.
Author: Heikki Linnakangas with some editorization by me
Review: Andrey Borodin, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/0b3ad2c2-2692-62a9-3a04-5724f2af9114@iki.fi
2018-05-04 10:27:50 +02:00
|
|
|
* Like a B-tree, GIN searches acquire predicate locks only on the
|
|
|
|
leaf pages of entry tree. When performing an equality scan, and an
|
|
|
|
entry has a posting tree, the posting tree root is locked instead, to
|
|
|
|
lock only that key value. However, fastupdate=on postpones the
|
|
|
|
insertion of tuples into index structure by temporarily storing them
|
|
|
|
into pending list. That makes us unable to detect r-w conflicts using
|
|
|
|
page-level locks. To cope with that, insertions to the pending list
|
|
|
|
conflict with all scans.
|
|
|
|
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
* GiST searches can determine that there are no matches at any
|
2018-03-27 14:43:19 +02:00
|
|
|
level of the index, so we acquire predicate lock at each index
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
level during a GiST search. An index insert at the leaf level can
|
|
|
|
then be trusted to ripple up to all levels and locations where
|
2018-03-27 14:43:19 +02:00
|
|
|
conflicting predicate locks may exist. In case there is a page split,
|
Re-think predicate locking on GIN indexes.
The principle behind the locking was not very well thought-out, and not
documented. Add a section in the README to explain how it's supposed to
work, and change the code so that it actually works that way.
This fixes two bugs:
1. If fast update was turned on concurrently, subsequent inserts to the
pending list would not conflict with predicate locks that were acquired
earlier, on entry pages. The included 'predicate-gin-fastupdate' test
demonstrates that. To fix, make all scans acquire a predicate lock on
the metapage. That lock represents a scan of the pending list, whether
or not there is a pending list at the moment. Forget about the
optimization to skip locking/checking for locks, when fastupdate=off.
2. If a scan finds no match, it still needs to lock the entry page. The
point of predicate locks is to lock the gabs between values, whether
or not there is a match. The included 'predicate-gin-nomatch' test
tests that case.
In addition to those two bug fixes, this removes some unnecessary locking,
following the principle laid out in the README. Because all items in
a posting tree have the same key value, a lock on the posting tree root is
enough to cover all the items. (With a very large posting tree, it would
possibly be better to lock the posting tree leaf pages instead, so that a
"skip scan" with a query like "A & B", you could avoid unnecessary conflict
if a new tuple is inserted with A but !B. But let's keep this simple.)
Also, some spelling fixes.
Author: Heikki Linnakangas with some editorization by me
Review: Andrey Borodin, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/0b3ad2c2-2692-62a9-3a04-5724f2af9114@iki.fi
2018-05-04 10:27:50 +02:00
|
|
|
we need to copy predicate lock from the original page to all the new
|
|
|
|
pages.
|
2018-03-30 13:23:17 +02:00
|
|
|
|
2018-04-07 15:59:14 +02:00
|
|
|
* Hash index searches acquire predicate locks on the primary
|
|
|
|
page of a bucket. It acquires a lock on both the old and new buckets
|
|
|
|
for scans that happen concurrently with page splits. During a bucket
|
|
|
|
split, a predicate lock is copied from the primary page of an old
|
|
|
|
bucket to the primary page of a new bucket.
|
|
|
|
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
* The effects of page splits, overflows, consolidations, and
|
|
|
|
removals must be carefully reviewed to ensure that predicate locks
|
|
|
|
aren't "lost" during those operations, or kept with pages which could
|
|
|
|
get re-used for different parts of the index.
|
|
|
|
|
|
|
|
|
|
|
|
Innovations
|
|
|
|
-----------
|
|
|
|
|
|
|
|
The PostgreSQL implementation of Serializable Snapshot Isolation
|
|
|
|
differs from what is described in the cited papers for several
|
|
|
|
reasons:
|
|
|
|
|
|
|
|
1. PostgreSQL didn't have any existing predicate locking. It had
|
|
|
|
to be added from scratch.
|
|
|
|
|
|
|
|
2. The existing in-memory lock structures were not suitable for
|
|
|
|
tracking SIREAD locks.
|
|
|
|
* In PostgreSQL, tuple level locks are not held in RAM for
|
|
|
|
any length of time; lock information is written to the tuples
|
|
|
|
involved in the transactions.
|
|
|
|
* In PostgreSQL, existing lock structures have pointers to
|
2011-06-22 00:00:09 +02:00
|
|
|
memory which is related to a session. SIREAD locks need to persist
|
|
|
|
past the end of the originating transaction and even the session
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
which ran it.
|
|
|
|
* PostgreSQL needs to be able to tolerate a large number of
|
|
|
|
transactions executing while one long-running transaction stays open
|
|
|
|
-- the in-RAM techniques discussed in the papers wouldn't support
|
|
|
|
that.
|
|
|
|
|
|
|
|
3. Unlike the database products used for the prototypes described
|
|
|
|
in the papers, PostgreSQL didn't already have a true serializable
|
|
|
|
isolation level distinct from snapshot isolation.
|
|
|
|
|
|
|
|
4. PostgreSQL supports subtransactions -- an issue not mentioned
|
|
|
|
in the papers.
|
|
|
|
|
|
|
|
5. PostgreSQL doesn't assign a transaction number to a database
|
2011-06-22 00:00:09 +02:00
|
|
|
transaction until and unless necessary (normally, when the transaction
|
|
|
|
attempts to modify data).
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
|
|
|
6. PostgreSQL has pluggable data types with user-definable
|
|
|
|
operators, as well as pluggable index types, not all of which are
|
|
|
|
based around data types which support ordering.
|
|
|
|
|
|
|
|
7. Some possible optimizations became apparent during development
|
|
|
|
and testing.
|
|
|
|
|
|
|
|
Differences from the implementation described in the papers are
|
|
|
|
listed below.
|
|
|
|
|
|
|
|
* New structures needed to be created in shared memory to track
|
|
|
|
the proper information for serializable transactions and their SIREAD
|
|
|
|
locks.
|
|
|
|
|
|
|
|
* Because PostgreSQL does not have the same concept of an "oldest
|
|
|
|
transaction ID" for all serializable transactions as assumed in the
|
2011-06-10 16:22:00 +02:00
|
|
|
Cahill thesis, we track the oldest snapshot xmin among serializable
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
transactions, and a count of how many active transactions use that
|
|
|
|
xmin. When the count hits zero we find the new oldest xmin and run a
|
|
|
|
clean-up based on that.
|
|
|
|
|
|
|
|
* Because reads in a subtransaction may cause that subtransaction
|
|
|
|
to roll back, thereby affecting what is written by the top level
|
|
|
|
transaction, predicate locks must survive a subtransaction rollback.
|
|
|
|
As a consequence, all xid usage in SSI, including predicate locking,
|
|
|
|
is based on the top level xid. When looking at an xid that comes
|
|
|
|
from a tuple's xmin or xmax, for example, we always call
|
|
|
|
SubTransGetTopmostTransaction() before doing much else with it.
|
|
|
|
|
2011-05-30 19:42:16 +02:00
|
|
|
* PostgreSQL does not use "update in place" with a rollback log
|
|
|
|
for its MVCC implementation. Where possible it uses "HOT" updates on
|
|
|
|
the same page (if there is room and no indexed value is changed).
|
|
|
|
For non-HOT updates the old tuple is expired in place and a new tuple
|
|
|
|
is inserted at a new location. Because of this difference, a tuple
|
|
|
|
lock in PostgreSQL doesn't automatically lock any other versions of a
|
|
|
|
row. We don't try to copy or expand a tuple lock to any other
|
|
|
|
versions of the row, based on the following proof that any additional
|
|
|
|
serialization failures we would get from that would be false
|
|
|
|
positives:
|
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
o If transaction T1 reads a row version (thus acquiring a
|
|
|
|
predicate lock on it) and a second transaction T2 updates that row
|
|
|
|
version (thus creating a rw-conflict graph edge from T1 to T2), must a
|
|
|
|
third transaction T3 which re-updates the new version of the row also
|
|
|
|
have a rw-conflict in from T1 to prevent anomalies? In other words,
|
|
|
|
does it matter whether we recognize the edge T1 -> T3?
|
2011-05-30 19:42:16 +02:00
|
|
|
|
|
|
|
o If T1 has a conflict in, it certainly doesn't. Adding the
|
|
|
|
edge T1 -> T3 would create a dangerous structure, but we already had
|
2011-06-22 00:00:09 +02:00
|
|
|
one from the edge T1 -> T2, so we would have aborted something anyway.
|
|
|
|
(T2 has already committed, else T3 could not have updated its output;
|
|
|
|
but we would have aborted either T1 or T1's predecessor(s). Hence
|
|
|
|
no cycle involving T1 and T3 can survive.)
|
2011-05-30 19:42:16 +02:00
|
|
|
|
|
|
|
o Now let's consider the case where T1 doesn't have a
|
2011-06-22 00:00:09 +02:00
|
|
|
rw-conflict in. If that's the case, for this edge T1 -> T3 to make a
|
|
|
|
difference, T3 must have a rw-conflict out that induces a cycle in the
|
|
|
|
dependency graph, i.e. a conflict out to some transaction preceding T1
|
|
|
|
in the graph. (A conflict out to T1 itself would be problematic too,
|
|
|
|
but that would mean T1 has a conflict in, the case we already
|
|
|
|
eliminated.)
|
2011-05-30 19:42:16 +02:00
|
|
|
|
|
|
|
o So now we're trying to figure out if there can be an
|
|
|
|
rw-conflict edge T3 -> T0, where T0 is some transaction that precedes
|
2011-06-22 00:00:09 +02:00
|
|
|
T1. For T0 to precede T1, there has to be some edge, or sequence of
|
|
|
|
edges, from T0 to T1. At least the last edge has to be a wr-dependency
|
|
|
|
or ww-dependency rather than a rw-conflict, because T1 doesn't have a
|
|
|
|
rw-conflict in. And that gives us enough information about the order
|
|
|
|
of transactions to see that T3 can't have a rw-conflict to T0:
|
2011-05-30 19:42:16 +02:00
|
|
|
- T0 committed before T1 started (the wr/ww-dependency implies this)
|
|
|
|
- T1 started before T2 committed (the T1->T2 rw-conflict implies this)
|
2011-06-22 00:00:09 +02:00
|
|
|
- T2 committed before T3 started (otherwise, T3 would get aborted
|
2011-05-30 19:42:16 +02:00
|
|
|
because of an update conflict)
|
|
|
|
|
|
|
|
o That means T0 committed before T3 started, and therefore
|
|
|
|
there can't be a rw-conflict from T3 to T0.
|
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
o So in all cases, we don't need the T1 -> T3 edge to
|
|
|
|
recognize cycles. Therefore it's not necessary for T1's SIREAD lock
|
|
|
|
on the original tuple version to cover later versions as well.
|
2011-05-30 19:42:16 +02:00
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
* Predicate locking in PostgreSQL starts at the tuple level
|
|
|
|
when possible. Multiple fine-grained locks are promoted to a single
|
|
|
|
coarser-granularity lock as needed to avoid resource exhaustion. The
|
|
|
|
amount of memory used for these structures is configurable, to balance
|
|
|
|
RAM usage against SIREAD lock granularity.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
* Each backend keeps a process-local table of the locks it holds.
|
|
|
|
To support granularity promotion decisions with low CPU and locking
|
|
|
|
overhead, this table also includes the coarser covering locks and the
|
|
|
|
number of finer-granularity locks they cover.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
* Conflicts are identified by looking for predicate locks
|
|
|
|
when tuples are written, and by looking at the MVCC information when
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
tuples are read. There is no matching between two RAM-based locks.
|
|
|
|
|
|
|
|
* Because write locks are stored in the heap tuples rather than a
|
|
|
|
RAM-based lock table, the optimization described in the Cahill thesis
|
|
|
|
which eliminates an SIREAD lock where there is a write lock is
|
|
|
|
implemented by the following:
|
|
|
|
1. When checking a heap write for conflicts against existing
|
|
|
|
predicate locks, a tuple lock on the tuple being written is removed.
|
|
|
|
2. When acquiring a predicate lock on a heap tuple, we
|
|
|
|
return quickly without doing anything if it is a tuple written by the
|
|
|
|
reading transaction.
|
|
|
|
|
|
|
|
* Rather than using conflictIn and conflictOut pointers which use
|
|
|
|
NULL to indicate no conflict and a self-reference to indicate
|
|
|
|
multiple conflicts or conflicts with committed transactions, we use a
|
|
|
|
list of rw-conflicts. With the more complete information, false
|
|
|
|
positives are reduced and we have sufficient data for more aggressive
|
2011-06-22 00:00:09 +02:00
|
|
|
clean-up and other optimizations:
|
|
|
|
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
o We can avoid ever rolling back a transaction until and
|
|
|
|
unless there is a pivot where a transaction on the conflict *out*
|
|
|
|
side of the pivot committed before either of the other transactions.
|
2011-06-22 00:00:09 +02:00
|
|
|
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
o We can avoid ever rolling back a transaction when the
|
|
|
|
transaction on the conflict *in* side of the pivot is explicitly or
|
|
|
|
implicitly READ ONLY unless the transaction on the conflict *out*
|
|
|
|
side of the pivot committed before the READ ONLY transaction acquired
|
|
|
|
its snapshot. (An implicit READ ONLY transaction is one which
|
|
|
|
committed without writing, even though it was not explicitly declared
|
|
|
|
to be READ ONLY.)
|
2011-06-22 00:00:09 +02:00
|
|
|
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
o We can more aggressively clean up conflicts, predicate
|
|
|
|
locks, and SSI transaction information.
|
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
* We allow a READ ONLY transaction to "opt out" of SSI if there are
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
no READ WRITE transactions which could cause the READ ONLY
|
|
|
|
transaction to ever become part of a "dangerous structure" of
|
|
|
|
overlapping transaction dependencies.
|
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
* We allow the user to request that a READ ONLY transaction wait
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
until the conditions are right for it to start in the "opt out" state
|
|
|
|
described above. We add a DEFERRABLE state to transactions, which is
|
|
|
|
specified and maintained in a way similar to READ ONLY. It is
|
2011-06-22 00:00:09 +02:00
|
|
|
ignored for transactions that are not SERIALIZABLE and READ ONLY.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
|
|
|
* When a transaction must be rolled back, we pick among the
|
|
|
|
active transactions such that an immediate retry will not fail again
|
|
|
|
on conflicts with the same transactions.
|
|
|
|
|
|
|
|
* We use the PostgreSQL SLRU system to hold summarized
|
|
|
|
information about older committed transactions to put an upper bound
|
|
|
|
on RAM used. Beyond that limit, information spills to disk.
|
|
|
|
Performance can degrade in a pessimal situation, but it should be
|
2011-06-29 08:26:14 +02:00
|
|
|
tolerable, and transactions won't need to be canceled or blocked
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
from starting.
|
|
|
|
|
|
|
|
|
|
|
|
R&D Issues
|
|
|
|
----------
|
|
|
|
|
|
|
|
This is intended to be the place to record specific issues which need
|
|
|
|
more detailed review or analysis.
|
|
|
|
|
|
|
|
* WAL file replay. While serializable implementations using S2PL
|
|
|
|
can guarantee that the write-ahead log contains commits in a sequence
|
|
|
|
consistent with some serial execution of serializable transactions,
|
|
|
|
SSI cannot make that guarantee. While the WAL replay is no less
|
|
|
|
consistent than under snapshot isolation, it is possible that under
|
|
|
|
PITR recovery or hot standby a database could reach a readable state
|
|
|
|
where some transactions appear before other transactions which would
|
|
|
|
have had to precede them to maintain serializable consistency. In
|
|
|
|
essence, if we do nothing, WAL replay will be at snapshot isolation
|
|
|
|
even for serializable transactions. Is this OK? If not, how do we
|
|
|
|
address it?
|
|
|
|
|
|
|
|
* External replication. Look at how this impacts external
|
|
|
|
replication solutions, like Postgres-R, Slony, pgpool, HS/SR, etc.
|
|
|
|
This is related to the "WAL file replay" issue.
|
|
|
|
|
|
|
|
* UNIQUE btree search for equality on all columns. Since a search
|
|
|
|
of a UNIQUE index using equality tests on all columns will lock the
|
|
|
|
heap tuple if an entry is found, it appears that there is no need to
|
|
|
|
get a predicate lock on the index in that case. A predicate lock is
|
|
|
|
still needed for such a search if a matching index entry which points
|
|
|
|
to a visible tuple is not found.
|
|
|
|
|
|
|
|
* Minimize touching of shared memory. Should lists in shared
|
|
|
|
memory push entries which have just been returned to the front of the
|
|
|
|
available list, so they will be popped back off soon and some memory
|
|
|
|
might never be touched, or should we keep adding returned items to
|
|
|
|
the end of the available list?
|
|
|
|
|
|
|
|
|
2011-06-22 00:00:09 +02:00
|
|
|
References
|
|
|
|
----------
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
|
|
|
|
[1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
|
|
|
|
Search for serial execution to find the relevant section.
|
|
|
|
|
2011-06-16 20:12:56 +02:00
|
|
|
[2] A. Fekete et al. Making Snapshot Isolation Serializable. In ACM
|
|
|
|
Transactions on Database Systems 30:2, Jun. 2005.
|
|
|
|
http://dx.doi.org/10.1145/1071610.1071615
|
|
|
|
|
|
|
|
[3] Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007.
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
Architecture of a Database System. Foundations and Trends(R) in
|
2011-05-31 22:10:05 +02:00
|
|
|
Databases Vol. 1, No. 2 (2007) 141-259.
|
2011-06-16 20:12:56 +02:00
|
|
|
http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
Of particular interest:
|
|
|
|
* 6.1 A Note on ACID
|
|
|
|
* 6.2 A Brief Review of Serializability
|
|
|
|
* 6.3 Locking and Latching
|
|
|
|
* 6.3.1 Transaction Isolation Levels
|
2011-06-16 20:12:56 +02:00
|
|
|
* 6.5.3 Next-Key Locking: Physical Surrogates for Logical Properties
|