From 54d0e2886a9cea2035a471dd6dfb8f671223de7d Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 17 Sep 2010 00:42:39 +0000
Subject: [PATCH] Add some documentation about how we WAL-log filesystem
 actions. Per a question from Robert Haas.

---
 src/backend/access/transam/README | 81 ++++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 05c41d487c..08794cdf8a 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.13 2009/12/19 01:32:33 sriggs Exp $
+$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.14 2010/09/17 00:42:39 tgl Exp $
 
 The Transaction System
 ======================
@@ -543,6 +543,85 @@ consistency.  Such insertions occur after WAL is operational, so they can
 and should write WAL records for the additional generated actions.
 
 
+Write-Ahead Logging for Filesystem Actions
+------------------------------------------
+
+The previous section described how to WAL-log actions that only change page
+contents within shared buffers.  For that type of action it is generally
+possible to check all likely error cases (such as insufficient space on the
+page) before beginning to make the actual change.  Therefore we can make
+the change and the creation of the associated WAL record "atomic" by
+wrapping them into a critical section --- the odds of failure partway
+through are low enough that PANIC is acceptable if it does happen.
+
+Clearly, that approach doesn't work for cases where there's a significant
+probability of failure within the action to be logged, such as creation
+of a new file or database.  We don't want to PANIC, and we especially don't
+want to PANIC after having already written a WAL record that says we did
+the action --- if we did, replay of the record would probably fail again
+and PANIC again, making the failure unrecoverable.  This means that the
+ordinary WAL rule of "write WAL before the changes it describes" doesn't
+work, and we need a different design for such cases.
+
+There are several basic types of filesystem actions that have this
+issue.  Here is how we deal with each:
+
+1. Adding a disk page to an existing table.
+
+This action isn't WAL-logged at all.  We extend a table by writing a page
+of zeroes at its end.  We must actually do this write so that we are sure
+the filesystem has allocated the space.  If the write fails we can just
+error out normally.  Once the space is known to be allocated, we can initialize
+and fill the page via one or more normal WAL-logged actions.  Because it's
+possible that we crash between extending the file and writing out the WAL
+entries, we have to treat discovery of an all-zeroes page in a table or
+index as being a non-error condition.  In such cases we can just reclaim
+the space for re-use.
+
+2. Creating a new table, which requires a new file in the filesystem.
+
+We try to create the file, and if successful we make a WAL record saying
+we did it.  If not successful, we can just throw an error.  Notice that
+there is a window where we have created the file but not yet written any
+WAL about it to disk.  If we crash during this window, the file remains
+on disk as an "orphan".  It would be possible to clean up such orphans
+by having database restart search for files that don't have any committed
+entry in pg_class, but that currently isn't done because of the possibility
+of deleting data that is useful for forensic analysis of the crash.
+Orphan files are harmless --- at worst they waste a bit of disk space ---
+because we check for on-disk collisions when allocating new relfilenode
+OIDs.  So cleaning up isn't really necessary.
+
+3. Deleting a table, which requires an unlink() that could fail.
+
+Our approach here is to WAL-log the operation first, but to treat failure
+of the actual unlink() call as a warning rather than an error condition.
+Again, this can leave an orphan file behind, but that's cheap compared to
+the alternatives.  Since we can't actually do the unlink() until after
+we've committed the DROP TABLE transaction, throwing an error would be out
+of the question anyway.  (It may be worth noting that the WAL entry about
+the file deletion is actually part of the commit record for the dropping
+transaction.)
+
+4. Creating and deleting databases and tablespaces, which requires creating
+and deleting directories and entire directory trees.
+
+These cases are handled similarly to creating individual files, i.e., we
+try to do the action first and then write a WAL entry if it succeeded.
+The potential amount of wasted disk space is rather larger, of course.
+In the creation case we try to delete the directory tree again if creation
+fails, so as to reduce the risk of wasted space.  Failure partway through
+a deletion operation results in a corrupt database: the DROP failed, but
+some of the data is gone anyway.  There is little we can do about that,
+though, and in any case it was presumably data the user no longer wants.
+
+In all of these cases, if WAL replay fails to redo the original action
+we must panic and abort recovery.  The DBA will have to manually clean up
+(for instance, free up some disk space or fix directory permissions) and
+then restart recovery.  This is part of the reason for not writing a WAL
+entry until we've successfully done the original action.
+
+
 Asynchronous Commit
 -------------------