From 54d0e2886a9cea2035a471dd6dfb8f671223de7d Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Fri, 17 Sep 2010 00:42:39 +0000 Subject: [PATCH] Add some documentation about how we WAL-log filesystem actions. Per a question from Robert Haas. --- src/backend/access/transam/README | 81 ++++++++++++++++++++++++++++++- 1 file changed, 80 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 05c41d487c..08794cdf8a 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -1,4 +1,4 @@ -$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.13 2009/12/19 01:32:33 sriggs Exp $ +$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.14 2010/09/17 00:42:39 tgl Exp $ The Transaction System ====================== @@ -543,6 +543,85 @@ consistency. Such insertions occur after WAL is operational, so they can and should write WAL records for the additional generated actions. +Write-Ahead Logging for Filesystem Actions +------------------------------------------ + +The previous section described how to WAL-log actions that only change page +contents within shared buffers. For that type of action it is generally +possible to check all likely error cases (such as insufficient space on the +page) before beginning to make the actual change. Therefore we can make +the change and the creation of the associated WAL log record "atomic" by +wrapping them into a critical section --- the odds of failure partway +through are low enough that PANIC is acceptable if it does happen. + +Clearly, that approach doesn't work for cases where there's a significant +probability of failure within the action to be logged, such as creation +of a new file or database. We don't want to PANIC, and we especially don't +want to PANIC after having already written a WAL record that says we did +the action --- if we did, replay of the record would probably fail again +and PANIC again, making the failure unrecoverable. This means that the +ordinary WAL rule of "write WAL before the changes it describes" doesn't +work, and we need a different design for such cases. + +There are several basic types of filesystem actions that have this +issue. Here is how we deal with each: + +1. Adding a disk page to an existing table. + +This action isn't WAL-logged at all. We extend a table by writing a page +of zeroes at its end. We must actually do this write so that we are sure +the filesystem has allocated the space. If the write fails we can just +error out normally. Once the space is known allocated, we can initialize +and fill the page via one or more normal WAL-logged actions. Because it's +possible that we crash between extending the file and writing out the WAL +entries, we have to treat discovery of an all-zeroes page in a table or +index as being a non-error condition. In such cases we can just reclaim +the space for re-use. + +2. Creating a new table, which requires a new file in the filesystem. + +We try to create the file, and if successful we make a WAL record saying +we did it. If not successful, we can just throw an error. Notice that +there is a window where we have created the file but not yet written any +WAL about it to disk. If we crash during this window, the file remains +on disk as an "orphan". It would be possible to clean up such orphans +by having database restart search for files that don't have any committed +entry in pg_class, but that currently isn't done because of the possibility +of deleting data that is useful for forensic analysis of the crash. +Orphan files are harmless --- at worst they waste a bit of disk space --- +because we check for on-disk collisions when allocating new relfilenode +OIDs. So cleaning up isn't really necessary. + +3. Deleting a table, which requires an unlink() that could fail. + +Our approach here is to WAL-log the operation first, but to treat failure +of the actual unlink() call as a warning rather than error condition. +Again, this can leave an orphan file behind, but that's cheap compared to +the alternatives. Since we can't actually do the unlink() until after +we've committed the DROP TABLE transaction, throwing an error would be out +of the question anyway. (It may be worth noting that the WAL entry about +the file deletion is actually part of the commit record for the dropping +transaction.) + +4. Creating and deleting databases and tablespaces, which requires creating +and deleting directories and entire directory trees. + +These cases are handled similarly to creating individual files, ie, we +try to do the action first and then write a WAL entry if it succeeded. +The potential amount of wasted disk space is rather larger, of course. +In the creation case we try to delete the directory tree again if creation +fails, so as to reduce the risk of wasted space. Failure partway through +a deletion operation results in a corrupt database: the DROP failed, but +some of the data is gone anyway. There is little we can do about that, +though, and in any case it was presumably data the user no longer wants. + +In all of these cases, if WAL replay fails to redo the original action +we must panic and abort recovery. The DBA will have to manually clean up +(for instance, free up some disk space or fix directory permissions) and +then restart recovery. This is part of the reason for not writing a WAL +entry until we've successfully done the original action. + + Asynchronous Commit -------------------