diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index f4dc252780..e8f8b8a581 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1,5 +1,5 @@
 Backup and Restore
@@ -1148,21 +1148,20 @@ restore_command = 'copy /mnt/server/archivedir/%f "%p"' # Windows
 It should also be noted that the default WAL
- format is fairly bulky since it includes many disk page snapshots. The pages
- are partially compressed, using the simple expedient of removing the
- empty space (if any) within each block. You can significantly reduce
+ format is fairly bulky since it includes many disk page snapshots.
+ These page snapshots are designed to support crash recovery,
+ since we may need to fix partially-written disk pages. Depending
+ on your system hardware and software, the risk of partial writes may
+ be small enough to ignore, in which case you can significantly reduce
 the total volume of archived logs by turning off page snapshots
- using the parameter,
- though you should read the notes and warnings in
- before you do so.
- These page snapshots are designed to allow crash recovery,
- since we may need to fix partially-written disk pages. It is not
- necessary to store these page copies for PITR operations, however.
- If you turn off , your PITR
- backup and recovery operations will continue to work successfully.
+ using the parameter.
+ (Read the notes and warnings in
+ before you do so.)
+ Turning off page snapshots does not prevent use of the logs for PITR
+ operations.
 An area for future development is to compress archived WAL data by
- removing unnecessary page copies when
- is turned on. In the meantime, administrators
+ removing unnecessary page copies even when full_page_writes
+ is on. In the meantime, administrators
 may wish to reduce the number of page snapshots included in WAL by
 increasing the checkpoint interval parameters as much as feasible.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5582f6b778..f72ad33a6a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1,5 +1,5 @@
 Run-time Configuration
@@ -1251,14 +1251,15 @@ SET ENABLE_SEQSCAN TO OFF;
 If this option is on, the PostgreSQL server
- will use the fsync() system call in several places
- to make sure that updates are physically written to disk. This
- insures that a database cluster will recover to a
+ will try to make sure that updates are physically written to
+ disk, by issuing fsync() system calls or various
+ equivalent methods (see ).
+ This ensures that the database cluster can recover to a
 consistent state after an operating system or hardware crash.
- However, using fsync() results in a
+ However, using fsync results in a
 performance penalty: when a transaction is committed,
 PostgreSQL must wait for the
 operating system to flush the write-ahead log to disk. When
@@ -1268,7 +1269,7 @@ SET ENABLE_SEQSCAN TO OFF;
 However, if the system crashes, the results of the last few
 committed transactions may be lost in part or whole. In the
 worst case, unrecoverable data corruption may occur.
- (Crashes of the database server itself are not
+ (Crashes of the database software itself are not
 a risk factor here. Only an operating-system-level crash
 creates a risk of corruption.)
@@ -1277,8 +1278,8 @@ SET ENABLE_SEQSCAN TO OFF;
 Due to the risks involved, there is no universally correct
 setting for fsync.
 Some administrators always disable fsync, while others only
- turn it off for bulk loads, where there is a clear restart
- point if something goes wrong, whereas some administrators
+ turn it off during initial bulk data loads, where there is a clear
+ restart point if something goes wrong. Others always
 leave fsync enabled. The default is to enable fsync,
 for maximum reliability. If you trust your operating system, your
 hardware, and your
@@ -1288,9 +1289,9 @@ SET ENABLE_SEQSCAN TO OFF;
 This option can only be set at server start or in the
- postgresql.conf file. If this option
- is off, consider also turning off
- guc-full-page-writes.
+ postgresql.conf file. If you turn
+ this option off, also consider turning off
+ .
@@ -1302,8 +1303,10 @@ SET ENABLE_SEQSCAN TO OFF;
- Method used for forcing WAL updates out to disk. Possible
- values are:
+ Method used for forcing WAL updates out to disk.
+ If fsync is off then this setting is irrelevant,
+ since updates will not be forced out at all.
+ Possible values are:
@@ -1313,7 +1316,12 @@ SET ENABLE_SEQSCAN TO OFF;
- fdatasync (call fdatasync() at each commit),
+ fdatasync (call fdatasync() at each commit)
+
+
+
+
+ fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)
@@ -1322,11 +1330,6 @@ SET ENABLE_SEQSCAN TO OFF;
-
- fsync_writethrough (force write-through of any disk write cache)
-
-
-
 open_sync (write WAL files with open() option O_SYNC)
@@ -1334,8 +1337,7 @@
 Not all of these choices are available on all platforms.
- The top-most supported option is used as the default.
- If fsync is off then this setting is irrelevant.
+ The default is the first method in the above list that is supported.
 This option can only be set at server start or in the
 postgresql.conf file.
@@ -1349,21 +1351,37 @@ SET ENABLE_SEQSCAN TO OFF;
 full_page_writes (boolean)
- A page write in process during an operating system crash might
- be only partially written to disk, leading to an on-disk page
- that contains a mix of old and new data. During recovery, the
- row changes stored in WAL are not enough to completely restore
- the page.
+ When this option is on, the PostgreSQL server
+ writes the entire content of each disk page to WAL during the
+ first modification of that page after a checkpoint.
+ This is needed because
+ a page write that is in process during an operating system crash might
+ be only partially completed, leading to an on-disk page
+ that contains a mix of old and new data. The row-level change data
+ normally stored in WAL will not be enough to completely restore
+ such a page during post-crash recovery. Storing the full page image
+ guarantees that the page can be correctly restored, but at the price
+ of increasing the amount of data that must be written to WAL.
+ (Because WAL replay always starts from a checkpoint, it is sufficient
+ to do this during the first change of each page after a checkpoint.
+ Therefore, one way to reduce the cost of full-page writes is to
+ increase the checkpoint interval parameters.)
- When this option is on, the PostgreSQL server
- writes full pages to WAL when they are first modified after a
- checkpoint so crash recovery is possible. Turning this option off
- might lead to a corrupt system after an operating system crash
- or power failure because uncorrected partial pages might contain
- inconsistent or corrupt data. The risks are less but similar to
- fsync.
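+ For example, raising the checkpoint interval parameters
+ checkpoint_segments and checkpoint_timeout in postgresql.conf makes
+ checkpoints, and therefore full-page images, less frequent. The values
+ below are purely illustrative; suitable settings depend on your
+ workload and on how much WAL disk space and crash-recovery time you
+ can accept:
+
+ checkpoint_segments = 16      # illustrative value; the default is 3
+ checkpoint_timeout = 900      # seconds; the default is 300
+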
+ Turning this option off speeds normal operation, but
+ might lead to a corrupt database after an operating system crash
+ or power failure. The risks are similar to turning off
+ fsync, though smaller. It may be safe to turn off
+ this option if you have hardware (such as a battery-backed disk
+ controller) or filesystem software (e.g., Reiser4) that reduces
+ the risk of partial page writes to an acceptably low level.
+
+
+
+ Turning off this option does not affect use of
+ WAL archiving for point-in-time recovery (PITR)
+ (see ).
@@ -1384,7 +1402,7 @@ SET ENABLE_SEQSCAN TO OFF;
 Number of disk-page buffers allocated in shared memory for WAL
 data. The default is 8. The setting need only be large enough to hold
 the amount of WAL data generated by one typical transaction, since
- the data is flushed to disk at every transaction commit.
+ the data is written out to disk at every transaction commit.
 This option can only be set at server start.
@@ -1481,8 +1499,9 @@ SET ENABLE_SEQSCAN TO OFF;
 Write a message to the server log if checkpoints caused by the
 filling of checkpoint segment files happen closer together
- than this many seconds. The default is 30 seconds.
- Zero turns off the warning.
+ than this many seconds (which suggests that
+ checkpoint_segments ought to be raised). The default is
+ 30 seconds. Zero disables the warning.
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 62595c594e..cfea73ed69 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -1,4 +1,4 @@
-
+
 Reliability
@@ -7,12 +7,12 @@
 Reliability is a major feature of any serious database system, and
 PostgreSQL does everything possible to guarantee reliable
 operation. One aspect of reliable operation is that all data
- recorded by a transaction should be stored in a non-volatile area
+ recorded by a committed transaction should be stored in a non-volatile area
 that is safe from power loss, operating system failure, and hardware
- failure (unrelated to the non-volatile area itself). To accomplish
- this, PostgreSQL uses the magnetic platters of modern
- disk drives for permanent storage that is immune to the failures
- listed above. In fact, even if a computer is fatally damaged, if
+ failure (except failure of the non-volatile area itself, of course).
+ Successfully writing the data to the computer's permanent storage
+ (disk drive or equivalent) ordinarily meets this requirement.
+ In fact, even if a computer is fatally damaged, if
 the disk drives survive they can be moved to another computer
 with similar hardware and all committed transactions will remain intact.
@@ -21,60 +21,64 @@
 While forcing data periodically to the disk platters might seem like
 a simple operation, it is not. Because disk drives are dramatically
 slower than main memory and CPUs, several layers of caching exist
- between the computer's main memory and the disk drive platters.
- First, there is the operating system kernel cache, which caches
- frequently requested disk blocks and delays disk writes. Fortunately,
+ between the computer's main memory and the disk platters.
+ First, there is the operating system's buffer cache, which caches
+ frequently requested disk blocks and combines disk writes. Fortunately,
 all operating systems give applications a way to force writes from
- the kernel cache to disk, and PostgreSQL uses those
- features. In fact, the parameter
- controls how this is done.
+ the buffer cache to disk, and PostgreSQL uses those
+ features. (See the parameter
+ to adjust how this is done.)
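+ As an illustration only (the available methods, and which one is
+ preferable, vary from platform to platform), the corresponding
+ postgresql.conf entries on a typical Linux system might be:
+
+ fsync = on
+ wal_sync_method = fdatasync
+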
+
- Secondly, there is an optional disk drive controller cache,
- particularly popular on RAID controller cards. Some of
- these caches are write-through, meaning writes are passed
+ Next, there may be a cache in the disk drive controller; this is
+ particularly common on RAID controller cards. Some of
+ these caches are write-through, meaning writes are passed
 along to the drive as soon as they arrive. Others are
- write-back, meaning data is passed on to the drive at
- some later time. Such caches can be a reliability problem because the
- disk controller card cache is volatile, unlike the disk driver
- platters, unless the disk drive controller has a battery-backed
- cache, meaning the card has a battery that maintains power to the
- cache in case of server power loss. When the disk drives are later
- accessible, the data is written to the drives.
+ write-back, meaning data is passed on to the drive at
+ some later time. Such caches can be a reliability hazard because the
+ memory in the disk controller cache is volatile, and will lose its
+ contents in a power failure. Better controller cards have
+ battery-backed caches, meaning the card has a battery that
+ maintains power to the cache in case of system power loss. After power
+ is restored the data will be written to the disk drives.
 And finally, most disk drives have caches. Some are write-through
- (typically SCSI), and some are write-back(typically IDE), and the
+ while some are write-back, and the
 same concerns about data loss exist for write-back drive caches as
- exist for disk controller caches. To have reliability, all
- storage subsystems must be reliable in their storage characteristics.
- When the operating system sends a write request to the drive platters,
- there is little it can do to make sure the data has arrived at a
- non-volatile store area on the system. Rather, it is the
+ exist for disk controller caches. Consumer-grade IDE drives are
+ particularly likely to contain write-back caches that will not
+ survive a power failure.
+
+
+
+ When the operating system sends a write request to the disk hardware,
+ there is little it can do to make sure the data has arrived at a truly
+ non-volatile storage area. Rather, it is the
 administrator's responsibility to be sure that all storage components
- have reliable characteristics.
+ ensure data integrity. Avoid disk controllers that have non-battery-backed
+ write caches. At the drive level, disable write-back caching if the
+ drive cannot guarantee the data will be written before shutdown.
- One other area of potential data loss are the disk platter writes
- themselves. Disk platters are internally made up of 512-byte sectors.
+ Another risk of data loss is posed by the disk platter write
+ operations themselves. Disk platters are divided into sectors,
+ commonly 512 bytes each. Every physical read or write operation
+ processes a whole sector.
 When a write request arrives at the drive, it might be for 512 bytes,
 1024 bytes, or 8192 bytes, and the process of writing could fail due
 to power loss at any time, meaning some of the 512-byte sectors were
- written, and others were not, or the first half of a 512-byte sector
- has new data, and the remainder has the original data. Obviously, on
- startup, PostgreSQL would not be able to deal with
- these partially written cases. To guard against that,
+ written, and others were not. To guard against such failures,
 PostgreSQL periodically writes full page images to
 permanent storage before modifying the actual page on disk.
 By doing this, during crash recovery PostgreSQL can
- restore partially-written pages. If you have a battery-backed disk
- controller or filesystem (e.g. Reiser4) that prevents partial page writes,
- you can turn off this page imaging by using the
- parameter. This parameter has no
- effect on the successful use of Point in Time Recovery (PITR),
- described in .
+ restore partially-written pages. If you have a battery-backed disk
+ controller or filesystem software (e.g., Reiser4) that prevents partial
+ page writes, you can turn off this page imaging by using the
+ parameter.
@@ -111,11 +115,7 @@
- WAL brings three major benefits:
-
-
-
- The first major benefit of using WAL is a
+ A major benefit of using WAL is a
 significantly reduced number of disk writes, because only the log
 file needs to be flushed to disk at the time of transaction
 commit, rather than every data file changed by the transaction.
@@ -129,30 +129,7 @@
- The next benefit is crash recovery protection. The truth is
- that, before WAL was introduced back in release 7.1,
- PostgreSQL was never able to guarantee
- consistency in the case of a crash. Now,
- WAL protects fully against the following problems:
-
-
-
- index rows pointing to nonexistent table rows
-
-
-
- index rows lost in split operations
-
-
-
- totally corrupted table or index page content, because
- of partially written data pages
-
-
-
-
-
- Finally, WAL makes it possible to support on-line
+ WAL also makes it possible to support on-line
 backup and point-in-time recovery, as described in
 . By archiving the WAL data we can support reverting to any
 time instant covered by the available WAL data:
@@ -169,7 +146,7 @@ <acronym>WAL</acronym> Configuration
- There are several WAL-related configuration parameters that
+ There are several WAL-related configuration parameters that
 affect database performance. This section explains their use.
 Consult for general information about setting server
 configuration parameters.
@@ -178,16 +155,17 @@
 Checkpoints are points in the sequence of transactions at which it is guaranteed
- that the data files have been updated with all information logged before
+ that the data files have been updated with all information written before
 the checkpoint. At checkpoint time, all dirty data pages are flushed to
- disk and a special checkpoint record is written to the log file. As a
- result, in the event of a crash, the crash recovery procedure knows from
- what point in the log (known as the redo record) it should start the
- REDO operation, since any changes made to data files before that point
- are already on disk. After a checkpoint has been made, any log segments
- written before the redo record are no longer needed and can be recycled
- or removed. (When WAL archiving is being done, the
- log segments must be archived before being recycled or removed.)
+ disk and a special checkpoint record is written to the log file.
+ In the event of a crash, the crash recovery procedure looks at the latest
+ checkpoint record to determine the point in the log (known as the redo
+ record) from which it should start the REDO operation. Any changes made to
+ data files before that point are known to be already on disk. Hence, after
+ a checkpoint has been made, any log segments preceding the one containing
+ the redo record are no longer needed and can be recycled or removed. (When
+ WAL archiving is being done, the log segments must be
+ archived before being recycled or removed.)
@@ -206,7 +184,7 @@
 more often.
 This allows faster after-crash recovery (since less work will need to
 be redone). However, one must balance this against the increased cost
 of flushing dirty data pages more often. If
- is set (the default), there is
+ is set (as is the default), there is
 another factor to consider. To ensure data page consistency,
 the first modification of a data page after each checkpoint results in
 logging the entire page content. In that case,
@@ -228,8 +206,9 @@
 checkpoint_segments. Occasional appearance of such a message is not
 cause for alarm, but if it appears often then the checkpoint
 control parameters should be increased. Bulk operations such
- as a COPY, INSERT SELECT etc. may cause a number of such warnings if you
- do not set high enough.
+ as large COPY transfers may cause a number of such warnings
+ to appear if you have not set checkpoint_segments high
+ enough.
@@ -273,8 +252,7 @@
 correspondingly increase shared memory usage. When
 is set and the system is very busy, setting this value higher will help
 smooth response times during the
- period immediately following each checkpoint. As a guide, a setting of 1024
- would be considered to be high.
+ period immediately following each checkpoint.
@@ -310,8 +288,7 @@
 (provided that PostgreSQL has been compiled with
 support for it) will result in each LogInsert and
 LogFlush
- WAL call being logged to the server log. The output
- is too verbose for use as a guide to performance tuning. This
+ WAL call being logged to the server log. This
 option may be replaced by a more general mechanism in the future.
@@ -340,15 +317,6 @@
 available stock of numbers.
-
- The WAL buffers and control structure are in
- shared memory and are handled by the server child processes; they
- are protected by lightweight locks. The demand on shared memory is
- dependent on the number of buffers. The default size of the
- WAL buffers is 8 buffers of 8 kB each, or 64 kB
- total.
-
-
 It is of advantage if the log is located on another disk than the
 main database files. This may be achieved by moving the directory