diff --git a/doc/src/sgml/admin.sgml b/doc/src/sgml/admin.sgml
index b79146cb99..32c0983b17 100644
--- a/doc/src/sgml/admin.sgml
+++ b/doc/src/sgml/admin.sgml
@@ -1,5 +1,5 @@
@@ -29,6 +29,7 @@ $Header: /cvsroot/pgsql/doc/src/sgml/Attic/admin.sgml,v 1.34 2001/05/12 22:51:34
 &charset;
 &manage-ag;
 &user-manag;
+&maintenance;
 &backup;
 &wal;
 &storage;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 45451ecabc..4b613db8e5 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -1,4 +1,4 @@
-
+
@@ -43,6 +43,7 @@
+
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
new file mode 100644
index 0000000000..7629c5bd54
--- /dev/null
+++ b/doc/src/sgml/maintenance.sgml
@@ -0,0 +1,371 @@

Routine Database Maintenance Tasks

General Discussion

There are a few routine maintenance chores that must be performed on a
regular basis to keep a PostgreSQL installation running smoothly. The tasks
discussed here are repetitive in nature and can easily be automated using
standard Unix tools such as cron scripts. But it is the database
administrator's responsibility to set up appropriate scripts, and to check
that they execute successfully.

One obvious maintenance task is the creation of backup copies of the data on
a regular schedule. Without a recent backup, you have no chance of recovery
after a catastrophe (disk failure, fire, mistakenly dropping a critical
table, etc.). The backup and recovery mechanisms available in PostgreSQL are
discussed at length in the chapter on backup and restore.

The other main category of maintenance task is periodic vacuuming of the
database. This activity is discussed in the section on routine vacuuming
below.

PostgreSQL is low-maintenance compared to some other database products.
Nonetheless, appropriate attention to these tasks will go far towards
ensuring a pleasant and productive experience with the system.

Routine Vacuuming

PostgreSQL's VACUUM command must be run on a regular basis for several
reasons:

 - To recover disk space occupied by updated or deleted rows.
 - To update data statistics used by the PostgreSQL query planner.
 - To protect against loss of very old data due to transaction ID
   wraparound.

The frequency and scope of the VACUUM operations performed for each of these
reasons will vary depending on the needs of each installation. Therefore,
database administrators must understand these issues and develop an
appropriate maintenance strategy. This section concentrates on explaining
the high-level issues; for details about command syntax and so on, see the
VACUUM command reference page.

Beginning in PostgreSQL 7.2, the standard form of VACUUM can run in parallel
with normal database operations (selects, inserts, updates, and deletes, but
not changes to table schemas). Routine vacuuming is therefore not nearly as
intrusive as it was in prior releases, and it is not as critical to try to
schedule it at low-usage times of day.

Recovering disk space

In normal PostgreSQL operation, an UPDATE or DELETE of a row does not
immediately remove the old tuple (version of the row). This approach is
necessary to gain the benefits of multi-version concurrency control (see the
User's Guide): the tuple must not be deleted while it is still potentially
visible to other transactions. But eventually, an outdated or deleted tuple
is no longer of interest to any transaction. The space it occupies must then
be reclaimed for reuse by new tuples, to avoid infinite growth of disk space
requirements. This is done by running VACUUM.
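For illustration, here is a minimal SQL sketch of that cycle; the accounts
table and its columns are invented for the example, and the exact VERBOSE
output differs between versions:

    -- "accounts" stands in for any frequently-updated table.  Each UPDATE
    -- leaves the previous version of every affected row behind as a dead tuple.
    UPDATE accounts SET balance = balance * 1.05;

    -- Plain VACUUM finds those dead tuples and marks their space as reusable
    -- within the table; it does not shrink the file on disk.  With VERBOSE it
    -- reports how many tuples it removed.
    VACUUM VERBOSE accounts;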
Clearly, a table that receives frequent updates or deletes will need to be
vacuumed more often than tables that are seldom updated. It may be useful to
set up periodic cron tasks that vacuum only selected tables, skipping tables
that are known not to change often. This is only likely to be helpful if you
have both large heavily-updated tables and large seldom-updated tables ---
the extra cost of vacuuming a small table isn't enough to be worth worrying
about.

The standard form of VACUUM is best used with the goal of maintaining a
fairly level steady-state usage of disk space. The standard form finds old
tuples and makes their space available for re-use within the table, but it
does not try very hard to shorten the table file and return disk space to
the operating system. If you need to return disk space to the operating
system you can use VACUUM FULL --- but what's the point of releasing disk
space that will only have to be allocated again soon? Moderately frequent
standard VACUUMs are a better approach than infrequent VACUUM FULLs for
maintaining heavily-updated tables.

Recommended practice for most sites is to schedule a database-wide VACUUM
once a day at a low-usage time of day, supplemented by more frequent
vacuuming of heavily-updated tables if necessary. (If you have multiple
databases in an installation, don't forget to vacuum each one; the vacuumdb
script may be helpful.) Use plain VACUUM, not VACUUM FULL, for routine
vacuuming for space recovery.

VACUUM FULL is recommended for cases where you know you have deleted the
majority of tuples in a table, so that the steady-state size of the table
can be shrunk substantially with VACUUM FULL's more aggressive approach.

If you have a table whose contents are deleted completely every so often,
consider doing it with TRUNCATE rather than using DELETE followed by VACUUM.
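For example (the request_log table and its logged_at column are
hypothetical), the two cases look like this:

    -- Case 1: most, but not all, rows are purged.  After the DELETE,
    -- VACUUM FULL can actually shrink the table file and return the space
    -- to the operating system, which plain VACUUM would not do.
    DELETE FROM request_log WHERE logged_at < now() - interval '30 days';
    VACUUM FULL request_log;

    -- Case 2: the table is emptied completely.  TRUNCATE reclaims the space
    -- immediately and avoids scanning the rows at all.
    TRUNCATE TABLE request_log;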
Updating planner statistics

The PostgreSQL query planner relies on statistical information about the
contents of tables in order to generate good plans for queries. These
statistics are gathered by the ANALYZE command, which can be invoked by
itself or as an optional step in VACUUM. It is important to have reasonably
accurate statistics; otherwise poor choices of plans may degrade database
performance.

As with vacuuming for space recovery, frequent updates of statistics are
more useful for heavily-updated tables than for seldom-updated ones. But
even for a heavily-updated table, there may be no need for statistics
updates if the statistical distribution of the data is not changing much. A
simple rule of thumb is to think about how much the minimum and maximum
values of the columns in the table change. For example, a timestamp column
that contains the time of row update will have a constantly-increasing
maximum value as rows are added and updated; such a column will probably
need more frequent statistics updates than, say, a column containing URLs
for pages accessed on a website. The URL column may receive changes just as
often, but the statistical distribution of its values probably changes
relatively slowly.

It is possible to run ANALYZE on specific tables and even just specific
columns of a table, so the flexibility exists to update some statistics more
frequently than others if your application requires it. In practice,
however, the usefulness of this feature is doubtful. Beginning in PostgreSQL
7.2, ANALYZE is a fairly fast operation even on large tables, because it
uses a statistical random sampling of the rows of a table rather than
reading every single row. So it's probably much simpler to just run it over
the whole database every so often.

Although per-column tweaking of ANALYZE frequency may not be very
productive, you may well find it worthwhile to do per-column adjustment of
the level of detail of the statistics collected by ANALYZE. Columns that are
heavily used in WHERE clauses and have highly irregular data distributions
may require a finer-grain data histogram than other columns. See ALTER TABLE
SET STATISTICS.
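As a sketch of these per-table and per-column options (the orders table and
its columns are invented for the example):

    -- Analyze a single table, or only the columns your queries actually
    -- filter on.
    ANALYZE orders;
    ANALYZE orders (order_date, customer_id);

    -- Request a finer-grained histogram for a column with an irregular
    -- distribution that appears often in WHERE clauses, then re-analyze.
    ALTER TABLE orders ALTER COLUMN order_date SET STATISTICS 100;
    ANALYZE orders (order_date);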
Recommended practice for most sites is to schedule a database-wide ANALYZE
once a day at a low-usage time of day; this can usefully be combined with a
nightly VACUUM. However, sites with relatively slowly changing table
statistics may find that this is overkill, and that less-frequent ANALYZE
runs are sufficient.

Preventing transaction ID wraparound failures

PostgreSQL's MVCC transaction semantics depend on being able to compare
transaction ID (XID) numbers: a tuple with an insertion XID newer than the
current transaction's XID is in the future and should not be visible to the
current transaction. But since transaction IDs have limited size (32 bits at
this writing), an installation that runs for a long time (more than 4
billion transactions) will suffer transaction ID wraparound: the XID counter
wraps around to zero, and all of a sudden transactions that were in the past
appear to be in the future --- which means their outputs become invisible.
In short, catastrophic data loss. (Actually the data is still there, but
that's cold comfort if you can't get at it.)

Prior to PostgreSQL 7.2, the only defense against XID wraparound was to
re-initdb at least every 4 billion transactions. This of course was not very
satisfactory for high-traffic sites, so a better solution has been devised.
The new approach allows an installation to remain up indefinitely, without
initdb or any sort of restart. The price is this maintenance requirement:
every table in the database must be VACUUMed more often than once every
billion transactions.

In practice this isn't an onerous requirement, but since the consequences of
failing to meet it can be complete data loss (not just wasted disk space or
slow performance), some special provisions have been made to help database
administrators keep track of the time since the last VACUUM. The remainder
of this section gives the details.

The new approach to XID comparison distinguishes two special XIDs, numbers 1
and 2 (BootstrapXID and FrozenXID). These two XIDs are always considered
older than every normal XID. Normal XIDs (those greater than 2) are compared
using modulo-2^31 arithmetic. This means that for every normal XID, there
are two billion XIDs that are older and two billion that are newer; another
way to say it is that the normal XID space is circular with no endpoint.
Therefore, once a tuple has been created with a particular normal XID, the
tuple will appear to be in the past for the next two billion transactions,
no matter which normal XID we are talking about. If the tuple still exists
after more than two billion transactions, it would suddenly appear to be in
the future. To prevent data loss, old tuples must be reassigned the XID
FrozenXID sometime before they reach the two-billion-transactions-old mark.
Once they are assigned this special XID, they will appear to be in the past
to all normal transactions regardless of wraparound issues, and so such
tuples will be good until deleted, no matter how long that is. This
reassignment of XID is handled by VACUUM.

VACUUM's normal policy is to reassign FrozenXID to any tuple with a normal
XID more than one billion transactions in the past. This policy preserves
the original insertion XID until it is not likely to be of interest anymore
(in fact, most tuples will probably live and die without ever being
frozen). With this policy, the maximum safe interval between VACUUMs of any
table is exactly one billion transactions: if you wait longer, it's possible
that a tuple that was not quite old enough to be reassigned last time is now
more than two billion transactions old and has wrapped around into the
future --- i.e., is lost to you. (Of course, it'll reappear after another
two billion transactions, but that's no help.)

Since periodic VACUUMs are needed anyway for the reasons described earlier,
it's unlikely that any table would not be vacuumed for as long as a billion
transactions. But to help administrators ensure this constraint is met,
VACUUM stores transaction ID statistics in the system table pg_database. In
particular, the datfrozenxid field of a database's pg_database row is
updated at the completion of any database-wide vacuum operation (i.e., a
VACUUM that does not name a specific table). The value stored in this field
is the freeze cutoff XID that was used by that VACUUM command. All normal
XIDs older than this cutoff XID are guaranteed to have been replaced by
FrozenXID within that database. A convenient way to examine this information
is to execute the query

    select datname, age(datfrozenxid) from pg_database;

The age column measures the number of transactions from the cutoff XID to
the current transaction's XID.

With the standard freezing policy, the age column will start at one billion
for a freshly-vacuumed database. When the age approaches two billion, the
database must be vacuumed again to avoid risk of wraparound failures.
Recommended practice is to vacuum each database at least once every
half-a-billion (500 million) transactions, so as to provide plenty of safety
margin. To help meet this rule, each database-wide VACUUM automatically
delivers a warning if there are any pg_database entries showing an age of
more than 1.5 billion transactions, for example:

    play=# vacuum;
    NOTICE:  Some databases have not been vacuumed in 1613770184 transactions.
            Better vacuum them within 533713463 transactions,
            or you may have a wraparound failure.
    VACUUM
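Building on the query above, a simple periodic check might look like the
following sketch; the 1.5 billion threshold mirrors the point at which
VACUUM itself starts warning:

    -- A freshly vacuumed database shows an age of about one billion, so an
    -- age above 1.5 billion means more than 500 million transactions have
    -- passed since the last database-wide VACUUM.  Vacuum any database
    -- listed here soon.
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    WHERE age(datfrozenxid) > 1500000000
    ORDER BY xid_age DESC;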
VACUUM with the FREEZE option uses a more aggressive freezing policy: tuples
are frozen if they are old enough to be considered good by all open
transactions. In particular, if a VACUUM FREEZE is performed in an
otherwise-idle database, it is guaranteed that all tuples in that database
will be frozen. Hence, as long as the database is not modified in any way,
it will not need subsequent vacuuming to avoid transaction ID wraparound
problems. This technique is used by initdb to prepare the template0
database. It should also be used to prepare any user-created databases that
are to be marked datallowconn = false in pg_database, since there isn't any
convenient way to vacuum a database that you can't connect to. Note that
VACUUM's automatic warning message about unvacuumed databases will ignore
pg_database entries with datallowconn = false, so as to avoid giving false
warnings about these databases; therefore it's up to you to ensure that such
databases are frozen correctly.

diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index fd340b0c94..ea65b3c283 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -1,5 +1,5 @@
@@ -640,7 +640,7 @@ CAST ( 'string' AS type )
     OID
-      The unique identifier (object ID) of a row. This is a serial number
+      The object identifier (object ID) of a row. This is a serial number
       that is automatically added by Postgres to all table rows (unless
       the table was created WITHOUT OIDS, in which case this column is
       not present).
@@ -688,7 +688,7 @@ CAST ( 'string' AS type )
       The identity (transaction ID) of the deleting transaction, or zero
       for an undeleted tuple. It is possible for this field
-      to be nonzero in a visible tuple: that indicates that the
+      to be nonzero in a visible tuple: that usually indicates that the
       deleting transaction hasn't committed yet, or that an attempted
       deletion was rolled back.
@@ -712,7 +712,8 @@ CAST ( 'string' AS type )
       (block number, tuple index within block) that identifies the
       physical location of the tuple. Note that although the ctid can be
       used to locate the tuple very quickly, a row's ctid
-      will change each time it is updated or moved by VACUUM.
+      will change each time it is updated or moved by VACUUM
+      FULL.
       Therefore ctid is useless as a long-term row identifier. The OID,
       or even better a user-defined serial number, should be used to
       identify logical rows.
@@ -737,7 +738,19 @@ CAST ( 'string' AS type )
-      Transaction and command identifiers are 32-bit quantities.
+      Transaction identifiers are 32-bit quantities. In a long-lived
+      database it is possible for transaction IDs to wrap around. This
+      is not a fatal problem given appropriate maintenance procedures;
+      see the Administrator's Guide for details. However, it is unwise
+      to depend on uniqueness of transaction IDs over the long term
+      (more than one billion transactions).
+
+      Command identifiers are also 32-bit quantities. This creates a hard
+      limit of 2^32 (4 billion) SQL commands within a single transaction.
+      In practice this limit is not a problem --- note that the limit is on
+      the number of SQL queries, not the number of tuples processed.
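To illustrate the point about ctid and OID (the parts table and its columns
are made up for the example), one can watch a row's ctid change across an
UPDATE while its OID stays the same:

    -- System columns can be selected by name like ordinary columns.
    SELECT oid, ctid, xmin, xmax, part_no, in_stock
    FROM parts WHERE part_no = 42;

    -- After an UPDATE the same logical row has a new ctid (and a new xmin),
    -- but its OID is unchanged, which is why OID or a user-defined serial
    -- number, not ctid, should be used as a long-term row identifier.
    UPDATE parts SET in_stock = in_stock - 1 WHERE part_no = 42;
    SELECT oid, ctid, xmin, xmax, part_no, in_stock
    FROM parts WHERE part_no = 42;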