postgresql

Commit Graph

Author	SHA1	Message	Date
Andres Freund	83ff1618bc	Centralize definition of integer limits. Several submitted and even committed patches have run into the problem that C89, our baseline, does not provide minimum/maximum values for various integer datatypes. C99's stdint.h does, but we can't rely on it. Several parts of the code defined limits locally, so instead centralize the definitions to c.h. This patch also changes the more obvious usages of literal limit values; there's more places that could be changed, but it's less clear whether it's beneficial to change those. Author: Andrew Gierth Discussion: 87619tc5wc.fsf@news-spur.riddles.org.uk	2015-03-25 22:39:42 +01:00
Andres Freund	87cec51d3a	Don't delay replication for less than recovery_min_apply_delay's resolution. Recovery delays are implemented by waiting on a latch, and latches take milliseconds as a parameter. The required amount of waiting was computed using microsecond resolution though and the wait loop's abort condition was checking the delay in microseconds as well. This could lead to short spurts of busy looping when the overall wait time was below a millisecond, but above 0 microseconds. Instead just formulate the wait loop's abort condition in millisecond granularity as well. Given that that's recovery_min_apply_delay resolution, it seems harmless to not wait for less than a millisecond. Backpatch to 9.4 where recovery_min_apply_delay was introduced. Discussion: 20150323141819.GH26995@alap3.anarazel.de	2015-03-23 16:51:11 +01:00
Andres Freund	a1105c3dd4	Fix copy & paste error in `4f1b890b13`. Due to the bug delayed standbys would not delay when applying prepared transactions. Discussion: CAB7nPqT6BO1cCn+sAyDByBxA4EKZNAiPi2mFJ=ANeZmnmewRyg@mail.gmail.com Michael Paquier via Coverity.	2015-03-23 15:53:40 +01:00
Andres Freund	4f1b890b13	Merge the various forms of transaction commit & abort records. Since `465883b0a` two versions of commit records have existed. A compact version that was used when no cache invalidations, smgr unlinks and similar were needed, and a full version that could deal with all that. Additionally the full version was embedded into twophase commit records. That resulted in a measurable reduction in the size of the logged WAL in some workloads. But more recently additions like logical decoding, which e.g. needs information about the database something was executed on, made it applicable in fewer situations. The static split generally made it hard to expand the commit record, because concerns over the size made it hard to add anything to the compact version. Additionally it's not particularly pretty to have twophase.c insert RM_XACT records. Rejigger things so that the commit and abort records only have one form each, including the twophase equivalents. The presence of the various optional (in the sense of not being in every record) pieces is indicated by a bits in the 'xinfo' flag. That flag previously was not included in compact commit records. To prevent an increase in size due to its presence, it's only included if necessary; signalled by a bit in the xl_info bits available for xact.c, similar to heapam.c's XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE. Twophase commit/aborts are now the same as their normal counterparts. The original transaction's xid is included in an optional data field. This means that commit records generally are smaller, except in the case of a transaction with subtransactions, but no other special cases; the increase there is four bytes, which seems acceptable given that the more common case of not having subtransactions shrank. The savings are especially measurable for twophase commits, which previously always used the full version; but will in practice only infrequently have required that. The motivation for this work are not the space savings and and deduplication though; it's that it makes it easier to extend commit records with additional information. That's just a few lines of code now; without impacting the common case where that information is not needed. Discussion: 20150220152150.GD4149@awork2.anarazel.de, 235610.92468.qm%40web29004.mail.ird.yahoo.com Reviewed-By: Heikki Linnakangas, Simon Riggs	2015-03-15 17:37:07 +01:00
Andres Freund	a0f5954af1	Increase max_wal_size's default from 128MB to 1GB. The introduction of min_wal_size & max_wal_size in `88e9823026` makes it feasible to increase the default upper bound in checkpoint size. Previously raising the default would lead to a increased disk footprint, even if more segments weren't beneficial. The low default of checkpoint size is one of common performance problem users have thus increasing the default makes sense. Setups where the increase in maximum disk usage is a problem will very likely have to run with a modified configuration anyway. Discussion: 54F4EFB8.40202@agliodbs.com, CA+TgmoZEAgX5oMGJOHVj8L7XOkAe05Gnf45rP40m-K3FhZRVKg@mail.gmail.com Author: Josh Berkus, after a discussion involving lots of people.	2015-03-15 17:37:07 +01:00
Andres Freund	51c11a7025	Remove pause_at_recovery_target recovery.conf setting. The new recovery_target_action (introduced in aedccb1f6/b8e33a85d4) replaces it's functionality. Having both seems likely to cause more confusion than it saves worry due to the incompatibility. Discussion: 5484FC53.2060903@2ndquadrant.com Author: Petr Jelinek	2015-03-15 17:37:07 +01:00
Fujii Masao	57aa5b2bb1	Add GUC to enable compression of full page images stored in WAL. When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server compresses a full page image written to WAL when full_page_writes is on or during a base backup. A compressed page image will be decompressed during WAL replay. Turning this parameter on can reduce the WAL volume without increasing the risk of unrecoverable data corruption, but at the cost of some extra CPU spent on the compression during WAL logging and on the decompression during WAL replay. This commit changes the WAL format (so bumping WAL version number) so that the one-byte flag indicating whether a full page image is compressed or not is included in its header information. This means that the commit increases the WAL volume one-byte per a full page image even if WAL compression is not used at all. We can save that one-byte by borrowing one-bit from the existing field like hole_offset in the header and using it as the flag, for example. But which would reduce the code readability and the extensibility of the feature. Per discussion, it's not worth paying those prices to save only one-byte, so we decided to add the one-byte flag to the header. This commit doesn't introduce any new compression algorithm like lz4. Currently a full page image is compressed using the existing PGLZ algorithm. Per discussion, we decided to use it at least in the first version of the feature because there were no performance reports showing that its compression ratio is unacceptably lower than that of other algorithm. Of course, in the future, it's worth considering the support of other compression algorithm for the better compression. Rahila Syed and Michael Paquier, reviewed in various versions by myself, Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.	2015-03-11 15:52:24 +09:00
Alvaro Herrera	4f3924d9cd	Keep CommitTs module in sync in standby and master We allow this module to be turned off on restarts, so a restart time check is enough to activate or deactivate the module; however, if there is a standby replaying WAL emitted from a master which is restarted, but the standby isn't, the state in the standby becomes inconsistent and can easily be crashed. Fix by activating and deactivating the module during WAL replay on parameter change as well as on system start. Problem reported by Fujii Masao in http://www.postgresql.org/message-id/CAHGQGwFhJ3CnHo1CELEfay18yg_RA-XZT-7D8NuWUoYSZ90r4Q@mail.gmail.com Author: Petr Jelínek	2015-03-09 17:44:00 -03:00
Heikki Linnakangas	88e9823026	Replace checkpoint_segments with min_wal_size and max_wal_size. Instead of having a single knob (checkpoint_segments) that both triggers checkpoints, and determines how many checkpoints to recycle, they are now separate concerns. There is still an internal variable called CheckpointSegments, which triggers checkpoints. But it no longer determines how many segments to recycle at a checkpoint. That is now auto-tuned by keeping a moving average of the distance between checkpoints (in bytes), and trying to keep that many segments in reserve. The advantage of this is that you can set max_wal_size very high, but the system won't actually consume that much space if there isn't any need for it. The min_wal_size sets a floor for that; you can effectively disable the auto-tuning behavior by setting min_wal_size equal to max_wal_size. The max_wal_size setting is now the actual target size of WAL at which a new checkpoint is triggered, instead of the distance between checkpoints. Previously, you could calculate the actual WAL usage with the formula "(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this patch, you set the desired WAL usage with max_wal_size, and the system calculates the appropriate CheckpointSegments with the reverse of that formula. That's a lot more intuitive for administrators to set. Reviewed by Amit Kapila and Venkata Balaji N.	2015-02-23 18:53:02 +02:00
Fujii Masao	5d2b45e3f7	Add GUC to control the time to wait before retrieving WAL after failed attempt. Previously when the standby server failed to retrieve WAL files from any sources (i.e., streaming replication, local pg_xlog directory or WAL archive), it always waited for five seconds (hard-coded) before the next attempt. For example, this is problematic in warm-standby because restore_command can fail every five seconds even while new WAL file is expected to be unavailable for a long time and flood the log files with its error messages. This commit adds new parameter, wal_retrieve_retry_interval, to control that wait time. Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.	2015-02-23 20:55:17 +09:00
Heikki Linnakangas	49b04188f8	Fix thinko in re-setting wal_log_hints flag from a parameter-change record. The flag is supposed to be copied from the record. Same issue with track_commit_timestamps, but that's master-only. Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was added.	2015-01-15 20:52:41 +02:00
Heikki Linnakangas	1e78d81e88	Don't open a WAL segment for writing at end of recovery. Since commit `ba94518a`, we used XLogFileOpen to open the next segment for writing, but if the end-of-recovery happens exactly at a segment boundary, the new segment might not exist yet. (Before `ba94518a`, XLogFileOpen was correct, because we would open the previous segment if the switch happened at the boundary.) Instead of trying to create it if necessary, it's simpler to not bother opening the segment at all. XLogWrite() will open or create it soon anyway, after writing the checkpoint or end-of-recovery record. Reported by Andres Freund.	2015-01-07 16:20:20 +02:00
Bruce Momjian	4baaf863ec	Update copyright for 2015 Backpatch certain files through 9.0	2015-01-06 11:43:47 -05:00
Tom Lane	d6657d2a10	Treat negative values of recovery_min_apply_delay as having no effect. At one point in the development of this feature, it was claimed that allowing negative values would be useful to compensate for timezone differences between master and slave servers. That was based on a mistaken assumption that commit timestamps are recorded in local time; but of course they're in UTC. Nor is a negative apply delay likely to be a sane way of coping with server clock skew. However, the committed patch still treated negative delays as doing something, and the timezone misapprehension survived in the user documentation as well. If recovery_min_apply_delay were a proper GUC we'd just set the minimum allowed value to be zero; but for the moment it seems better to treat negative settings as if they were zero. In passing do some extra wordsmithing on the parameter's documentation, including correcting a second misstatement that the parameter affects processing of Restore Point records. Issue noted by Michael Paquier, who also provided the code patch; doc changes by me. Back-patch to 9.4 where the feature was introduced.	2015-01-03 13:14:03 -05:00
Heikki Linnakangas	2ef6c66a2b	Fix file descriptor leak at end of recovery. XLogFileInit() returns a file descriptor, which needs to be closed. The leak was short-lived, since the startup process exits shortly afterwards, but it was clearly a bug, nevertheless. Per Coverity report.	2014-12-21 21:51:59 +02:00
Heikki Linnakangas	5c805d0a81	Fix timestamp in end-of-recovery WAL records. We used time(null) to set a TimestampTz field, which gave bogus results. Noticed while looking at pg_xlogdump output. Backpatch to 9.3 and above, where the fast promotion was introduced.	2014-12-19 17:04:20 +02:00
Heikki Linnakangas	ba94518aad	Change how first WAL segment on new timeline after promotion is created. Two changes: 1. When copying a WAL segment from old timeline to create the first segment on the new timeline, only copy up to the point where the timeline switch happens, and zero-fill the rest. This avoids corner cases where we might think that the copied WAL from the previous timeline belong to the new timeline. 2. If the timeline switch happens at a segment boundary, don't copy the whole old segment to the new timeline. It's pointless, because it's 100% identical to the old segment.	2014-12-18 20:23:03 +02:00
Andres Freund	c303e9e7e5	Fix (re-)starting from a basebackup taken off a standby after a failure. When starting up from a basebackup taken off a standby extra logic has to be applied to compute the point where the data directory is consistent. Normal base backups use a WAL record for that purpose, but that isn't possible on a standby. That logic had a error check ensuring that the cluster's control file indicates being in recovery. Unfortunately that check was too strict, disregarding the fact that the control file could also indicate that the cluster was shut down while in recovery. That's possible when the a cluster starting from a basebackup is shut down before the backup label has been removed. When everything goes well that's a short window, but when either restore_command or primary_conninfo isn't configured correctly the window can get much wider. That's because inbetween reading and unlinking the label we restore the last checkpoint from WAL which can need additional WAL. To fix simply also allow starting when the control file indicates "shutdown in recovery". There's nicer fixes imaginable, but they'd be more invasive. Backpatch to 9.2 where support for taking basebackups from standbys was added.	2014-12-18 08:47:27 +01:00
Simon Riggs	b8e33a85d4	Tweaks for recovery_target_action Rename parameter action_at_recovery_target to recovery_target_action suggested by Christoph Berg. Place into recovery.conf suggested by Fujii Masao, replacing (deprecating) earlier parameters, per Michael Paquier.	2014-12-07 21:55:29 +09:00
Alvaro Herrera	73c986adde	Keep track of transaction commit timestamps Transactions can now set their commit timestamp directly as they commit, or an external transaction commit timestamp can be fed from an outside system using the new function TransactionTreeSetCommitTsData(). This data is crash-safe, and truncated at Xid freeze point, same as pg_clog. This module is disabled by default because it causes a performance hit, but can be enabled in postgresql.conf requiring only a server restart. A new test in src/test/modules is included. Catalog version bumped due to the new subdirectory within PGDATA and a couple of new SQL functions. Authors: Álvaro Herrera and Petr Jelínek Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven Singer, Peter Eisentraut	2014-12-03 11:53:02 -03:00
Heikki Linnakangas	afeacd2748	Fix assertion failure at end of PITR. InitXLogInsert() cannot be called in a critical section, because it allocates memory. But CreateCheckPoint() did that, when called for the end-of-recovery checkpoint by the startup process. In the passing, fix the scratch space allocation in InitXLogInsert to go to the right memory context. Also update the comment at InitXLOGAccess, which hasn't been totally accurate since hot standby was introduced (in a hot standby backend, InitXLOGAccess isn't called at backend startup). Reported by Michael Paquier	2014-11-28 09:31:53 +02:00
Simon Riggs	aedccb1f6f	action_at_recovery_target recovery config option action_at_recovery_target = pause \| promote \| shutdown Petr Jelinek Reviewed by Muhammad Asif Naeem, Fujji Masao and Simon Riggs	2014-11-25 20:13:30 +00:00
Heikki Linnakangas	0bd624d63b	Distinguish XLOG_FPI records generated for hint-bit updates. Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images generated for hint bit updates, when checksums are enabled. The new record type is replayed exactly the same as XLOG_FPI, but allows them to be tallied separately e.g. in pg_xlogdump.	2014-11-24 11:09:08 +02:00
Heikki Linnakangas	2c03216d83	Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.	2014-11-20 18:46:41 +02:00
Andres Freund	d3586fc8aa	Ensure unlogged tables are reset even if crash recovery errors out. Unlogged relations are reset at the end of crash recovery as they're only synced to disk during a proper shutdown. Unfortunately that and later steps can fail, e.g. due to running out of space. This reset was, up to now performed after marking the database as having finished crash recovery successfully. As out of space errors trigger a crash restart that could lead to the situation that not all unlogged relations are reset. Once that happend usage of unlogged relations could yield errors like "could not open file "...": No such file or directory". Luckily clusters that show the problem can be fixed by performing a immediate shutdown, and starting the database again. To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier, before marking the database as having successfully recovered. Discussion: 20140912112246.GA4984@alap3.anarazel.de Backpatch to 9.1 where unlogged tables were introduced. Abhijit Menon-Sen and Andres Freund	2014-11-15 01:19:26 +01:00
Heikki Linnakangas	7250d8535b	Fix building with WAL_DEBUG. Now that the backup blocks are appended to the WAL record in xloginsert.c, XLogInsert doesn't see them anymore and cannot remove them from the version reconstructed for xlog_outdesc. This makes running with wal_debug=on more expensive, as we now make (unnecessary) temporary copies of the backup blocks, but it doesn't seem worth convoluting the code to keep that optimization. Reported by Alvaro Herrera.	2014-11-07 23:09:31 +02:00
Heikki Linnakangas	2076db2aea	Move the backup-block logic from XLogInsert to a new file, xloginsert.c. xlog.c is huge, this makes it a little bit smaller, which is nice. Functions related to putting together the WAL record are in xloginsert.c, and the lower level stuff for managing WAL buffers and such are in xlog.c. Also move the definition of XLogRecord to a separate header file. This causes churn in the #includes of all the files that write WAL records, and redo routines, but it avoids pulling in xlog.h into most places. Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.	2014-11-06 13:55:36 +02:00
Heikki Linnakangas	5028f22f6e	Switch to CRC-32C in WAL and other places. The old algorithm was found to not be the usual CRC-32 algorithm, used by Ethernet et al. We were using a non-reflected lookup table with code meant for a reflected lookup table. That's a strange combination that AFAICS does not correspond to any bit-wise CRC calculation, which makes it difficult to reason about its properties. Although it has worked well in practice, seems safer to use a well-known algorithm. Since we're changing the algorithm anyway, we might as well choose a different polynomial. The Castagnoli polynomial has better error-correcting properties than the traditional CRC-32 polynomial, even if we had implemented it correctly. Another reason for picking that is that some new CPUs have hardware support for calculating CRC-32C, but not CRC-32, let alone our strange variant of it. This patch doesn't add any support for such hardware, but a future patch could now do that. The old algorithm is kept around for tsquery and pg_trgm, which use the values in indexes that need to remain compatible so that pg_upgrade works. While we're at it, share the old lookup table for CRC-32 calculation between hstore, ltree and core. They all use the same table, so might as well.	2014-11-04 11:39:48 +02:00
Fujii Masao	c7371c4a60	Prevent the already-archived WAL file from being archived again. Previously the archive recovery always created .ready file for the last WAL file of the old timeline at the end of recovery even when it's restored from the archive and has .done file. That is, there was the case where the WAL file had both .ready and .done files. This caused the already-archived WAL file to be archived again. This commit prevents the archive recovery from creating .ready file for the last WAL file if it has .done file, in order to prevent it from being archived again. This bug was added when cascading replication feature was introduced, i.e., the commit `5286105800`. So, back-patch to 9.2, where cascading replication was added. Reviewed by Michael Paquier	2014-10-23 16:21:27 +09:00
Andres Freund	5e5b65f359	Don't duplicate log_checkpoint messages for both of restart and checkpoints. The duplication originated in `cdd46c765`, where restartpoints were introduced. In LogCheckpointStart's case the duplication actually lead to the compiler's format string checking not to be effective because the format string wasn't constant. Arguably these messages shouldn't be elog(), but ereport() style messages. That'd even allow to translate the messages... But as there's more mistakes of that kind in surrounding code, it seems better to change that separately.	2014-10-21 01:01:56 +02:00
Andres Freund	11abd6c90f	Renumber CHECKPOINT_* flags. Commit `7dbb606938` added a new CHECKPOINT_FLUSH_ALL flag. As that commit needed to be backpatched I didn't change the numeric values of the existing flags as that could lead to nastly problems if any external code issued checkpoints. That's not a concern on master, so renumber them there. Also add a comment about CHECKPOINT_FLUSH_ALL above CreateCheckPoint().	2014-10-21 00:20:08 +02:00
Andres Freund	7dbb606938	Flush unlogged table's buffers when copying or moving databases. CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source database directory on the filesystem level. To ensure the on disk state is consistent they block out users of the affected database and force a checkpoint to flush out all data to disk. Unfortunately, up to now, that checkpoint didn't flush out dirty buffers from unlogged relations. That bug means there could be leftover dirty buffers in either the template database, or the database in its old location. Leading to problems when accessing relations in an inconsistent state; and to possible problems during shutdown in the SET TABLESPACE case because buffers belonging files that don't exist anymore are flushed. This was reported in bug #10675 by Maxim Boguk. Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and Fujii Masao. Backpatch to 9.1 where unlogged tables were introduced.	2014-10-20 23:43:46 +02:00
Peter Eisentraut	b7a08c8028	Message improvements	2014-10-12 01:06:35 -04:00
Heikki Linnakangas	5fa6c81a43	Remove num_xloginsert_locks GUC, replace with a #define I left the GUC in place for the beta period, so that people could experiment with different values. No-one's come up with any data that a different value would be better under some circumstances, so rather than try to document to users what the GUC, let's just hard-code the current value, 8.	2014-10-01 16:42:26 +03:00
Andres Freund	ef8863844b	Rename CACHE_LINE_SIZE to PG_CACHE_LINE_SIZE. As noted in http://bugs.debian.org/763098 there is a conflict between postgres' definition of CACHE_LINE_SIZE and the definition by various *bsd platforms. It's debatable who has the right to define such a name, but postgres' use was only introduced in `375d8526f2` (9.4), so it seems like a good idea to rename it. Discussion: 20140930195756.GC27407@msg.df7cb.de Per complaint of Christoph Berg in the above email, although he's not the original bug reporter. Backpatch to 9.4 where the define was introduced.	2014-10-01 12:17:03 +02:00
Andres Freund	6ba4ecbf47	Remove most volatile qualifiers from xlog.c For the reason outlined in `df4077cda2` also remove volatile qualifiers from xlog.c. Some of these uses of volatile have been added after noticing problems back when spinlocks didn't imply compiler barriers. So they are a good test - in fact removing the volatiles breaks when done without the barriers in spinlocks present. Several uses of volatile remain where they are explicitly used to access shared memory without locks. These locations are ok with slightly out of date data, but removing the volatile might lead to the variables never being reread from memory. These uses could also be replaced by barriers, but that's a separate change of doubtful value.	2014-09-22 23:35:08 +02:00
Andres Freund	604f7956b9	Improve code around the recently added rm_identify rmgr callback. There are four weaknesses in728f152e07f998d2cb4fe5f24ec8da2c3bda98f2: * append_init() in heapdesc.c was ugly and required that rm_identify return values are only valid till the next call. Instead just add a couple more switch() cases for the INIT_PAGE cases. Now the returned value will always be valid. * a couple rm_identify() callbacks missed masking xl_info with ~XLR_INFO_MASK. * pg_xlogdump didn't map a NULL rm_identify to UNKNOWN or a similar string. * append_init() was called when id=NULL - which should never actually happen. But it's better to be careful.	2014-09-22 17:49:34 +02:00
Andres Freund	728f152e07	Add rmgr callback to name xlog record types for display purposes. This is primarily useful for the upcoming pg_xlogdump --stats feature, but also allows to remove some duplicated code in the rmgr_desc routines. Due to the separation and harmonization, the output of dipsplayed records changes somewhat. But since this isn't enduser oriented content that's ok. It's potentially desirable to further change pg_xlogdump's display of records. It previously wasn't possible to show the record type separately from the description forcing it to be in the last column. But that's better done in a separate commit. Author: Abhijit Menon-Sen, slightly editorialized by me Reviewed-By: Álvaro Herrera, Andres Freund, and Heikki Linnakangas Discussion: 20140604104716.GA3989@toroid.org	2014-09-19 16:20:29 +02:00
Heikki Linnakangas	54685338e3	Move log_newpage and log_newpage_buffer to xlog.c. log_newpage is used by many indexams, in addition to heap, but for historical reasons it's always been part of the heapam rmgr. Starting with 9.3, we have another WAL record type for logging an image of a page, XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to xlog.c, and switch to using the XLOG_FPI record type. Bump the WAL version number because the code to replay the old HEAP_NEWPAGE records is removed.	2014-07-31 16:48:55 +03:00
Heikki Linnakangas	60d931827b	Oops, fix recoveryStopsBefore functions for regular commits. Pointed out by Tom Lane. Backpatch to 9.4, the code was structured differently in earlier branches and didn't have this mistake.	2014-07-29 17:19:43 +03:00
Heikki Linnakangas	e74e0906fa	Treat 2PC commit/abort the same as regular xacts in recovery. There were several oversights in recovery code where COMMIT/ABORT PREPARED records were ignored: * pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits) * recovery_min_apply_delay (2PC commits were applied immediately) * recovery_target_xid (recovery would not stop if the XID used 2PC) The first of those was reported by Sergiy Zuban in bug #11032, analyzed by Tom Lane and Andres Freund. The bug was always there, but was masked before commit `d19bd29f07`, because COMMIT PREPARED always created an extra regular transaction that was WAL-logged. Backpatch to all supported versions (older versions didn't have all the features and therefore didn't have all of the above bugs).	2014-07-29 11:59:22 +03:00
Robert Haas	250c26ba9c	Fix checkpointer crash in EXEC_BACKEND builds. Nothing in the checkpointer calls InitXLOGAccess(), so WALInsertLocks never got initialized there. Without EXEC_BACKEND, it works anyway because the correct value is inherited from the postmaster, but with EXEC_BACKEND we've got a problem. The problem appears to have been introduced by commit `68a2e52bba`. To fix, move the relevant initialization steps from InitXLOGAccess() to XLOGShmemInit(), making this more parallel to what we do elsewhere. Amit Kapila	2014-07-24 09:12:38 -04:00
Peter Eisentraut	d38228fe40	Add missing serial commas Also update one place where the wal_level "logical" was not added to an error message.	2014-07-15 08:31:50 -04:00
Heikki Linnakangas	1c6821be31	Fix and enhance the assertion of no palloc's in a critical section. The assertion failed if WAL_DEBUG or LWLOCK_STATS was enabled; fix that by using separate memory contexts for the allocations made within those code blocks. This patch introduces a mechanism for marking any memory context as allowed in a critical section. Previously ErrorContext was exempt as a special case. Instead of a blanket exception of the checkpointer process, only exempt the memory context used for the pending ops hash table.	2014-06-30 10:26:00 +03:00
Alvaro Herrera	f741300c90	Have multixact be truncated by checkpoint, not vacuum Instead of truncating pg_multixact at vacuum time, do it only at checkpoint time. The reason for doing it this way is twofold: first, we want it to delete only segments that we're certain will not be required if there's a crash immediately after the removal; and second, we want to do it relatively often so that older files are not left behind if there's an untimely crash. Per my proposal in http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org we now execute the truncation in the checkpointer process rather than as part of vacuum. Vacuum is in only charge of maintaining in shared memory the value to which it's possible to truncate the files; that value is stored as part of checkpoints also, and so upon recovery we can reuse the same value to re-execute truncate and reset the oldest-value-still-safe-to-use to one known to remain after truncation. Per bug reported by Jeff Janes in the course of his tests involving bug #8673. While at it, update some comments that hadn't been updated since multixacts were changed. Backpatch to 9.3, where persistency of pg_multixact files was introduced by commit `0ac5ad5134`.	2014-06-27 14:43:53 -04:00
Heikki Linnakangas	85ba0748ed	Fix bug in WAL_DEBUG. The record header was not copied correctly to the buffer that was passed to the rm_desc function. Broken by my rm_desc signature refactoring patch.	2014-06-23 12:22:36 +03:00
Heikki Linnakangas	0ef0b6784c	Change the signature of rm_desc so that it's passed a XLogRecord. Just feels more natural, and is more consistent with rm_redo.	2014-06-14 10:46:48 +03:00
Andres Freund	e04a9ccd2c	Consistency improvements for slot and decoding code. Change the order of checks in similar functions to be the same; remove a parameter that's not needed anymore; rename a memory context and expand a couple of comments. Per review comments from Amit Kapila	2014-06-12 13:33:27 +02:00
Tom Lane	5f93c37805	Add defenses against running with a wrong selection of LOBLKSIZE. It's critical that the backend's idea of LOBLKSIZE match the way data has actually been divided up in pg_largeobject. While we don't provide any direct way to adjust that value, doing so is a one-line source code change and various people have expressed interest recently in changing it. So, just as with TOAST_MAX_CHUNK_SIZE, it seems prudent to record the value in pg_control and cross-check that the backend's compiled-in setting matches the on-disk data. Also tweak the code in inv_api.c so that fetches from pg_largeobject explicitly verify that the length of the data field is not more than LOBLKSIZE. Formerly we just had Asserts() for that, which is no protection at all in production builds. In some of the call sites an overlength data value would translate directly to a security-relevant stack clobber, so it seems worth one extra runtime comparison to be sure. In the back branches, we can't change the contents of pg_control; but we can still make the extra checks in inv_api.c, which will offer some amount of protection against running with the wrong value of LOBLKSIZE.	2014-06-05 11:31:06 -04:00
Andres Freund	f0c108560b	Consistently spell a replication slot's name as slot_name. Previously there's been a mix between 'slotname' and 'slot_name'. It's not nice to be unneccessarily inconsistent in a new feature. As a post beta1 initdb now is required in the wake of `eeca4cd35e`, fix the inconsistencies. Most the changes won't affect usage of replication slots because the majority of changes is around function parameter names. The prominent exception to that is that the recovery.conf parameter 'primary_slotname' is now named 'primary_slot_name'.	2014-06-05 16:29:20 +02:00
Tom Lane	c1907f0cc4	Fix a bunch of functions that were declared static then defined not-static. Per testing with a compiler that whines about this.	2014-05-17 17:57:53 -04:00
Tom Lane	0d0b2bf175	Rename min_recovery_apply_delay to recovery_min_apply_delay. Per discussion, this seems like a more consistent choice of name. Fabrízio de Royes Mello, after a suggestion by Peter Eisentraut; some additional documentation wordsmithing by me	2014-05-10 19:46:19 -04:00
Bruce Momjian	0a78320057	pgindent run for 9.4 This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.	2014-05-06 12:12:18 -04:00
Tom Lane	5035701e07	Improve generation algorithm for database system identifier. As noted some time ago, the original coding had a typo ("\|" for "^") that made the result less unique than intended. Even the intended behavior is obsolete since it was based on wanting to produce a usable value even if we didn't have int64 arithmetic --- a limitation we stopped supporting years ago. Instead, let's redefine the system identifier as tv_sec in the upper 32 bits (same as before), tv_usec in the next 20 bits, and the low 12 bits of getpid() in the remaining bits. This is still hardly guaranteed-universally-unique, but it's noticeably better than before. Per my proposal at <29019.1374535940@sss.pgh.pa.us>	2014-04-26 15:11:10 -04:00
Bruce Momjian	83defef8c7	report stat() error in trigger file check Permissions might prevent the existence of the trigger file from being checked. Per report from Andres Freund	2014-04-17 11:55:57 -04:00
Heikki Linnakangas	848b9f05ab	Use correctly-sized buffer when zero-filling a WAL file. I mixed up BLCKSZ and XLOG_BLCKSZ when I changed the way the buffer is allocated a couple of weeks ago. With the default settings, they are both 8k, but they can be changed at compile-time.	2014-04-16 10:26:36 +03:00
Heikki Linnakangas	787064cd00	Fix typo in comment. Tomonari Katsumata	2014-04-10 13:11:49 +03:00
Robert Haas	59202fae04	Fix some compiler warnings that clang emits with -pedantic. Andres Freund	2014-04-04 11:29:50 -04:00
Heikki Linnakangas	d9e7873bbb	In checkpoint, move the check for in-progress xacts out of critical section. GetVirtualXIDsDelayingChkpt calls palloc, which isn't safe in a critical section. I thought I covered this case with the exemption for the checkpointer, but CreateCheckPoint is also called from the startup process.	2014-04-04 17:31:22 +03:00
Heikki Linnakangas	877b088785	Avoid allocations in critical sections. If a palloc in a critical section fails, it becomes a PANIC.	2014-04-04 13:35:44 +03:00
Heikki Linnakangas	c2a6724823	Pass more than the first XLogRecData entry to rm_desc, with WAL_DEBUG. If you compile with WAL_DEBUG and enable it with wal_debug=on, we used to only pass the first XLogRecData entry to the rm_desc routine. I think the original assumprion was that the first XLogRecData entry contains all the necessary information for the rm_desc routine, but that's a pretty shaky assumption. At least standby_redo didn't get the memo. To fix, piece together all the data in a temporary buffer, and pass that to the rm_desc routine. It's been like this forever, but the patch didn't apply cleanly to back-branches. Probably wouldn't be hard to fix the conflicts, but it's not worth the trouble.	2014-03-26 18:17:53 +02:00
Fujii Masao	49638868f8	Don't forget to flush XLOG_PARAMETER_CHANGE record. Backpatch to 9.0 where XLOG_PARAMETER_CHANGE record was instroduced.	2014-03-26 02:12:39 +09:00
Heikki Linnakangas	3ed249b741	Fix "the the" typos. Erik Rijkers	2014-03-24 08:42:13 +02:00
Heikki Linnakangas	68a2e52bba	Replace the XLogInsert slots with regular LWLocks. The special feature the XLogInsert slots had over regular LWLocks is the insertingAt value that was updated atomically with releasing backends waiting on it. Add new functions to the LWLock API to do that, and replace the slots with LWLocks. This reduces the amount of duplicated code. (There's still some duplication, but at least it's all in lwlock.c now.) Reviewed by Andres Freund.	2014-03-21 15:10:48 +01:00
Heikki Linnakangas	59a5ab3f42	Remove rm_safe_restartpoint machinery. It is no longer used, none of the resource managers have multi-record actions that would make it unsafe to perform a restartpoint. Also don't allow rm_cleanup to write WAL records, it's also no longer required. Move the call to rm_cleanup routines to make it more symmetric with rm_startup.	2014-03-18 22:10:35 +02:00
Bruce Momjian	886c0be3f6	C comments: remove odd blank lines after #ifdef WIN32 lines	2014-03-13 01:34:42 -04:00
Heikki Linnakangas	a3115f0d9e	Only WAL-log the modified portion in an UPDATE, if possible. When a row is updated, and the new tuple version is put on the same page as the old one, only WAL-log the part of the new tuple that's not identical to the old. This saves significantly on the amount of WAL that needs to be written, in the common case that most fields are not modified. Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.	2014-03-12 23:28:36 +02:00
Heikki Linnakangas	956685f82b	Do wal_level and hot standby checks when doing crash-then-archive recovery. CheckRequiredParameterValues() should perform the checks if archive recovery was requested, even if we are going to perform crash recovery first. Reported by Kyotaro HORIGUCHI. Backpatch to 9.2, like the crash-then-archive recovery mode.	2014-03-05 14:48:14 +02:00
Heikki Linnakangas	af246c37c0	Fix lastReplayedEndRecPtr calculation when starting from shutdown checkpoint. When entering crash recovery followed by archive recovery, and the latest checkpoint is a shutdown checkpoint, and there are no more WAL records to replay before transitioning from crash to archive recovery, we would not immediately allow read-only connections in hot standby mode even if we could. That's because when starting from a shutdown checkpoint, we set lastReplayedEndRecPtr incorrectly to the record before the checkpoint record, instead of the checkpoint record itself. We don't run the redo routine of the shutdown checkpoint record, but starting recovery from it goes through the same motions, so it should be considered as replayed. Reported by Kyotaro HORIGUCHI. All versions with hot standby are affected, so backpatch to 9.0.	2014-03-05 13:51:19 +02:00
Robert Haas	b89e151054	Introduce logical decoding. This feature, building on previous commits, allows the write-ahead log stream to be decoded into a series of logical changes; that is, inserts, updates, and deletes and the transactions which contain them. It is capable of handling decoding even across changes to the schema of the effected tables. The output format is controlled by a so-called "output plugin"; an example is included. To make use of this in a real replication system, the output plugin will need to be modified to produce output in the format appropriate to that system, and to perform filtering. Currently, information can be extracted from the logical decoding system only via SQL; future commits will add the ability to stream changes via walsender. Andres Freund, with review and other contributions from many other people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan, Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve Singer.	2014-03-03 16:32:18 -05:00
Heikki Linnakangas	d8a42b150f	Remove bogus while-loop. Commit `abf5c5c9a4` added a bogus while- statement after the for(;;)-loop. It went unnoticed in testing, because it was dead code. Report by KONDO Mitsumasa. Backpatch to 9.3. The commit that introduced this was also applied to 9.2, but not the bogus while-loop part, because the code in 9.2 looks quite different.	2014-02-28 13:33:41 +02:00
Heikki Linnakangas	8f09ca436d	Improve comment on setting data_checksum GUC. There was an extra space there, and "fixed" wasn't very descriptive.	2014-02-20 10:58:30 +02:00
Heikki Linnakangas	057152b37c	Fix comment; checkpointer, not bgwriter, performs checkpoints since 9.2. Amit Langote	2014-02-18 09:48:18 +02:00
Tom Lane	01824385ae	Prevent potential overruns of fixed-size buffers. Coverity identified a number of places in which it couldn't prove that a string being copied into a fixed-size buffer would fit. We believe that most, perhaps all of these are in fact safe, or are copying data that is coming from a trusted source so that any overrun is not really a security issue. Nonetheless it seems prudent to forestall any risk by using strlcpy() and similar functions. Fixes by Peter Eisentraut and Jozef Mlich based on Coverity reports. In addition, fix a potential null-pointer-dereference crash in contrib/chkpass. The crypt(3) function is defined to return NULL on failure, but chkpass.c didn't check for that before using the result. The main practical case in which this could be an issue is if libc is configured to refuse to execute unapproved hashing algorithms (e.g., "FIPS mode"). This ideally should've been a separate commit, but since it touches code adjacent to one of the buffer overrun changes, I included it in this commit to avoid last-minute merge issues. This issue was reported by Honza Horak. Security: CVE-2014-0065 for buffer overruns, CVE-2014-0066 for crypt()	2014-02-17 11:20:21 -05:00
Heikki Linnakangas	4d894b41cd	Change the order that pg_xlog and WAL archive are polled for WAL segments. If there is a WAL segment with same ID but different TLI present in both the WAL archive and pg_xlog, prefer the one with higher TLI. Before this patch, the archive was polled first, for all expected TLIs, and only if no file was found was pg_xlog scanned. This was a change in behavior from 9.3, which first scanned archive and pg_xlog for the highest TLI, then archive and pg_xlog for the next highest TLI and so forth. This patch reverts the behavior back to what it was in 9.2. The reason for this is that if for example you try to do archive recovery to timeline 2, which branched off timeline 1, but the WAL for timeline 2 is not archived yet, we would replay past the timeline switch point on timeline 1 using the archived files, before even looking timeline 2's files in pg_xlog Report and patch by Kyotaro Horiguchi. Backpatch to 9.3 where the behavior was changed.	2014-02-14 15:15:09 +02:00
Heikki Linnakangas	d699ba4134	Fix WakeupWaiters() to not wake up an exclusive locker unnecessarily. WakeupWaiters() is supposed to wake up all LW_WAIT_UNTIL_FREE waiters of the slot, but the loop incorrectly also woke up the first LW_EXCLUSIVE waiter, if there was no LW_WAIT_UNTIL_FREE waiters in the queue. Noted by Andres Freund. This code is new in 9.4, so no backpatching.	2014-02-10 15:18:18 +02:00
Robert Haas	858ec11858	Introduce replication slots. Replication slots are a crash-safe data structure which can be created on either a master or a standby to prevent premature removal of write-ahead log segments needed by a standby, as well as (with hot_standby_feedback=on) pruning of tuples whose removal would cause replication conflicts. Slots have some advantages over existing techniques, as explained in the documentation. In a few places, we refer to the type of replication slots introduced by this patch as "physical" slots, because forthcoming patches for logical decoding will also have slots, but with somewhat different properties. Andres Freund and Robert Haas	2014-01-31 22:45:36 -05:00
Heikki Linnakangas	71c6a8e375	Add recovery_target='immediate' option. This allows ending recovery as a consistent state has been reached. Without this, there was no easy way to e.g restore an online backup, without replaying any extra WAL after the backup ended. MauMau and me.	2014-01-25 17:34:04 +02:00
Tom Lane	ac4ef637ad	Allow use of "z" flag in our printf calls, and use it where appropriate. Since C99, it's been standard for printf and friends to accept a "z" size modifier, meaning "whatever size size_t has". Up to now we've generally dealt with printing size_t values by explicitly casting them to unsigned long and using the "l" modifier; but this is really the wrong thing on platforms where pointers are wider than longs (such as Win64). So let's start using "z" instead. To ensure we can do that on all platforms, teach src/port/snprintf.c to understand "z", and add a configure test to force use of that implementation when the platform's version doesn't handle "z". Having done that, modify a bunch of places that were using the unsigned-long hack to use "z" instead. This patch doesn't pretend to have gotten everyplace that could benefit, but it catches many of them. I made an effort in particular to ensure that all uses of the same error message text were updated together, so as not to increase the number of translatable strings. It's possible that this change will result in format-string warnings from pre-C99 compilers. We might have to reconsider if there are any popular compilers that will warn about this; but let's start by seeing what the buildfarm thinks. Andres Freund, with a little additional work by me	2014-01-23 17:18:33 -05:00
Tom Lane	061b079f89	Fix multiple bugs in index page locking during hot-standby WAL replay. In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that: * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this. * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode. * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead. The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself. This has been wrong since Hot Standby was introduced, so back-patch to 9.0.	2014-01-14 17:35:21 -05:00
Heikki Linnakangas	c945af80cf	Refactor checking whether we've reached the recovery target. Makes the replay loop slightly more readable, by separating the concerns of whether to stop and whether to delay, and how to extract the timestamp from a record. This has the user-visible change that the timestamp of the last applied record is now updated after actually applying it. Before, it was updated just before applying it. That meant that pg_last_xact_replay_timestamp() could return the timestamp of a commit record that is in process of being replayed, but not yet applied. Normally the difference is small, but if min_recovery_apply_delay is set, there could be a significant delay between reading a record and applying it. Another behavioral change is that if you recover to a restore point, we stop after the restore point record, not before it. It makes no difference as far as running queries on the server is concerned, as applying a restore point record changes nothing, but if examine the timeline history you will see that the new timeline branched off just after the restore point record, not before it. One practical consequence is that if you do PITR to the new timeline, and set recovery target to the same named restore point again, it will find and stop recovery at the same restore point. Conceptually, I think it makes more sense to consider the restore point as part of the new timeline's history than not. In principle, setting the last-replayed timestamp before actually applying the record was a bug all along, but it doesn't seem worth the risk to backpatch, since min_recovery_apply_delay was only added in 9.4.	2014-01-09 14:00:39 +02:00
Heikki Linnakangas	3739e5ab93	Fix pause_at_recovery_target + recovery_target_inclusive combination. If pause_at_recovery_target is set, recovery pauses before applying the target record, even if recovery_target_inclusive is set. If you then continue with pg_xlog_replay_resume(), it will apply the target record before ending recovery. In other words, if you log in while it's paused and verify that the database looks OK, ending recovery changes its state again, possibly destroying data that you were tring to salvage with PITR. Backpatch to 9.1, this has been broken since pause_at_recovery_target was added.	2014-01-08 23:28:52 +02:00
Heikki Linnakangas	815d71deed	If multiple recovery_targets are specified, use the latest one. The docs say that only one of recovery_target_xid, recovery_target_time, or recovery_target_name can be specified. But the code actually did something different, so that a name overrode time, and xid overrode both time and name. Now the target specified last takes effect, whether it's an xid, time or name. With this patch, we still accept multiple recovery_target settings, even though docs say that only one can be specified. It's a general property of the recovery.conf file parser that you if you specify the same option twice, the last one takes effect, like with postgresql.conf.	2014-01-08 22:26:39 +02:00
Heikki Linnakangas	d59ff6c110	Fix bug in determining when recovery has reached consistency. When starting WAL replay from an online checkpoint, the last replayed WAL record variable was initialized using the checkpoint record's location, even though the records between the REDO location and the checkpoint record had not been replayed yet. That was noted as "slightly confusing" but harmless in the comment, but in some cases, it fooled CheckRecoveryConsistency to incorrectly conclude that we had already reached a consistent state immediately at the beginning of WAL replay. That caused the system to accept read-only connections in hot standby mode too early, and also PANICs with message "WAL contains references to invalid pages". Fix by initializing the variables to the REDO location instead. In 9.2 and above, change CheckRecoveryConsistency() to use lastReplayedEndRecPtr variable when checking if backup end location has been reached. It was inconsistently using EndRecPtr for that check, but lastReplayedEndRecPtr when checking min recovery point. It made no difference before this patch, because in all the places where CheckRecoveryConsistency was called the two variables were the same, but it was always an accident waiting to happen, and would have been wrong after this patch anyway. Report and analysis by Tomonari Katsumata, bug #8686. Backpatch to 9.0, where hot standby was introduced.	2014-01-08 15:03:09 +02:00
Bruce Momjian	7e04792a1c	Update copyright for 2014 Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.	2014-01-07 16:05:30 -05:00
Magnus Hagander	9544cc0d65	Move permissions check from do_pg_start_backup to pg_start_backup And the same for do_pg_stop_backup. The code in do_pg_* is not allowed to access the catalogs. For manual base backups, the permissions check can be handled in the calling function, and for streaming base backups only users with the required permissions can get past the authentication step in the first place. Reported by Antonin Houska, diagnosed by Andres Freund	2014-01-07 17:50:56 +01:00
Robert Haas	4b351841fa	Rename walLogHints to wal_log_hints for easier grepping. Michael Paquier	2014-01-01 20:17:00 -05:00
Fujii Masao	961bf59fb7	Rename wal_log_hintbits to wal_log_hints, per discussion on pgsql-hackers. Sawada Masahiko	2013-12-21 03:33:16 +09:00
Heikki Linnakangas	dde6282500	Fix more instances of "the the" in comments. Plus one instance of "to to" in the docs.	2013-12-13 20:02:01 +02:00
Heikki Linnakangas	50e547096c	Add GUC to enable WAL-logging of hint bits, even with checksums disabled. WAL records of hint bit updates is useful to tools that want to examine which pages have been modified. In particular, this is required to make the pg_rewind tool safe (without checksums). This can also be used to test how much extra WAL-logging would occur if you enabled checksums, without actually enabling them (which you can't currently do without re-initdb'ing). Sawada Masahiko, docs by Samrat Revagade. Reviewed by Dilip Kumar, with further changes by me.	2013-12-13 16:26:14 +02:00
Simon Riggs	36da3cfb45	Allow time delayed standbys and recovery Set min_recovery_apply_delay to force a delay in recovery apply for commit and restore point WAL records. Other records are replayed immediately. Delay is measured between WAL record time and local standby time. Robert Haas, Fabrízio de Royes Mello and Simon Riggs Detailed review by Mitsumasa Kondo	2013-12-12 10:53:20 +00:00
Tom Lane	22310b808d	Remove bogus executable permissions on xlog.c. Apparently fat-fingered in `1a3d104475`. Noted by Peter Geoghegan.	2013-12-11 22:12:25 -05:00
Robert Haas	e55704d8b2	Add new wal_level, logical, sufficient for logical decoding. When wal_level=logical, we'll log columns from the old tuple as configured by the REPLICA IDENTITY facility added in commit `07cacba983`. This makes it possible a properly-configured logical replication solution to correctly follow table updates even if they change the chosen key columns, or, with REPLICA IDENTITY FULL, even if the table has no key at all. Note that updates which do not modify the replica identity column won't log anything extra, making the choice of a good key (i.e. one that will rarely be changed) important to performance when wal_level=logical is configured. Each insert, update, or delete to a catalog table will also log the CMIN and/or CMAX values of stamped by the current transaction. This is necessary because logical decoding will require access to historical snapshots of the catalog in order to decode some data types, and the CMIN/CMAX values that we may need in order to judge row visibility may have been overwritten by the time we need them. Andres Freund, reviewed in various versions by myself, Heikki Linnakangas, KONDO Mitsumasa, and many others.	2013-12-10 19:01:40 -05:00
Alvaro Herrera	1df0122daa	Truncate pg_multixact/'s contents during crash recovery Commit `9dc842f08` of 8.2 era prevented MultiXact truncation during crash recovery, because there was no guarantee that enough state had been setup, and because it wasn't deemed to be a good idea to remove data during crash recovery anyway. Since then, due to Hot-Standby, streaming replication and PITR, the amount of time a cluster can spend doing crash recovery has increased significantly, to the point that a cluster may even never come out of it. This has made not truncating the content of pg_multixact/ not defensible anymore. To fix, take care to setup enough state for multixact truncation before crash recovery starts (easy since checkpoints contain the required information), and move the current end-of-recovery actions to a new TrimMultiXact() function, analogous to TrimCLOG(). At some later point, this should probably done similarly to the way clog.c is doing it, which is to just WAL log truncations, but we can't do that for the back branches. Back-patch to 9.0. 8.4 also has the problem, but since there's no hot standby there, it's much less pressing. In 9.2 and earlier, this patch is simpler than in newer branches, because multixact access during recovery isn't required. Add appropriate checks to make sure that's not happening. Andres Freund	2013-11-29 21:47:15 -03:00
Heikki Linnakangas	1a3d104475	Avoid acquiring spinlock when checking if recovery has finished, for speed. RecoveryIsInProgress() can be called very frequently. During normal operation, it just checks a backend-local variable and returns quickly, but during hot standby, it checks a spinlock-protected shared variable. Those spinlock acquisitions can become a point of contention on a busy hot standby system. Replace the spinlock acquisition with a memory barrier. Per discussion with Andres Freund, Ants Aasma and Merlin Moncure.	2013-11-22 13:07:23 +02:00
Robert Haas	cacbdd7810	Use appendStringInfoString instead of appendStringInfo where possible. This shaves a few cycles, and generally seems like good programming practice. David Rowley	2013-10-31 10:55:59 -04:00
Heikki Linnakangas	5962519b36	TYPEALIGN doesn't work on int64 on 32-bit platforms. The TYPEALIGN macro, and the related ones like MAXALIGN, don't work with values larger than intptr_t, because TYPEALIGN casts the argument to intptr_t to do the arithmetic. That's not a problem when dealing with pointers or lengths or offsets related to pointers, but the XLogInsert scaling patch added a call to MAXALIGN with an XLogRecPtr argument. To fix, add wider variants of the macros, called TYPEALIGN64 and MAXALIGN64, which are just like the existing variants but work with uint64 instead of intptr_t. Report and patch by David Rowley, analysis by Andres Freund.	2013-10-08 01:59:57 +03:00
Heikki Linnakangas	0892ecbc01	Add a GUC to report whether data page checksums are enabled. Bernd Helmle	2013-09-16 14:36:01 +03:00
Jeff Davis	b1892aaeaa	Revert WAL posix_fallocate() patches. This reverts commit `269e780822` and commit `5b571bb8c8`. Unfortunately, the initial patch had insufficient performance testing, and resulted in a regression. Per report by Thom Brown.	2013-09-04 23:43:41 -07:00
Heikki Linnakangas	375d8526f2	Keep heavily-contended fields in XLogCtlInsert on different cache lines. Performance testing shows that if the insertpos_lck spinlock and the fields that it protects are on the same cache line with other variables that are frequently accessed, the false sharing can hurt performance a lot. Keep them apart by adding some padding.	2013-09-04 23:14:33 +03:00
Heikki Linnakangas	3619a20d33	Rename the "fast_promote" file to just "promote". This keeps the usual trigger file name unchanged from 9.2, avoiding nasty issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa. The fallback behavior of creating a full checkpoint before starting up is now triggered by a file called "fallback_promote". That can be useful for debugging purposes, but we don't expect any users to have to resort to that and we might want to remove that in the future, which is why the fallback mechanism is undocumented.	2013-08-19 20:59:51 +03:00
Peter Eisentraut	072457b360	Message punctuation and pluralization fixes	2013-08-09 08:02:44 -04:00
Peter Eisentraut	626092a2e1	Message style improvements	2013-07-28 07:01:13 -04:00
Heikki Linnakangas	107cbc90a7	Fix variable names mentioned in comment to match the code. Also, in another comment, explain why holding an insertion slot is a critical section. Per review by Amit Kapila.	2013-07-17 23:32:32 +03:00
Heikki Linnakangas	59c02a36f0	Fix assert failure at end of recovery, broken by XLogInsert scaling patch. Initialization of the first XLOG buffer at end-of-recovery was broken for the case that the last read WAL record ended at a page boundary. Instead of trying to copy the last full xlog page to the buffer cache in that case, just set shared state so that the next page is initialized when the first WAL record after startup is inserted. (that's what we did in earlier version, too) To make the shared state required for that case less surprising, replace the XLogCtl->curridx variable, which was the index of the latest initialized buffer, with an XLogRecPtr of how far the buffers have been initialized. That also allows us to get rid of the XLogRecEndPtrToBufIdx macro. While we're at it, make a similar change for XLogCtl->Write.curridx, getting rid of that variable and calculating the next buffer to write from XLogCtl->LogwrtResult instead.	2013-07-17 23:12:22 +03:00
Heikki Linnakangas	f489470f8a	Fix Windows build. Was broken by my xloginsert scaling patch. XLogCtl global variable needs to be initialized in each process, as it's not inherited by fork() on Windows.	2013-07-08 17:28:48 +03:00
Heikki Linnakangas	9a20a9b21b	Improve scalability of WAL insertions. This patch replaces WALInsertLock with a number of WAL insertion slots, allowing multiple backends to insert WAL records to the WAL buffers concurrently. This is particularly useful for parallel loading large amounts of data on a system with many CPUs. This has one user-visible change: switching to a new WAL segment with pg_switch_xlog() now fills the remaining unused portion of the segment with zeros. This potentially adds some overhead, but it has been a very common practice by DBA's to clear the "tail" of the segment with an external pg_clearxlogtail utility anyway, to make the WAL files compress better. With this patch, it's no longer necessary to do that. This patch adds a new GUC, xloginsert_slots, to tune the number of WAL insertion slots. Performance testing suggests that the default, 8, works pretty well for all kinds of worklods, but I left the GUC in place to allow others with different hardware to test that easily. We might want to remove that before release. Reviewed by Andres Freund.	2013-07-08 11:23:56 +03:00
Jeff Davis	5b571bb8c8	Handle posix_fallocate() errors. On some platforms, posix_fallocate() is available but may still return EINVAL if the underlying filesystem does not support it. So, in case of an error, fall through to the alternate implementation that just writes zeros. Per buildfarm failure and analysis by Tom Lane.	2013-07-06 13:46:04 -07:00
Jeff Davis	269e780822	Use posix_fallocate() for new WAL files, where available. This function is more efficient than actually writing out zeroes to the new file, per microbenchmarks by Jon Nelson. Also, it may reduce the likelihood of WAL file fragmentation. Jon Nelson, with review by Andres Freund, Greg Smith and me.	2013-07-05 12:30:29 -07:00
Robert Haas	6bc8ef0b7f	Add new GUC, max_worker_processes, limiting number of bgworkers. In 9.3, there's no particular limit on the number of bgworkers; instead, we just count up the number that are actually registered, and use that to set MaxBackends. However, that approach causes problems for Hot Standby, which needs both MaxBackends and the size of the lock table to be the same on the standby as on the master, yet it may not be desirable to run the same bgworkers in both places. 9.3 handles that by failing to notice the problem, which will probably work fine in nearly all cases anyway, but is not theoretically sound. A further problem with simply counting the number of registered workers is that new workers can't be registered without a postmaster restart. This is inconvenient for administrators, since bouncing the postmaster causes an interruption of service. Moreover, there are a number of applications for background processes where, by necessity, the background process must be started on the fly (e.g. parallel query). While this patch doesn't actually make it possible to register new background workers after startup time, it's a necessary prerequisite. Patch by me. Review by Michael Paquier.	2013-07-04 11:24:24 -04:00
Heikki Linnakangas	79ce29c734	Retry short writes when flushing WAL. We don't normally bother retrying when the number of bytes written by write() is short of what was requested. It is generally assumed that a write() to disk doesn't return short, unless you run out of disk space. While writing the WAL, however, it seems prudent to try a bit harder, because a failure leads to PANIC. The write() is also much larger than most write()s in the backend (up to wal_buffers), so there's more room for surprises. Also retry on EINTR. All signals used in the backend are flagged SA_RESTART nowadays, so it shouldn't happen, but better to be defensive.	2013-07-01 09:36:00 +03:00
Simon Riggs	1f09121b4e	Ensure no xid gaps during Hot Standby startup In some cases with higher numbers of subtransactions it was possible for us to incorrectly initialize subtrans leading to complaints of missing pages. Bug report by Sergey Konoplev Analysis and fix by Andres Freund	2013-06-23 11:05:02 +01:00
Jeff Davis	b8fd1a09f3	Add buffer_std flag to MarkBufferDirtyHint(). MarkBufferDirtyHint() writes WAL, and should know if it's got a standard buffer or not. Currently, the only callers where buffer_std is false are related to the FSM. In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive. Back-patch to 9.3.	2013-06-17 08:02:12 -07:00
Tom Lane	c62866eeaf	Remove special-case treatment of LOG severity level in standalone mode. elog.c has historically treated LOG messages as low-priority during bootstrap and standalone operation. This has led to confusion and even masked a bug, because the normal expectation of code authors is that elog(LOG) will put something into the postmaster log, and that wasn't happening during initdb. So get rid of the special-case rule and make the priority order the same as it is in normal operation. To keep from cluttering initdb's output and the behavior of a standalone backend, tweak the severity level of three messages routinely issued by xlog.c during startup and shutdown so that they won't appear in these cases. Per my proposal back in December.	2013-06-13 23:15:15 -04:00
Noah Misch	fb435f40d5	Observe array length in HaveVirtualXIDsDelayingChkpt(). Since commit `f21bb9cfb5`, this function ignores the caller-provided length and loops until it finds a terminator, which GetVirtualXIDsDelayingChkpt() never adds. Restore the previous loop control logic. In passing, revert the addition of an unused variable by the same commit, presumably a debugging relic.	2013-06-12 19:50:14 -04:00
Heikki Linnakangas	f73cb5567c	Fix typo in comment.	2013-06-06 18:27:01 +03:00
Heikki Linnakangas	e1e2bb34f1	Code review of recycling WAL segments in a restartpoint. Seems cleaner to get the currently-replayed TLI in the same call to GetXLogReplayRecPtr that we get the WAL position. Make it more clear in the comment what the code does when recovery has already ended (RecoveryInProgress() will set ThisTimeLineID in that case). Finally, make resetting ThisTimeLineID afterwards more explicit.	2013-06-03 09:25:12 +03:00
Stephen Frost	551938ae22	Post-pgindent cleanup Make slightly better decisions about indentation than what pgindent is capable of. Mostly breaking out long function calls into one line per argument, with a few other minor adjustments. No functional changes- all whitespace. pgindent ran cleanly (didn't change anything) after. Passes all regressions.	2013-06-01 09:38:15 -04:00
Bruce Momjian	9af4159fce	pgindent run for release 9.3 This is the first run of the Perl-based pgindent script. Also update pgindent instructions.	2013-05-29 16:58:43 -04:00
Simon Riggs	22a27ef113	After fast promotion use CHECKPOINT_FORCE Not necessary for correctness, just to make log_checkpoints output look less singular. Requested by Fujii Masao	2013-05-21 21:27:12 +01:00
Simon Riggs	75a192638f	Maintain ThisTimeLineID correctly in checkpointer checkpointer needs to reset ThisTimeLineID after a restartpoint to allow installing/recycling new WAL files. If recovery has already ended this would leave ThisTimeLineID set incorrectly and so we must reset it otherwise later checkpoints do not have the correct timeline. Bug report by Heikki Linnakangas. Further investigation by Heikki and myself.	2013-05-21 21:17:04 +01:00
Simon Riggs	d4337a0dcb	Init crash recovery using the latest available TLI This simplifies the handling of crashes after fast promotion and various minor cases that can exist in short timing windows around that case. Broad fix to bug reported by Michael Paquier on -hackers, approach prompted by Heikki Linnakangas	2013-05-19 17:31:07 +01:00
Simon Riggs	1781744cfc	Emit msg correctly for timeline-crossing crash	2013-05-19 17:00:18 +01:00
Simon Riggs	c94dff4c3c	Remove single space on end of a line in xlog.c Michael Paquier	2013-05-19 15:38:47 +01:00
Heikki Linnakangas	2ffa66f497	Fix walsender failure at promotion. If a standby server has a cascading standby server connected to it, it's possible that WAL has already been sent up to the next WAL page boundary, splitting a WAL record in the middle, when the first standby server is promoted. Don't throw an assertion failure or error in walsender if that happens. Also, fix a variant of the same bug in pg_receivexlog: if it had already received WAL on previous timeline up to a segment boundary, when the upstream standby server is promoted so that the timeline switch record falls on the previous segment, pg_receivexlog would miss the segment containing the timeline switch. To fix that, have walsender send the position of the timeline switch at end-of-streaming, in addition to the next timeline's ID. It was previously assumed that the switch happened exactly where the streaming stopped. Note: this is an incompatible change in the streaming protocol. You might get an error if you try to stream over timeline switches, if the client is running 9.3beta1 and the server is more recent. It should be fine after a reconnect, however. Reported by Fujii Masao.	2013-05-08 20:30:17 +03:00
Simon Riggs	443951748c	Record data_checksum_version in control file. The value is not used anywhere in code, but will allow future changes to the checksum version should that become necessary in the future.	2013-04-30 12:27:12 +01:00
Simon Riggs	2317a63328	Make fast promotion the default promotion mode. Continue to allow a request for synchronous checkpoints as a mechanism in case of problems.	2013-04-24 12:21:18 +01:00
Heikki Linnakangas	594041311c	Fix calculation of how many segments to retain for wal_keep_segments. KeepLogSeg function was broken when we switched to use a 64-bit int for the segment number. Per report from Jeff Janes.	2013-04-08 16:29:56 +03:00
Simon Riggs	5787c6730e	Skip extraneous locking in XLogCheckBuffer(). Heikki reported comment was wrong, so fixed code to match the comment: we only need to take additional locking precautions when we have a shared lock on the buffer.	2013-04-08 09:11:49 +01:00
Simon Riggs	47c4333189	Avoid tricky race condition recording XLOG_HINT We copy the buffer before inserting an XLOG_HINT to avoid WAL CRC errors caused by concurrent hint writes to buffer while share locked. To make this work we refactor RestoreBackupBlock() to allow an XLOG_HINT to avoid the normal path for backup blocks, which assumes the underlying buffer is exclusive locked. Resulting code completely changes layout of XLOG_HINT WAL records, but this isn't even beta code, so this is a low impact change. In passing, avoid taking WALInsertLock for full page writes on checksummed hints, remove related cruft from XLogInsert() and improve xlog_desc record for XLOG_HINT. Andres Freund Bug report by Fujii Masao, testing by Jeff Janes and Jaime Casanova, review by Jeff Davis and Simon Riggs. Applied with changes from review and some comment editing.	2013-04-08 08:52:39 +01:00
Tom Lane	ce9ab88981	Make REPLICATION privilege checks test current user not authenticated user. The pg_start_backup() and pg_stop_backup() functions checked the privileges of the initially-authenticated user rather than the current user, which is wrong. For example, a user-defined index function could successfully call these functions when executed by ANALYZE within autovacuum. This could allow an attacker with valid but low-privilege database access to interfere with creation of routine backups. Reported and fixed by Noah Misch. Security: CVE-2013-1901	2013-04-01 13:09:24 -04:00
Simon Riggs	593c39d156	Revoke `bc5334d867`	2013-03-28 09:18:02 +00:00
Simon Riggs	bc5334d867	Allow external recovery_config_directory If required, recovery.conf can now be located outside of the data directory. Server needs read/write permissions on this directory.	2013-03-27 11:45:42 +00:00
Simon Riggs	96ef3b8ff1	Allow I/O reliability checks using 16-bit checksums Checksums are set immediately prior to flush out of shared buffers and checked when pages are read in again. Hint bit setting will require full page write when block is dirtied, which causes various infrastructure changes. Extensive comments, docs and README. WARNING message thrown if checksum fails on non-all zeroes page; ERROR thrown but can be disabled with ignore_checksum_failure = on. Feature enabled by an initdb option, since transition from option off to option on is long and complex and has not yet been implemented. Default is not to use checksums. Checksum used is WAL CRC-32 truncated to 16-bits. Simon Riggs, Jeff Davis, Greg Smith Wide input and assistance from many community members. Thank you.	2013-03-22 13:54:07 +00:00
Simon Riggs	bb7cc2623f	Remove PageSetTLI and rename pd_tli to pd_checksum Remove use of PageSetTLI() from all page manipulation functions and adjust README to indicate change in the way we make changes to pages. Repurpose those bytes into the pd_checksum field and explain how that works in comments about page header. Refactoring ahead of actual feature patch which would make use of the checksum field, arriving later. Jeff Davis, with comments and doc changes by Simon Riggs Direction suggested by Robert Haas; many others providing review comments.	2013-03-18 13:46:42 +00:00
Tom Lane	da5aeccf64	Move pqsignal() to libpgport. We had two copies of this function in the backend and libpq, which was already pretty bogus, but it turns out that we need it in some other programs that don't use libpq (such as pg_test_fsync). So put it where it probably should have been all along. The signal-mask-initialization support in src/backend/libpq/pqsignal.c stays where it is, though, since we only need that in the backend.	2013-03-17 12:06:42 -04:00
Heikki Linnakangas	7ccefe8610	Fix tli history file fetching, broken by the archive after crash recevery patch. If we were about to enter archive recovery after crash recovery, we scanned the archive for the latest tli history file, and set the recovery target timeline to that. However, when we actually tried to read the history file, we would not fetch the file from the archive, because we were not in archive recovery yet. To fix, make readTimeLineHistory and existsTimeLineHistory to always fetch the file from archive if archive recovery is requested, even if we're not in archive recovery yet. Backpatch to 9.2. Mitsumasa KONDO	2013-03-07 12:33:24 +02:00
Heikki Linnakangas	6c4f6664b2	Fix thinko in previous commit. We must still initialize minRecoveryPoint if we start straight with archive recovery, e.g when recovering from a normal base backup taken with pg_start/stop_backup. Otherwise we never consider the system consistent.	2013-02-22 13:12:43 +02:00
Heikki Linnakangas	abf5c5c9a4	If recovery.conf is created after "pg_ctl stop -m i", do crash recovery. If you create a base backup using an atomic filesystem snapshot, and try to perform PITR starting from that base backup, or if you just kill a master server and create recovery.conf to put it into standby mode, we don't know how far we need to recover before reaching consistency. Normally in crash recovery, we replay all the WAL present in pg_xlog, and assume that we're consistent after that. And normally in archive recovery, minRecoveryPoint, backupEndRequired, or backupEndPoint is set in the control file, indicating how far we need to replay to reach consistency. But if the server was previously up and running normally, and you kill -9 it or take an atomic filesystem snapshot, none of those fields are set in the control file. The solution is to perform crash recovery first, replaying all the WAL in pg_xlog. After that's done, we assume that the system is consistent like in normal crash recovery, and switch to archive recovery mode after that. Per report from Kyotaro HORIGUCHI. In his scenario, recovery.conf was created after "pg_ctl stop -m i". I'm not sure we need to support that exact scenario, but we should support backing up using a filesystem snapshot, which looks identical. This issue goes back to at least 9.0, where hot standby was introduced and we started to track when consistency is reached. In 9.1 and 9.2, we would open up for hot standby too early, and queries could briefly see an inconsistent state. But 9.2 made it more visible, as we started to PANIC if we see a reference to a non-existing page during recovery, if we've already reached consistency. This is a fairly big patch, so back-patch to 9.2 only, where the issue is more visible. We can consider back-patching further after this has received some more testing in 9.2 and master.	2013-02-22 12:32:41 +02:00
Heikki Linnakangas	1bd42cd70a	Better fix for "unarchived WAL files get deleted on crash recovery" bug. Revert my earlier fix for the bug that unarchived WAL files get deleted on crash recovery, commit `c9cc7e05c6`. We create a .done file for files streamed or restored from archive, so the WAL file recycling logic used during normal operation works just as well during archive recovery. Per Fujii Masao's suggestion.	2013-02-15 19:33:31 +02:00
Heikki Linnakangas	c9cc7e05c6	Don't delete unarchived WAL files during crash recovery. Bug reported by Jehan-Guillaume (ioguix) de Rorthais. This was introduced with the change to keep WAL files restored from archive in pg_xlog, in 9.2.	2013-02-15 17:43:59 +02:00
Heikki Linnakangas	62401db45c	Support unlogged GiST index. The reason this wasn't supported before was that GiST indexes need an increasing sequence to detect concurrent page-splits. In a regular WAL- logged GiST index, the LSN of the page-split record is used for that purpose, and in a temporary index, we can get away with a backend-local counter. Neither of those methods works for an unlogged relation. To provide such an increasing sequence of numbers, create a "fake LSN" counter that is saved and restored across shutdowns. On recovery, unlogged relations are blown away, so the counter doesn't need to survive that either. Jeevan Chalke, based on discussions with Robert Haas, Tom Lane and me.	2013-02-11 23:07:09 +02:00
Heikki Linnakangas	b669f416ce	Fix checkpoint after fast promotion. The intention was to request a regular online checkpoint immediately after end of recovery, when performing "fast promotion". However, because the checkpoint was requested before other backends were allowed to write WAL, the checkpointer process performed a restartpoint rather than a checkpoint. Delay the RequestCheckPoint call until after recovery has truly ended, so that you get a real checkpoint.	2013-02-11 22:22:08 +02:00
Heikki Linnakangas	7803e9327d	Include previous TLI in end-of-recovery and shutdown checkpoint records. This isn't used for anything but a sanity check at the moment, but it could be highly valuable for debugging purposes. It could also be used to recreate timeline history by traversing WAL, which seems useful.	2013-02-11 18:16:25 +02:00
Simon Riggs	072521b8c8	Rely only on checkpoint 1 at end of recovery. Searching for checkpoint 2 (previous) is not correct in all cases. Bug report from Heikki Linnakangas	2013-02-07 16:33:05 +00:00
Simon Riggs	3f0ab05233	Switch timelines if we crash soon after promotion. Previous patch to skip checkpoints at end of recovery didn't correctly perform crash recovery, fumbling the timeline switch. Now we record the minRecoveryPointTLI of the newly selected timeline, so that we crash recover to the correct timeline. Bug report from Fujii Masao, investigated by me.	2013-01-31 19:29:32 +00:00
Simon Riggs	fd4ced5230	Fast promote mode skips checkpoint at end of recovery. pg_ctl promote -m fast will skip the checkpoint at end of recovery so that we can achieve very fast failover when the apply delay is low. Write new WAL record XLOG_END_OF_RECOVERY to allow us to switch timeline correctly for downstream log readers. If we skip synchronous end of recovery checkpoint we request a normal spread checkpoint so that the window of re-recovery is low. Simon Riggs and Kyotaro Horiguchi, with input from Fujii Masao. Review by Heikki Linnakangas	2013-01-29 00:06:15 +00:00
Alvaro Herrera	0ac5ad5134	Improve concurrency of foreign key locking This patch introduces two additional lock modes for tuples: "SELECT FOR KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each other, in contrast with already existing "SELECT FOR SHARE" and "SELECT FOR UPDATE". UPDATE commands that do not modify the values stored in the columns that are part of the key of the tuple now grab a SELECT FOR NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently with tuple locks of the FOR KEY SHARE variety. Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this means the concurrency improvement applies to them, which is the whole point of this patch. The added tuple lock semantics require some rejiggering of the multixact module, so that the locking level that each transaction is holding can be stored alongside its Xid. Also, multixacts now need to persist across server restarts and crashes, because they can now represent not only tuple locks, but also tuple updates. This means we need more careful tracking of lifetime of pg_multixact SLRU files; since they now persist longer, we require more infrastructure to figure out when they can be removed. pg_upgrade also needs to be careful to copy pg_multixact files over from the old server to the new, or at least part of multixact.c state, depending on the versions of the old and new servers. Tuple time qualification rules (HeapTupleSatisfies routines) need to be careful not to consider tuples with the "is multi" infomask bit set as being only locked; they might need to look up MultiXact values (i.e. possibly do pg_multixact I/O) to find out the Xid that updated a tuple, whereas they previously were assured to only use information readily available from the tuple header. This is considered acceptable, because the extra I/O would involve cases that would previously cause some commands to block waiting for concurrent transactions to finish. Another important change is the fact that locking tuples that have previously been updated causes the future versions to be marked as locked, too; this is essential for correctness of foreign key checks. This causes additional WAL-logging, also (there was previously a single WAL record for a locked tuple; now there are as many as updated copies of the tuple there exist.) With all this in place, contention related to tuples being checked by foreign key rules should be much reduced. As a bonus, the old behavior that a subtransaction grabbing a stronger tuple lock than the parent (sub)transaction held on a given tuple and later aborting caused the weaker lock to be lost, has been fixed. Many new spec files were added for isolation tester framework, to ensure overall behavior is sane. There's probably room for several more tests. There were several reviewers of this patch; in particular, Noah Misch and Andres Freund spent considerable time in it. Original idea for the patch came from Simon Riggs, after a problem report by Joel Jacobson. Most code is from me, with contributions from Marti Raudsepp, Alexander Shulgin, Noah Misch and Andres Freund. This patch was discussed in several pgsql-hackers threads; the most important start at the following message-ids: AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com 1290721684-sup-3951@alvh.no-ip.org 1294953201-sup-2099@alvh.no-ip.org 1320343602-sup-2290@alvh.no-ip.org 1339690386-sup-8927@alvh.no-ip.org 4FE5FF020200002500048A3D@gw.wicourts.gov 4FEAB90A0200002500048B7D@gw.wicourts.gov	2013-01-23 12:04:59 -03:00
Heikki Linnakangas	990fe3c4ed	Fix more issues with cascading replication and timeline switches. When a standby server follows the master using WAL archive, and it chooses a new timeline (recovery_target_timeline='latest'), it only fetches the timeline history file for the chosen target timeline, not any other history files that might be missing from pg_xlog. For example, if the current timeline is 2, and we choose 4 as the new recovery target timeline, the history file for timeline 3 is not fetched, even if it's part of this server's history. That's enough for the standby itself - the history file for timeline 4 includes timeline 3 as well - but if a cascading standby server wants to recover to timeline 3, it needs the history file. To fix, when a new recovery target timeline is chosen, try to copy any missing history files from the archive to pg_xlog between the old and new target timeline. A second similar issue was with the WAL files. When a standby recovers from archive, and it reaches a segment that contains a switch to a new timeline, recovery fetches only the WAL file labelled with the new timeline's ID. The file from the new timeline contains a copy of the WAL from the old timeline up to the point where the switch happened, and recovery recovers it from the new file. But in streaming replication, walsender only tries to read it from the old timeline's file. To fix, change walsender to read it from the new file, so that it behaves the same as recovery in that sense, and doesn't try to open the possibly nonexistent file with the old timeline's ID.	2013-01-23 10:19:20 +02:00
Alvaro Herrera	8c17144c75	Fix off-by-one bug in xlog reading logic Bug reported by Michael Paquier Author: Andres Freund	2013-01-18 11:19:53 -03:00
Heikki Linnakangas	2ff6555313	Use the right timeline when beginning to stream from master. The xlogreader refactoring broke the logic to decide which timeline to start streaming from. XLogPageRead() uses the timeline history to check which timeline the requested WAL position falls into. However, after the refactoring, XLogPageRead() is always first called with the first page in the segment, to verify the segment header, and only then with the actual WAL position we're interested in. That first read of the segment's header made XLogPageRead() to always start streaming from the old timeline containing the segment header, not the timeline containing the actual record, if there was a timeline switch within the segment. I thought I fixed this yesterday, but that fix was too narrow and only fixed this for the corner-case that the timeline switch happened in the first page of the segment. To fix this more robustly, pass explicitly the position of the record we're actually interested in to XLogPageRead, and use that to decide which timeline to read from, rather than deduce it from the page and offset. Per report from Fujii Masao.	2013-01-18 11:46:49 +02:00
Heikki Linnakangas	0b6329130e	Make pg_receivexlog and pg_basebackup -X stream work across timeline switches. This mirrors the changes done earlier to the server in standby mode. When receivelog reaches the end of a timeline, as reported by the server, it fetches the timeline history file of the next timeline, and restarts streaming from the new timeline by issuing a new START_STREAMING command. When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the last segment on the old timeline. This helps you to tell apart a partial segment left in the directory because of a timeline switch, and a completed segment. If you just follow a single server, it won't make a difference, but it can be significant in more complicated scenarios where new WAL is still generated on the old timeline. This includes two small changes to the streaming replication protocol: First, when you reach the end of timeline while streaming, the server now sends the TLI of the next timeline in the server's history to the client. pg_receivexlog uses that as the next timeline, so that it doesn't need to parse the timeline history file like a standby server does. Second, when BASE_BACKUP command sends the begin and end WAL positions, it now also sends the timeline IDs corresponding the positions.	2013-01-17 20:23:00 +02:00
Heikki Linnakangas	1296d5c53c	Fix a couple of error-handling bugs in the xlogreader patch. XLogReadRecord should reset its state on every error, to make sure it re-reads the page on next call. It was inconsistent in that some errors did that, but some did not. In ReadRecord(), don't give up on an error if we're in standby mode. The loop was set up to retry, but the checks within the loop broke out of the loop on any error. Andres Freund, with some tweaking by me.	2013-01-17 19:27:04 +02:00
Alvaro Herrera	7fcbf6a405	Split out XLog reading as an independent facility This new facility can not only be used by xlog.c to carry out crash recovery, but also by external programs. By supplying a function to read XLog pages from somewhere, all the WAL reading can be used for completely different purposes. For the standard backend use, the behavior should be pretty much the same as previously. As for non-backend programs, an hypothetical pg_xlogdump program is now closer to reality, but some more backend support is still necessary. This patch was originally submitted by Andres Freund in a different form, but Heikki Linnakangas opted for and authored another design of the concept. Andres has advanced the patch since Heikki's initial version. Review and some (mostly cosmetics) changes by me.	2013-01-16 16:12:53 -03:00
Heikki Linnakangas	b0daba57bb	Tolerate timeline switches while "pg_basebackup -X fetch" is running. If you take a base backup from a standby server with "pg_basebackup -X fetch", and the timeline switches while the backup is being taken, the backup used to fail with an error "requested WAL segment %s has already been removed". This is because the server-side code that sends over the required WAL files would not construct the WAL filename with the correct timeline after a switch. Fix that by using readdir() to scan pg_xlog for all the WAL segments in the range, regardless of timeline. Also, include all timeline history files in the backup, if taken with "-X fetch". That fixes another related bug: If a timeline switch happened just before the backup was initiated in a standby, the WAL segment containing the initial checkpoint record contains WAL from the older timeline too. Recovery will not accept that without a timeline history file that lists the older timeline. Backpatch to 9.2. Versions prior to that were not affected as you could not take a base backup from a standby before 9.2.	2013-01-03 19:51:00 +02:00
Heikki Linnakangas	ee994272ca	Delay reading timeline history file until it's fetched from master. Streaming replication can fetch any missing timeline history files from the master, but recovery would read the timeline history file for the target timeline before reading the checkpoint record, and before walreceiver has had a chance to fetch it from the master. Delay reading it, and the sanity checks involving timeline history, until after reading the checkpoint record. There is at least one scenario where this makes a difference: if you take a base backup from a standby server right after a timeline switch, the WAL segment containing the initial checkpoint record will begin with an older timeline ID. Without the timeline history file, recovering that file will fail as the older timeline ID is not recognized to be an ancestor of the target timeline. If you try to recover from such a backup, using only streaming replication to fetch the WAL, this patch is required for that to work.	2013-01-03 10:41:58 +02:00
Heikki Linnakangas	d194d7a526	Fix bug in streaming replication over multiple tli switches. After receiving some WAL over streaming replication, try to open the file from the timeline we're currently recieving, not recoveryTargetTLI. They are usually the same, which is why wasn't noticed before, but you'd get an error if there have been more than one timeline switch between the current point in WAL and the recovery target.	2013-01-02 14:35:15 +02:00
Heikki Linnakangas	4ffd589f44	Fix silly typo in code, which broke the check for reaching consistency.	2013-01-02 13:44:59 +02:00
Bruce Momjian	bd61a623ac	Update copyrights for 2013 Fully update git head, and update back branches in ./COPYRIGHT and legal.sgml files.	2013-01-01 17:15:01 -05:00
Heikki Linnakangas	60df192aea	Keep timeline history files restored from archive in pg_xlog. The cascading standby patch in 9.2 changed the way WAL files are treated when restored from the archive. Before, they were restored under a temporary filename, and not kept in pg_xlog, but after the patch, they were copied under pg_xlog. This is necessary for a cascading standby to find them, but it also means that if the archive goes offline and a standby is restarted, it can recover back to where it was using the files in pg_xlog. It also means that if you take an offline backup from a standby server, it includes all the required WAL files in pg_xlog. However, the same change was not made to timeline history files, so if the WAL segment containing the checkpoint record contains a timeline switch, you will still get an error if you try to restart recovery without the archive, or recover from an offline backup taken from the standby. With this patch, timeline history files restored from archive are copied into pg_xlog like WAL files are, so that pg_xlog contains all the files required to recover. This is a corner-case pre-existing issue in 9.2, but even more important in master where it's possible for a standby to follow a timeline switch through streaming replication. To make that possible, the timeline history files must be present in pg_xlog.	2012-12-30 14:29:45 +02:00
Alvaro Herrera	5ab3af46dd	Remove obsolete XLogRecPtr macros This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance. These were useful for brevity when XLogRecPtrs were split in xlogid/xrecoff; but now that they are simple uint64's, they are just clutter. The only downside to making this change would be ease of backporting patches, but that has been negated by other substantive changes to the involved code anyway. The clarity of simpler expressions makes the change worthwhile. Most of the changes are mechanical, but in a couple of places, the patch author chose to invert the operator sense, making the code flow more logical (and more in line with preceding comments). Author: Andres Freund Eyeballed by Dimitri Fontaine and Alvaro Herrera	2012-12-28 13:06:15 -03:00
Alvaro Herrera	24eca7977e	Assign InvalidXLogRecPtr instead of MemSet(0) For consistency. Author: Andres Freund	2012-12-27 18:33:03 -03:00
Peter Eisentraut	a0bfb7b36e	Fix grammatical mistake in error message	2012-12-20 23:36:13 -05:00
Heikki Linnakangas	343ee00b73	Fix recycling of WAL segments after switching timeline during recovery. This was broken before, we would recycle old WAL segments on wrong timeline after the recovery target timeline had changed, but my recent commit to not initialize ThisTimeLineID at all in a standby's checkpointer process broke this completely. The problem is that when installing a recycled WAL segment as a future one, ThisTimeLineID is used to construct the filename. To fix, always update ThisTimeLineID to the current timeline being recovered, before recycling WAL segments at a restartpoint. This still leaves a small window where we might install WAL segments under wrong timeline ID, if the timeline is changed just as we're about to start recycling. Also, even if we're replaying timeline X at the momnent, there's no guarantee that we'll need as many WAL segments on that timeline as we recycle. We might be just about to reach the point where we switch to next timeline, so might only need one more WAL segment on the current timeline. We'll live with the waste in that situation. Bug pointed out by Fujii Masao. 9.1 and 9.2 had the same issue, when recovery target timeline was changed, but I committed a slightly different version of this patch on those branches.	2012-12-20 22:00:58 +02:00
Heikki Linnakangas	af275a12df	Follow TLI of last replayed record, not recovery target TLI, in walsenders. Most of the time, the last replayed record comes from the recovery target timeline, but there is a corner case where it makes a difference. When the startup process scans for a new timeline, and decides to change recovery target timeline, there is a window where the recovery target TLI has already been bumped, but there are no WAL segments from the new timeline in pg_xlog yet. For example, if we have just replayed up to point 0/30002D8, on timeline 1, there is a WAL file called 000000010000000000000003 in pg_xlog that contains the WAL up to that point. When recovery switches recovery target timeline to 2, a walsender can immediately try to read WAL from 0/30002D8, from timeline 2, so it will try to open WAL file 000000020000000000000003. However, that doesn't exist yet - the startup process hasn't copied that file from the archive yet nor has the walreceiver streamed it yet, so walsender fails with error "requested WAL segment 000000020000000000000003 has already been removed". That's harmless, in that the standby will try to reconnect later and by that time the segment is already created, but error messages that should be ignored are not good. To fix that, have walsender track the TLI of the last replayed record, instead of the recovery target timeline. That way walsender will not try to read anything from timeline 2, until the WAL segment has been created and at least one record has been replayed from it. The recovery target timeline is now xlog.c's internal affair, it doesn't need to be exposed in shared memory anymore. This fixes the error reported by Thom Brown. depesz the same error message, but I'm not sure if this fixes his scenario.	2012-12-20 14:39:04 +02:00
Heikki Linnakangas	1a11d4609e	Don't set ThisTimeLineID in checkpointer & bgwriter during recovery. We used to set it to the current recovery target timeline, but the recovery target timeline can change during recovery, leaving ThisTimeLineID at an old value. That seems worse than always leaving it at zero to begin with. AFAICS there was no good reason to set it in the first place. ThisTimeLineID is not needed in checkpointer or bgwriter process, until it's time to write the end-of-recovery checkpoint, and at that point ThisTimeLineID is updated anyway.	2012-12-20 14:39:04 +02:00
Heikki Linnakangas	e43f947bf3	Check if we've reached end-of-backup point also if no redo is required. If you restored from a backup taken from a standby, and the last record in the backup is the checkpoint record, ie. there is no redo required except for the checkpoint record, we would fail to notice that we've reached the end-of-backup point, and the database is consistent. The result was an error "WAL ends before end of online backup". To fix, move the have-we-reached-end-of-backup check into CheckRecoveryConsistency(), which is already responsible for similar checks with minRecoveryPoint, and is called in the right places. Backpatch to 9.2, this check and bug did not exist before that.	2012-12-19 14:22:00 +02:00
Heikki Linnakangas	abfd192b1b	Allow a streaming replication standby to follow a timeline switch. Before this patch, streaming replication would refuse to start replicating if the timeline in the primary doesn't exactly match the standby. The situation where it doesn't match is when you have a master, and two standbys, and you promote one of the standbys to become new master. Promoting bumps up the timeline ID, and after that bump, the other standby would refuse to continue. There's significantly more timeline related logic in streaming replication now. First of all, when a standby connects to primary, it will ask the primary for any timeline history files that are missing from the standby. The missing files are sent using a new replication command TIMELINE_HISTORY, and stored in standby's pg_xlog directory. Using the timeline history files, the standby can follow the latest timeline present in the primary (recovery_target_timeline='latest'), just as it can follow new timelines appearing in an archive directory. START_REPLICATION now takes a TIMELINE parameter, to specify exactly which timeline to stream WAL from. This allows the standby to request the primary to send over WAL that precedes the promotion. The replication protocol is changed slightly (in a backwards-compatible way although there's little hope of streaming replication working across major versions anyway), to allow replication to stop when the end of timeline reached, putting the walsender back into accepting a replication command. Many thanks to Amit Kapila for testing and reviewing various versions of this patch.	2012-12-13 19:17:32 +02:00
Heikki Linnakangas	970fb12de1	Consistency check should compare last record replayed, not last record read. EndRecPtr is the last record that we've read, but not necessarily yet replayed. CheckRecoveryConsistency should compare minRecoveryPoint with the last replayed record instead. This caused recovery to think it's reached consistency too early. Now that we do the check in CheckRecoveryConsistency correctly, we have to move the call of that function to after redoing a record. The current place, after reading a record but before replaying it, is wrong. In particular, if there are no more records after the one ending at minRecoveryPoint, we don't enter hot standby until one extra record is generated and read by the standby, and CheckRecoveryConsistency is called. These two bugs conspired to make the code appear to work correctly, except for the small window between reading the last record that reaches minRecoveryPoint, and replaying it. In the passing, rename recoveryLastRecPtr, which is the last record replayed, to lastReplayedEndRecPtr. This makes it slightly less confusing with replayEndRecPtr, which is the last record read that we're about to replay. Original report from Kyotaro HORIGUCHI, further diagnosis by Fujii Masao. Backpatch to 9.0, where Hot Standby subtly changed the test from "minRecoveryPoint < EndRecPtr" to "minRecoveryPoint <= EndRecPtr". The former works because where the test is performed, we have always read one more record than we've replayed.	2012-12-11 18:54:02 +02:00
Heikki Linnakangas	6be799664a	Fix the tracking of min recovery point timeline. Forgot to update it at the right place. Also, consider checkpoint record that switches to new timelne to be on the new timeline. This fixes erroneous "requested timeline 2 does not contain minimum recovery point" errors, pointed out by Amit Kapila while testing another patch.	2012-12-10 16:04:26 +02:00
Tom Lane	af4aba2f05	Ensure recovery pause feature doesn't pause unless users can connect. If we're not in hot standby mode, then there's no way for users to connect to reset the recoveryPause flag, so we shouldn't pause. The code was aware of this but the test to see if pausing was safe was seriously inadequate: it wasn't paying attention to reachedConsistency, and besides what it was testing was that we could legally enter hot standby, not that we have done so. Get rid of that in favor of checking LocalHotStandbyActive, which because of the coding in CheckRecoveryConsistency is tantamount to checking that we have told the postmaster to enter hot standby. Also, move the recoveryPausesHere() call that reacts to asynchronous recoveryPause requests so that it's not in the middle of application of a WAL record. I put it next to the recoveryStopsHere() call --- in future those are going to need to interact significantly, so this seems like a good waystation. Also, don't bother trying to read another WAL record if we've already decided not to continue recovery. This was no big deal when the code was written originally, but now that reading a record might entail actions like fetching an archive file, it seems a bit silly to do it like that. Per report from Jeff Janes and subsequent discussion. The pause feature needs quite a lot more work, but this gets rid of some indisputable bugs, and seems safe enough to back-patch.	2012-12-05 18:27:50 -05:00
Heikki Linnakangas	d67b06fe3e	Oops, meant to change the comment in writeTimeLineHistory.	2012-12-05 21:00:59 +02:00
Simon Riggs	6aa2e49a87	Must not reach consistency before XLOG_BACKUP_RECORD When waiting for an XLOG_BACKUP_RECORD the minRecoveryPoint will be incorrect, so we must not declare recovery as consistent before we have seen the record. Major bug allowing recovery to end too early in some cases, allowing people to see inconsistent db. This patch to HEAD and 9.2, other fix required for 9.1 and 9.0 Simon Riggs and Andres Freund, bug report by Jeff Janes	2012-12-05 13:28:03 +00:00
Heikki Linnakangas	90991c40eb	Downgrade a status message from LOG to DEBUG2. I never intended this to be anything other than a debugging aid, but forgot to change the level before committing.	2012-12-04 17:29:44 +02:00
Heikki Linnakangas	32f4de0adf	Write exact xlog position of timeline switch in the timeline history file. This allows us to do some more rigorous sanity checking for various incorrect point-in-time recovery scenarios, and provides more information for debugging purposes. It will also come handy in the upcoming patch to allow timeline switches to be replicated by streaming replication.	2012-12-04 17:29:07 +02:00
Heikki Linnakangas	5ce108bf32	Track the timeline associated with minRecoveryPoint, for more sanity checks. This allows recovery to notice certain incorrect recovery scenarios. If a server has recovered to point X on timeline 5, and you restart recovery, it better be on timeline 5 when it reaches point X again, not on some timeline with a higher ID. This can happen e.g if you a standby server is shut down, a new timeline appears in the WAL archive, and the standby server is restarted. It will try to follow the new timeline, which is wrong because some WAL on the old timeline was already replayed before shutdown. Requires an initdb (or at least pg_resetxlog), because this adds a field to the control file.	2012-12-04 11:31:00 +02:00
Andrew Dunstan	d5652e50d5	Attempt to unbreak MSVC builds broken by `f21bb9cfb5`. We can't use type uint, so use uint32.	2012-12-03 10:23:22 -05:00
Simon Riggs	f21bb9cfb5	Refactor inCommit flag into generic delayChkpt flag. Rename PGXACT->inCommit flag into delayChkpt flag, and generalise comments to allow use in other situations, such as the forthcoming potential use in checksum patch. Replace wait loop to look for VXIDs with delayChkpt set. No user visible changes, not behaviour changes at present. Simon Riggs, reviewed and rebased by Jeff Davis	2012-12-03 13:13:53 +00:00
Simon Riggs	7a764990d8	Clarify locking for PageGetLSN() in XLogCheckBuffer()	2012-12-03 12:20:31 +00:00
Heikki Linnakangas	a068c391ab	Refactor the code implementing standby-mode logic. It is now easier to see that it's a state machine, making the code easier to understand overall.	2012-12-03 12:32:44 +02:00
Tom Lane	3114cb60a1	Don't advance checkPoint.nextXid near the end of a checkpoint sequence. This reverts commit `c11130690d` in favor of actually fixing the problem: namely, that we should never have been modifying the checkpoint record's nextXid at this point to begin with. The nextXid should match the state as of the checkpoint's logical WAL position (ie the redo point), not the state as of its physical position. It's especially bogus to advance it in some wal_levels and not others. In any case there is no need for the checkpoint record to carry the same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by LogStandbySnapshot, as any replay operation will already have adopted that value as current. This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug #6291 from Daniel Farina, in that if a checkpoint were in progress at the instant of XID wraparound, the epoch bump would be lost as reported. (And, of course, these days there's at least a 50-50 chance of a checkpoint being in progress at any given instant.) Diagnosed by me and independently by Andres Freund. Back-patch to all branches supporting hot standby.	2012-12-02 15:20:41 -05:00
Simon Riggs	5c11725867	Rearrange storage of data in xl_running_xacts. Previously we stored all xids mixed together. Now we store top-level xids first, followed by all subxids. Also skip logging any subxids if the snapshot is suboverflowed, since there are potentially large numbers of them and they are not useful in that case anyway. Has value in the envisaged design for decoding of WAL. No planned effect on Hot Standby. Andres Freund, reviewed by me	2012-12-02 19:39:37 +00:00
Simon Riggs	c11130690d	XidEpoch++ if wraparound during checkpoint. If wal_level = hot_standby we update the checkpoint nextxid, though in the case where a wraparound occurred half-way through a checkpoint we would neglect updating the epoch also. Updating the nextxid is arguably the wrong thing to do, but changing that may introduce subtle bugs into hot standby startup, while updating the value doesn't cause any known bugs yet. Minimal fix now to HEAD and backbranches, wider fix later in HEAD. Bug reported in #6291 by Daniel Farina and slightly differently in Cause analysis and recommended fixes from Tom Lane and Andres Freund. Applied patch is minimal version of Andres Freund's work.	2012-12-02 14:57:44 +00:00
Simon Riggs	9f98704b82	Clarify operation of online checkpoints. Previous comments left, but were too obscure for such an important aspect of the system.	2012-12-02 13:09:55 +00:00
Alvaro Herrera	1577b46b7c	Split out rmgr rm_desc functions into their own files This is necessary (but not sufficient) to have them compilable outside of a backend environment.	2012-11-28 13:01:15 -03:00
Heikki Linnakangas	dd7353dde8	If we don't have a backup-end-location, don't claim we've reached it. This was apparently a typo, which caused recovery to think that it immediately reached the end of backup, and allowed the database to start up too early. Reported by Jeff Janes. Backpatch to 9.2, where this code was introduced.	2012-11-28 15:14:27 +02:00
Heikki Linnakangas	1f67078ea3	Add OpenTransientFile, with automatic cleanup at end-of-xact. Files opened with BasicOpenFile or PathNameOpenFile are not automatically cleaned up on error. That puts unnecessary burden on callers that only want to keep the file open for a short time. There is AllocateFile, but that returns a buffered FILE * stream, which in many cases is not the nicest API to work with. So add function called OpenTransientFile, which returns a unbuffered fd that's cleaned up like the FILE* returned by AllocateFile(). This plugs a few rare fd leaks in error cases: 1. copy_file() - fixed by by using OpenTransientFile instead of BasicOpenFile 2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't use OpenTransientFile here because the fd is supposed to persist over transaction boundaries. 3. lo_import/lo_export - fixed by using OpenTransientFile instead of PathNameOpenFile. In addition to plugging those leaks, this replaces many BasicOpenFile() calls with OpenTransientFile() that were not leaking, because the code meticulously closed the file on error. That wasn't strictly necessary, but IMHO it's good for robustness. The same leaks exist in older versions, but given the rarity of the issues, I'm not backpatching this. Not yet, anyway - it might be good to backpatch later, after this mechanism has had some more testing in master branch.	2012-11-27 10:25:50 +02:00
Heikki Linnakangas	24c19e6bf9	Avoid bogus "out-of-sequence timeline ID" errors in standby-mode. When startup process opens a WAL segment after replaying part of it, it validates the first page on the WAL segment, even though the page it's really interested in later in the file. As part of the validation, it checks that the TLI on the page header is >= the TLI it saw on the last page it read. If the segment contains a timeline switch, and we have already replayed it, and then re-open the WAL segment (because of streaming replication got disconnected and reconnected, for example), the TLI check will fail when the first page is validated. Fix that by relaxing the TLI check when re-opening a WAL segment. Backpatch to 9.0. Earlier versions had the same code, but before standby mode was introduced in 9.0, recovery never tried to re-read a segment after partially replaying it. Reported by Amit Kapila, while testing a new feature.	2012-11-22 11:44:44 +02:00
Heikki Linnakangas	644a0a6379	Fix archive_cleanup_command. When I moved ExecuteRecoveryCommand() from xlog.c to xlogarchive.c, I didn't realize that it's called from the checkpoint process, not the startup process. I tried to use InRedo variable to decide whether or not to attempt cleaning up the archive (must not do so before we have read the initial checkpoint record), but that variable is only valid within the startup process. Instead, let ExecuteRecoveryCommand() always clean up the archive, and add an explicit argument to RestoreArchivedFile() to say whether that's allowed or not. The caller knows better. Reported by Erik Rijkers, diagnosis by Fujii Masao. Only 9.3devel is affected.	2012-11-19 10:14:20 +02:00
Tom Lane	3bbf668de9	Fix multiple problems in WAL replay. Most of the replay functions for WAL record types that modify more than one page failed to ensure that those pages were locked correctly to ensure that concurrent queries could not see inconsistent page states. This is a hangover from coding decisions made long before Hot Standby was added, when it was hardly necessary to acquire buffer locks during WAL replay at all, let alone hold them for carefully-chosen periods. The key problem was that RestoreBkpBlocks was written to hold lock on each page restored from a full-page image for only as long as it took to update that page. This was guaranteed to break any WAL replay function in which there was any update-ordering constraint between pages, because even if the nominal order of the pages is the right one, any mixture of full-page and non-full-page updates in the same record would result in out-of-order updates. Moreover, it wouldn't work for situations where there's a requirement to maintain lock on one page while updating another. Failure to honor an update ordering constraint in this way is thought to be the cause of bug #7648 from Daniel Farina: what seems to have happened there is that a btree page being split was rewritten from a full-page image before the new right sibling page was written, and because lock on the original page was not maintained it was possible for hot standby queries to try to traverse the page's right-link to the not-yet-existing sibling page. To fix, get rid of RestoreBkpBlocks as such, and instead create a new function RestoreBackupBlock that restores just one full-page image at a time. This function can be invoked by WAL replay functions at the points where they would otherwise perform non-full-page updates; in this way, the physical order of page updates remains the same no matter which pages are replaced by full-page images. We can then further adjust the logic in individual replay functions if it is necessary to hold buffer locks for overlapping periods. A side benefit is that we can simplify the handling of concurrency conflict resolution by moving that code into the record-type-specfic functions; there's no more need to contort the code layout to keep conflict resolution in front of the RestoreBkpBlocks call. In connection with that, standardize on zero-based numbering rather than one-based numbering for referencing the full-page images. In HEAD, I removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are still there in the header files in previous branches, but are no longer used by the code. In addition, fix some other bugs identified in the course of making these changes: spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple and we're not doing a full-page image: it would get fooled by the LSN having been advanced already. This would result in permanent index corruption, not just transient failure of concurrent queries. Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. heap_xlog_freeze() was inconsistent about using a cleanup lock or plain exclusive lock: it did the former in the normal path but the latter for a full-page image. A plain exclusive lock seems sufficient, so change to that. Also, remove gistRedoPageDeleteRecord(), which has been dead code since VACUUM FULL was rewritten. Back-patch to 9.0, where hot standby was introduced. Note however that 9.0 had a significantly different WAL-logging scheme for GIST index updates, and it doesn't appear possible to make that scheme safe for concurrent hot standby queries, because it can leave inconsistent states in the index even between WAL records. Given the lack of complaints from the field, we won't work too hard on fixing that branch.	2012-11-12 22:05:53 -05:00
Heikki Linnakangas	dbdf9679d7	Use correct text domain for translating errcontext() messages. errcontext() is typically used in an error context callback function, not within an ereport() invocation like e.g errmsg and errdetail are. That means that the message domain that the TEXTDOMAIN magic in ereport() determines is not the right one for the errcontext() calls. The message domain needs to be determined by the C file containing the errcontext() call, not the file containing the ereport() call. Fix by turning errcontext() into a macro that passes the TEXTDOMAIN to use for the errcontext message. "errcontext" was used in a few places as a variable or struct field name, I had to rename those out of the way, now that errcontext is a macro. We've had this problem all along, but this isn't doesn't seem worth backporting. It's a fairly minor issue, and turning errcontext from a function to a macro requires at least a recompile of any external code that calls errcontext().	2012-11-12 17:07:29 +02:00
Alvaro Herrera	2f1692d213	Fix erroneous choice of timeline variable, too	2012-10-31 17:05:55 -03:00
Alvaro Herrera	9b8dd7e8aa	Fix erroneous choices of segNo variables Commit `dfda6eba` (which changed segment numbers to use a single 64 bit variable instead of log/seg) introduced a couple of bogus choices of exactly which log segment number variable to use in each case. This is currently pretty harmless; in one place, the bogus number was only being used in an error message for a pretty unlikely condition (failure to fsync a WAL segment file). In the other, it was using a global variable instead of the local variable; but all callsites were passing the value of the global variable anyway. No need to backpatch because that commit is not on earlier branches.	2012-10-31 11:05:28 -03:00
Heikki Linnakangas	2d8c81ac86	Fix silly bug in previous refactoring. I extracted the refactoring patch from a larger patch that contained other changes too, but missed one unintentional change and didn't test enough...	2012-10-09 19:33:12 +03:00
Heikki Linnakangas	ff8f160bf4	Put the logic to wait for WAL in standby mode to a separate function. This is just refactoring with no user-visible effect, to make the code more readable.	2012-10-09 19:20:17 +03:00
Heikki Linnakangas	1a956481ba	Fix typo in comment, and reword it slightly while we're at it.	2012-10-04 10:35:48 +03:00
Heikki Linnakangas	93b6d78cf0	Add #includes needed on some platforms in the new files. Hopefully this makes the *BSD buildfarm animals happy.	2012-10-02 17:19:52 +03:00
Heikki Linnakangas	d5497b95f3	Split off functions related to timeline history files and XLOG archiving. This is just refactoring, to make the functions accessible outside xlog.c. A followup patch will make use of that, to allow fetching timeline history files over streaming replication.	2012-10-02 13:37:19 +03:00
Heikki Linnakangas	ab9a14e903	Fix WAL file replacement during cascading replication on Windows. When the startup process restores a WAL file from the archive, it deletes any old file with the same name and renames the new file in its place. On Windows, however, when a file is deleted, it still lingers as long as a process holds a file handle open on it. With cascading replication, a walsender process can hold the old file open, so the rename() in the startup process would fail. To fix that, rename the old file to a temporary name, to make the original file name available for reuse, before deleting the old file.	2012-09-05 18:52:12 -07:00
Tom Lane	2e0cc1f031	Fix inappropriate error messages for Hot Standby misconfiguration errors. Give the correct name of the GUC parameter being complained of. Also, emit a more suitable SQLSTATE (INVALID_PARAMETER_VALUE, not the default INTERNAL_ERROR). Gurjeet Singh, errcode adjustment by me	2012-09-05 21:49:08 -04:00

... 2 3 4 5 6 ...

911 Commits