postgresql

Commit Graph

Author	SHA1	Message	Date
Alvaro Herrera	4aaddf2f00	Fix commit_ts for FrozenXid and BootstrapXid Previously, requesting commit timestamp for transactions FrozenTransactionId and BootstrapTransactionId resulted in an error. But since those values can validly appear in committed tuples' Xmin, this behavior is unhelpful and error prone: each caller would have to special-case those values before requesting timestamp data for an Xid. We already have a perfectly good interface for returning "the Xid you requested is too old for us to have commit TS data for it", so let's use that instead. Backpatch to 9.5, where commit timestamps appeared. Author: Craig Ringer Discussion: https://www.postgresql.org/message-id/CAMsr+YFM5Q=+ry3mKvWEqRTxrB0iU3qUSRnS28nz6FJYtBwhJg@mail.gmail.com	2016-11-24 15:39:55 -03:00
Robert Haas	e343dfa42b	Remove barrier.h A new thing also called a "barrier" is proposed, but whether we decide to take that patch or not, this file seems to have outlived its usefulness. Thomas Munro	2016-11-22 20:28:24 -05:00
Robert Haas	e8ac886c24	Support condition variables. Condition variables provide a flexible way to sleep until a cooperating process causes an arbitrary condition to become true. In simple cases, this can be accomplished with a WaitLatch/ResetLatch loop; the cooperating process can call SetLatch after performing work that might cause the condition to be satisfied, and the waiting process can recheck the condition each time. However, if the process performing the work doesn't have an easy way to identify which processes might be waiting, this doesn't work, because it can't identify which latches to set. Condition variables solve that problem by internally maintaining a list of waiters; a process that may have caused some waiter's condition to be satisfied must "signal" or "broadcast" on the condition variable. Robert Haas and Thomas Munro	2016-11-22 14:27:11 -05:00
Robert Haas	a43f1939d5	Remove or reduce verbosity of some debug messages. The debug messages that merely print StartTransactionCommand, CommitTransactionCommand, ProcessUtilty, or ProcessQuery with no additional details seem to be useless. Get rid of them. The transaction status messages produced by ShowTransactionState are occasionally useful, but they are extremely verbose, producing multiple lines of log output every time they fire, which can happens multiple times per transaction. So, reduce the level to DEBUG5; avoid emitting an extra line just to explain which debug point is at issue; and tighten up the rest of the message so it doesn't use quite so much horizontal space. With these changes, it's possible to run a somewhat busy system with a log level even as high as DEBUG4, whereas previously anything above DEBUG2 would flood the log with output that probably wasn't really all that useful.	2016-11-17 17:05:16 -05:00
Tom Lane	5485c99e7f	Fix silly nil-pointer-dereference bug introduced in commit `d5f6f13f8`. Don't fetch record->xl_info before we've verified that record isn't NULL. Per Coverity. Michael Paquier	2016-11-06 11:29:40 -05:00
Tom Lane	d5f6f13f8e	Be more consistent about masking xl_info with ~XLR_INFO_MASK. Generally, WAL resource managers are only supposed to examine the top 4 bits of a WAL record's xl_info; the rest are reserved for the WAL mechanism itself. A few places were not consistent about doing this with respect to XLOG_CHECKPOINT and XLOG_SWITCH records. There's no bug currently, since no additional bits ever get set in these specific record types, but that might not be true forever. Let's follow the generic coding rule here too. Michael Paquier	2016-11-04 13:26:49 -04:00
Robert Haas	33839b5ffb	Fix leftover reference to background writer performing checkpoints. This was changed in PostgreSQL 9.2, but somehow this comment never got updated.	2016-10-28 09:09:00 -04:00
Robert Haas	f267c1c244	Fix possible pg_basebackup failure on standby with "include WAL". If a restartpoint flushed no dirty buffers, it could fail to update the minimum recovery point, leading to a minimum recovery point prior to the starting REDO location. perform_base_backup() would interpret that as meaning that no WAL files at all needed to be included in the backup, failing an internal sanity check. To fix, have restartpoints always update the minimum recovery point to just after the checkpoint record itself, so that the file (or files) containing the checkpoint record will always be included in the backup. Code by Amit Kapila, per a design suggestion by me, with some additional work on the code comment by me. Test case by Michael Paquier. Report by Kyotaro Horiguchi.	2016-10-27 11:19:51 -04:00
Alvaro Herrera	00f15338b2	Preserve commit timestamps across clean restart An oversight in setting the boundaries of known commit timestamps during startup caused old commit timestamps to become inaccessible after a server restart. Author and reporter: Julien Rouhaud Review, test code: Craig Ringer	2016-10-24 09:45:48 -03:00
Robert Haas	919c811ca1	Fix comment formatting.	2016-10-21 12:04:21 -04:00
Robert Haas	f82ec32ac3	Rename "pg_xlog" directory to "pg_wal". "xlog" is not a particularly clear abbreviation for "write-ahead log", and it sometimes confuses users into believe that the contents of the "pg_xlog" directory are not critical data, leading to unpleasant consequences. So, rename the directory to "pg_wal". This patch modifies pg_upgrade and pg_basebackup to understand both the old and new directory layouts; the former is necessary given the purpose of the tool, while the latter merely avoids an unnecessary backward-compatibility break. We may wish to consider renaming other programs, switches, and functions which still use the old "xlog" naming to also refer to "wal". However, that's still under discussion, so let's do just this much for now. Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com Michael Paquier	2016-10-20 11:32:18 -04:00
Robert Haas	6f3bd98ebf	Extend framework from commit `53be0b1ad` to report latch waits. WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now taken an additional wait_event_info parameter; legal values are defined in pgstat.h. This makes it possible to uniquely identify every point in the core code where we are waiting for a latch; extensions can pass WAIT_EXTENSION. Because latches were the major wait primitive not previously covered by this patch, it is now possible to see information in pg_stat_activity on a large number of important wait events not previously addressed, such as ClientRead, ClientWrite, and SyncRep. Unfortunately, many of the wait events added by this patch will fail to appear in pg_stat_activity because they're only used in background processes which don't currently appear in pg_stat_activity. We should fix this either by creating a separate view for such information, or else by deciding to include them in pg_stat_activity after all. Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and Thomas Munro.	2016-10-04 11:01:42 -04:00
Peter Eisentraut	ebdf5bf7d1	Delay updating control file to "in production" Move the updating of the control file to "in production" status until the point where WAL writes are allowed. Before, there could be a significant gap between the control file update and write transactions actually being allowed. This makes it more reliable to use the control status to verify the end of a promotion. From: Michael Paquier <michael.paquier@gmail.com>	2016-09-21 12:00:00 -04:00
Robert Haas	445a38aba2	Have heapam.h include lockdefs.h rather than lock.h. lockdefs.h was only split from lock.h relatively recently, and represents a minimal subset of the old lock.h. heapam.h only needs that smaller subset, so adjust it to include only that. This requires some corresponding adjustments elsewhere. Peter Geoghegan	2016-09-13 09:21:35 -04:00
Simon Riggs	ec253de1fd	Fix corruption of 2PC recovery with subxacts Reading 2PC state files during recovery was borked, causing corruptions during recovery. Effect limited to servers with 2PC, subtransactions and recovery/replication. Stas Kelvich, reviewed by Michael Paquier and Pavan Deolasee	2016-09-09 11:55:12 +01:00
Simon Riggs	67c6bd1ca3	Fix minor memory leak in Standby startup StandbyRecoverPreparedTransactions() leaked the buffer used for two phase state file. This was leaked once at startup and at every shutdown checkpoint seen. Backpatch to 9.6 Stas Kelvich	2016-09-08 10:32:58 +01:00
Simon Riggs	35250b6ad7	New recovery target recovery_target_lsn Michael Paquier	2016-09-03 17:48:01 +01:00
Tom Lane	0e0f43d6fd	Prevent starting a standalone backend with standby_mode on. This can't really work because standby_mode expects there to be more WAL arriving, which there will not ever be because there's no WAL receiver process to fetch it. Moreover, if standby_mode is on then hot standby might also be turned on, causing even more strangeness because that expects read-only sessions to be executing in parallel. Bernd Helmle reported a case where btree_xlog_delete_get_latestRemovedXid got confused, but rather than band-aiding individual problems it seems best to prevent getting anywhere near this state in the first place. Back-patch to all supported branches. In passing, also fix some omissions of errcodes in other ereport's in readRecoveryCommandFile(). Michael Paquier (errcode hacking by me) Discussion: <00F0B2CEF6D0CEF8A90119D4@eje.credativ.lan>	2016-08-31 08:52:13 -04:00
Alvaro Herrera	8e1e3f958f	Split hash.h → hash_xlog.h Since the hash AM is going to be revamped to have WAL, this is a good opportunity to clean up the include file a little bit to avoid including a lot of extra stuff in the future. Author: Amit Kapila	2016-08-29 18:55:49 -03:00
Fujii Masao	bab7823a49	Fix pg_xlogdump so that it handles cross-page XLP_FIRST_IS_CONTRECORD record. Previously pg_xlogdump failed to dump the contents of the WAL file if the file starts with the continuation WAL record which spans more than one pages. Since pg_xlogdump assumed that the continuation record always fits on a page, it could not find the valid WAL record to start reading from in that case. This patch changes pg_xlogdump so that it can handle a continuation WAL record which crosses a page boundary and find the valid record to start reading from. Back-patch to 9.3 where pg_xlogdump was introduced. Author: Pavan Deolasee Reviewed-By: Michael Paquier and Craig Ringer Discussion: CABOikdPsPByMiG6J01DKq6om2+BNkxHTPkOyqHM2a4oYwGKsqQ@mail.gmail.com	2016-08-29 14:34:58 +09:00
Tom Lane	ea268cdc9a	Add macros to make AllocSetContextCreate() calls simpler and safer. I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls had typos in the context-sizing parameters. While none of these led to especially significant problems, they did create minor inefficiencies, and it's now clear that expecting people to copy-and-paste those calls accurately is not a great idea. Let's reduce the risk of future errors by introducing single macros that encapsulate the common use-cases. Three such macros are enough to cover all but two special-purpose contexts; those two calls can be left as-is, I think. While this patch doesn't in itself improve matters for third-party extensions, it doesn't break anything for them either, and they can gradually adopt the simplified notation over time. In passing, change TopMemoryContext to use the default allocation parameters. Formerly it could only be extended 8K at a time. That was probably reasonable when this code was written; but nowadays we create many more contexts than we did then, so that it's not unusual to have a couple hundred K in TopMemoryContext, even without considering various dubious code that sticks other things there. There seems no good reason not to let it use growing blocks like most other contexts. Back-patch to 9.6, mostly because that's still close enough to HEAD that it's easy to do so, and keeping the branches in sync can be expected to avoid some future back-patching pain. The bugs fixed by these changes don't seem to be significant enough to justify fixing them further back. Discussion: <21072.1472321324@sss.pgh.pa.us>	2016-08-27 17:50:38 -04:00
Tom Lane	78dcd027e8	Fix potential memory leakage from HandleParallelMessages(). HandleParallelMessages leaked memory into the caller's context. Since it's called from ProcessInterrupts, there is basically zero certainty as to what CurrentMemoryContext is, which means we could be leaking into long-lived contexts. Over the processing of many worker messages that would grow to be a problem. Things could be even worse than just a leak, if we happened to service the interrupt while ErrorContext is current: elog.c thinks it can reset that on its own whim, possibly yanking storage out from under HandleParallelMessages. Give HandleParallelMessages its own dedicated context instead, which we can reset during each call to ensure there's no accumulation of wasted memory. Discussion: <16610.1472222135@sss.pgh.pa.us>	2016-08-26 15:04:05 -04:00
Tom Lane	fbf28b6b52	Fix logic for adding "parallel worker" context line to worker errors. The previous coding here was capable of adding a "parallel worker" context line to errors that were not, in fact, returned from a parallel worker. Instead of using an errcontext callback to add that annotation, just paste it onto the message by hand; this looks uglier but is more reliable. Discussion: <19757.1472151987@sss.pgh.pa.us>	2016-08-26 10:07:28 -04:00
Peter Eisentraut	f0fe1c8f70	Fix typos From: Alexander Law <exclusion@gmail.com>	2016-08-16 14:52:29 -04:00
Tom Lane	b6a97b91ff	Block interrupts during HandleParallelMessages(). As noted by Alvaro, there are CHECK_FOR_INTERRUPTS() calls in the shm_mq.c functions called by HandleParallelMessages(). I believe they're all unreachable since we always pass nowait = true, but it doesn't seem like a great idea to assume that no such call will ever be reachable from HandleParallelMessages(). If that did happen, there would be a risk of a recursive call to HandleParallelMessages(), which it does not appear to be designed for --- for example, there's nothing that would prevent out-of-order processing of received messages. And certainly such cases cannot easily be tested. So let's prevent it by holding off interrupts for the duration of the function. Back-patch to 9.5 which contains identical code. Discussion: <14869.1470083848@sss.pgh.pa.us>	2016-08-02 16:39:16 -04:00
Tom Lane	a5fe473ad7	Minor cleanup for access/transam/parallel.c. ParallelMessagePending must be marked volatile, because it's set by a signal handler. On the other hand, it's pointless for HandleParallelMessageInterrupt to save/restore errno; that must be, and is, done at the outer level of the SIGUSR1 signal handler. Calling CHECK_FOR_INTERRUPTS() inside HandleParallelMessages, which itself is called from CHECK_FOR_INTERRUPTS(), seems both useless and hazardous. The comment claiming that this is needed to handle the error queue going away is certainly misguided, in any case. Improve a couple of error message texts, and use ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE to report loss of parallel worker connection, since that's what's used in e.g. tqueue.c. (Maybe it would be worth inventing a dedicated ERRCODE for this type of failure? But I do not think ERRCODE_INTERNAL_ERROR is appropriate.) Minor stylistic cleanups.	2016-08-01 16:12:01 -04:00
Peter Eisentraut	ef5d4a3cfa	Message style improvements	2016-07-28 16:34:44 -04:00
Robert Haas	1091402b5a	Remove unused structure member. Michael Paquier	2016-07-21 11:53:44 -04:00
Magnus Hagander	87d84d67bb	Fix start WAL filename for concurrent backups from standby On a standby, ThisTimelineID is always 0, so we would generate a filename in timeline 0 even for other timelines. Instead, use starttli which we have retreived from the controlfile. Report by: Francesco Canovai in bug #14230 Author: Marco Nenciarini Reviewed by: Michael Paquier and Amit Kapila	2016-07-11 12:02:31 +02:00
Robert Haas	10c0558ffe	Fix several mistakes around parallel workers and client_encoding. Previously, workers sent data to the leader using the client encoding. That mostly worked, but the leader the converted the data back to the server encoding. Since not all encoding conversions are reversible, that could provoke failures. Fix by using the database encoding for all communication between worker and leader. Also, while temporary changes to GUC settings, as from the SET clause of a function, are in general OK for parallel query, changing client_encoding this way inside of a parallel worker is not OK. Previously, that would have confused the leader; with these changes, it would not confuse the leader, but it wouldn't do anything either. So refuse such changes in parallel workers. Also, the previous code naively assumed that when it received a NotifyResonse from the worker, it could pass that directly back to the user. But now that worker-to-leader communication always uses the database encoding, that's clearly no longer correct - though, actually, the old way was always broken for V2 clients. So disassemble and reconstitute the message instead. Issues reported by Peter Eisentraut. Patch by me, reviewed by Peter Eisentraut.	2016-06-30 18:35:32 -04:00
Alvaro Herrera	e3ad3ffa68	Fix handling of multixacts predating pg_upgrade After pg_upgrade, it is possible that some tuples' Xmax have multixacts corresponding to the old installation; such multixacts cannot have running members anymore. In many code sites we already know not to read them and clobber them silently, but at least when VACUUM tries to freeze a multixact or determine whether one needs freezing, there's an attempt to resolve it to its member transactions by calling GetMultiXactIdMembers, and if the multixact value is "in the future" with regards to the current valid multixact range, an error like this is raised: ERROR: MultiXactId 123 has not been created yet -- apparent wraparound and vacuuming fails. Per discussion with Andrew Gierth, it is completely bogus to try to resolve multixacts coming from before a pg_upgrade, regardless of where they stand with regards to the current valid multixact range. It's possible to get from under this problem by doing SELECT FOR UPDATE of the problem tuples, but if tables are large, this is slow and tedious, so a more thorough solution is desirable. To fix, we realize that multixacts in xmax created in 9.2 and previous have a specific bit pattern that is never used in 9.3 and later (we already knew this, per comments and infomask tests sprinkled in various places, but we weren't leveraging this knowledge appropriately). Whenever the infomask of the tuple matches that bit pattern, we just ignore the multixact completely as if Xmax wasn't set; or, in the case of tuple freezing, we act as if an unwanted value is set and clobber it without decoding. This guarantees that no errors will be raised, and that the values will be progressively removed until all tables are clean. Most callers of GetMultiXactIdMembers are patched to recognize directly that the value is a removable "empty" multixact and avoid calling GetMultiXactIdMembers altogether. To avoid changing the signature of GetMultiXactIdMembers() in back branches, we keep the "allow_old" boolean flag but rename it to "from_pgupgrade"; if the flag is true, we always return an empty set instead of looking up the multixact. (I suppose we could remove the argument in the master branch, but I chose not to do so in this commit). This was broken all along, but the error-facing message appeared first because of commit `8e9a16ab8f` and was partially fixed in `a25c2b7c4d`. This fix, backpatched all the way back to 9.3, goes approximately in the same direction as `a25c2b7c4d` but should cover all cases. Bug analysis by Andrew Gierth and Álvaro Herrera. A number of public reports match this bug: https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.net https://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.com https://www.postgresql.org/message-id/556439CF.7070109@pscs.co.uk https://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.com https://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org	2016-06-24 18:29:28 -04:00
Alvaro Herrera	1ca4a1b5d2	Finish up XLOG_HINT renaming Commit `b8fd1a09f3` renamed XLOG_HINT to XLOG_FPI, but neglected two places. Backpatch to 9.3, like that commit.	2016-06-17 18:05:55 -04:00
Robert Haas	292794f82b	Remove PID from 'parallel worker' context message. Discussion: <bfd204ab-ab1a-792a-b345-0274a09a4b5f@2ndquadrant.com>	2016-06-17 09:26:17 -04:00
Tom Lane	bfb937427b	Fix fuzzy thinking in ReinitializeParallelDSM(). The fact that no workers were successfully launched in the previous iteration does not excuse us from setting up properly to try again. This appears to explain crashes I saw in parallel regression testing due to error_mqh being NULL when it shouldn't be. Minor other cosmetic fixes too.	2016-06-16 15:20:29 -04:00
Tom Lane	cae1c788b9	Improve the situation for parallel query versus temp relations. Transmit the leader's temp-namespace state to workers. This is important because without it, the workers do not really have the same search path as the leader. For example, there is no good reason (and no extant code either) to prevent a worker from executing a temp function that the leader created previously; but as things stood it would fail to find the temp function, and then either fail or execute the wrong function entirely. We still prohibit a worker from creating a temp namespace on its own. In effect, a worker can only see the session's temp namespace if the leader had created it before starting the worker, which seems like the right semantics. Also, transmit the leader's BackendId to workers, and arrange for workers to use that when determining the physical file path of a temp relation belonging to their session. While the original intent was to prevent such accesses entirely, there were a number of holes in that, notably in places like dbsize.c which assume they can safely access temp rels of other sessions anyway. We might as well get this right, as a small down payment on someday allowing workers to access the leader's temp tables. (With this change, directly using "MyBackendId" as a relation or buffer backend ID is deprecated; you should use BackendIdForTempRelations() instead. I left a couple of such uses alone though, as they're not going to be reachable in parallel workers until we do something about localbuf.c.) Move the thou-shalt-not-access-thy-leader's-temp-tables prohibition down into localbuf.c, which is where it actually matters, instead of having it in relation_open(). This amounts to recognizing that access to temp tables' catalog entries is perfectly safe in a worker, it's only the data in local buffers that is problematic. Having done all that, we can get rid of the test in has_parallel_hazard() that says that use of a temp table's rowtype is unsafe in parallel workers. That test was unduly expensive, and if we really did need such a prohibition, that was not even close to being a bulletproof guard for it. (For example, any user-defined function executed in a parallel worker might have attempted such access.)	2016-06-09 20:16:11 -04:00
Robert Haas	4bc424b968	pgindent run for 9.6	2016-06-09 18:02:36 -04:00
Peter Eisentraut	5c6d2a5e7c	Message style and wording fixes	2016-06-07 14:18:55 -04:00
Greg Stark	e1623c3959	Fix various common mispellings. Mostly these are just comments but there are a few in documentation and a handful in code and tests. Hopefully this doesn't cause too much unnecessary pain for backpatching. I relented from some of the most common like "thru" for that reason. The rest don't seem numerous enough to cause problems. Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings	2016-06-03 16:08:45 +01:00
Teodor Sigaev	7c979c95a3	Allocate all page images at once in generic wal interface That reduces number of allocation. Per gripe from Michael Paquier and Tom Lane suggestion.	2016-05-17 22:09:22 +03:00
Teodor Sigaev	7c8345f67f	Correctly align page's images in generic wal API Page image should be MAXALIGN'ed because existing code could directly align pointers in page instead of align offset from beginning of page. Found during play with indexes as extenstion, Alexander Korotkov and me	2016-05-17 00:01:35 +03:00
Alvaro Herrera	cca2a27860	Fix bogus comments Some comments mentioned XLogReplayBuffer, but there's no such function: that was an interim name for a function that got renamed to XLogReadBufferForRedo, before commit `2c03216d83` was pushed.	2016-05-12 16:07:07 -03:00
Tom Lane	9eb7a0ac6b	Fix poorly-worded log message. Euler Taveira	2016-05-08 01:37:07 -04:00
Robert Haas	06bd458cb8	Use mul_size when multiplying by the number of parallel workers. That way, if the result overflows size_t, you'll get an error instead of undefined behavior, which seems like a plus. This also has the effect of casting the number of workers from int to Size, which is better because it's harder to overflow int than size_t. Dilip Kumar reported this issue and provided a patch upon which this patch is based, but his version did use mul_size.	2016-05-06 14:32:58 -04:00
Alvaro Herrera	c1543a81a7	Revert timeline following in replication slots This reverts commits `f07d18b6e9`, `82c83b3372`, `3a3b309041`, and `24c5f1a103`. This feature has shown enough immaturity that it was deemed better to rip it out before rushing some more fixes at the last minute. There are discussions on larger changes in this area for the next release.	2016-05-04 17:32:22 -03:00
Andres Freund	c6ff84b06a	Emit invalidations to standby for transactions without xid. So far, when a transaction with pending invalidations, but without an assigned xid, committed, we simply ignored those invalidation messages. That's problematic, because those are actually sent for a reason. Known symptoms of this include that existing sessions on a hot-standby replica sometimes fail to notice new concurrently built indexes and visibility map updates. The solution is to WAL log such invalidations in transactions without an xid. We considered to alternatively force-assign an xid, but that'd be problematic for vacuum, which might be run in systems with few xids. Important: This adds a new WAL record, but as the patch has to be back-patched, we can't bump the WAL page magic. This means that standbys have to be updated before primaries; otherwise "PANIC: standby_redo: unknown op code 32" errors can be encountered. XXX: Reported-By: Васильев Дмитрий, Masahiko Sawada Discussion: CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com	2016-04-26 20:21:54 -07:00
Kevin Grittner	a343e223a5	Revert no-op changes to BufferGetPage() The reverted changes were intended to force a choice of whether any newly-added BufferGetPage() calls needed to be accompanied by a test of the snapshot age, to support the "snapshot too old" feature. Such an accompanying test is needed in about 7% of the cases, where the page is being used as part of a scan rather than positioning for other purposes (such as DML or vacuuming). The additional effort required for back-patching, and the doubt whether the intended benefit would really be there, have indicated it is best just to rely on developers to do the right thing based on comments and existing usage, as we do with many other conventions. This change should have little or no effect on generated executable code. Motivated by the back-patching pain of Tom Lane and Robert Haas	2016-04-20 08:31:19 -05:00
Tom Lane	5713f03973	Improve API of GenericXLogRegister(). Rename this function to GenericXLogRegisterBuffer() to make it clearer what it does, and leave room for other sorts of "register" actions in future. Also, replace its "bool isNew" argument with an integer flags argument, so as to allow adding more flags in future without an API break. Alexander Korotkov, adjusted slightly by me	2016-04-12 11:42:06 -04:00
Tom Lane	bdf7db8192	In generic WAL application and replay, ensure page "hole" is always zero. The previous coding could allow the contents of the "hole" between pd_lower and pd_upper to diverge during replay from what it had been when the update was originally applied. This would pose a problem if checksums were in use, and in any case would complicate forensic comparisons between master and slave servers. So force the "hole" to contain zeroes, both at initial application of a generically-logged action, and at replay. Alexander Korotkov, adjusted slightly by me	2016-04-12 11:14:00 -04:00
Tom Lane	660d5fb856	Further minor improvement in generic_xlog.c: always say REGBUF_STANDARD. Since we're requiring pages handled by generic_xlog.c to be standard format, specify REGBUF_STANDARD when doing a full-page image, so that xloginsert.c can compress out the "hole" between pd_lower and pd_upper. Given the current API in which this path will be taken only for a newly initialized page, the hole is likely to be particularly large in such cases, so that this oversight could easily be performance-significant. I don't notice any particular change in the runtime of contrib/bloom's regression test, though.	2016-04-10 00:24:28 -04:00
Tom Lane	68689c66ef	Micro-optimize GenericXLogFinish(). Make the inner comparison loops of computeDelta() as tight as possible by pulling considerations of valid and invalid ranges out of the inner loops, and extending a match or non-match detection as far as possible before deciding what to do next. To keep this tractable, give up the possibility of merging fragments across the pd_lower to pd_upper gap. The fraction of pages where that could happen (ie, there are 4 or fewer bytes in the gap, and data changes immediately adjacent to it on both sides) is too small to be worth spending cycles on. Also, avoid two BLCKSZ-length memcpy()s by computing the delta before moving data into the target buffer, instead of after. This doesn't save nearly as many cycles as being tenser about computeDelta(), but it still seems worth doing. On my machine, this patch cuts a full 40% off the runtime of contrib/bloom's regression test.	2016-04-09 19:30:56 -04:00
Tom Lane	08e785436f	Get rid of GenericXLogUnregister(). This routine is unsafe as implemented, because it invalidates the page image pointers returned by previous GenericXLogRegister() calls. Rather than complicate the API or the implementation to avoid that, let's just get rid of it; the use-case for having it seems much too thin to justify a lot of work here. While at it, do some wordsmithing on the SGML docs for generic WAL.	2016-04-09 16:39:30 -04:00
Tom Lane	db03cf375d	Code review/prettification for generic_xlog.c. Improve commentary, use more specific names for the delta fields, const-ify pointer arguments where possible, avoid assuming that initializing only the first element of a local array will guarantee that the remaining elements end up as we need them. (I think that code in generic_redo actually worked, but only because InvalidBuffer is zero; this is a particularly ugly way of depending on that ...)	2016-04-09 15:02:19 -04:00
Tom Lane	2dd318d277	Run pgindent on generic_xlog.c. This code desperately needs some micro-optimization, and I'd like it to be formatted a bit more nicely while I work on it.	2016-04-09 13:33:33 -04:00
Kevin Grittner	8b65cf4c5e	Modify BufferGetPage() to prepare for "snapshot too old" feature This patch is a no-op patch which is intended to reduce the chances of failures of omission once the functional part of the "snapshot too old" patch goes in. It adds parameters for snapshot, relation, and an enum to specify whether the snapshot age check needs to be done for the page at this point. This initial patch passes NULL for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the third. The follow-on patch will change the places where the test needs to be made.	2016-04-08 14:30:10 -05:00
Andres Freund	5364b357fb	Increase maximum number of clog buffers. Benchmarking has shown that the current number of clog buffers limits scalability. We've previously increased the number in `33aaa139`, but that's not sufficient with a large number of clients. We've benchmarked the cost of increasing the limit by benchmarking worst case scenarios; testing showed that 128 buffers don't cause a regression, even in contrived scenarios, whereas 256 does There are a number of more complex patches flying around to address various clog scalability problems, but this is simple enough that we can get it into 9.6; and is beneficial even after those patches have been applied. It is a bit unsatisfactory to increase this in small steps every few releases, but a better solution seems to require a rewrite of slru.c; not something done quickly. Author: Amit Kapila and Andres Freund Discussion: CAA4eK1+-=18HOrdqtLXqOMwZDbC_15WTyHiFruz7BvVArZPaAw@mail.gmail.com	2016-04-08 08:25:59 -07:00
Stephen Frost	1574783b4c	Use GRANT system to manage access to sensitive functions Now that pg_dump will properly dump out any ACL changes made to functions which exist in pg_catalog, switch to using the GRANT system to manage access to those functions. This means removing 'if (!superuser()) ereport()' checks from the functions themselves and then REVOKEing EXECUTE right from 'public' for these functions in system_views.sql. Reviews by Alexander Korotkov, Jose Luis Tallon	2016-04-06 21:45:32 -04:00
Simon Riggs	3fe3511d05	Generic Messages for Logical Decoding API and mechanism to allow generic messages to be inserted into WAL that are intended to be read by logical decoding plugins. This commit adds an optional new callback to the logical decoding API. Messages are either text or bytea. Messages can be transactional, or not, and are identified by a prefix to allow multiple concurrent decoding plugins. (Not to be confused with Generic WAL records, which are intended to allow crash recovery of extensible objects.) Author: Petr Jelinek and Andres Freund Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs Discussion: 5685F999.6010202@2ndquadrant.com	2016-04-06 10:05:41 +01:00
Magnus Hagander	7117685461	Implement backup API functions for non-exclusive backups Previously non-exclusive backups had to be done using the replication protocol and pg_basebackup. With this commit it's now possible to make them using pg_start_backup/pg_stop_backup as well, as long as the backup program can maintain a persistent connection to the database. Doing this, backup_label and tablespace_map are returned as results from pg_stop_backup() instead of being written to the data directory. This makes the server safe from a crash during an ongoing backup, which can be a problem with exclusive backups. The old syntax of the functions remain and work exactly as before, but since the new syntax is safer this should eventually be deprecated and removed. Only reference documentation is included. The main section on backup still needs to be rewritten to cover this, but since that is already scheduled for a separate large rewrite, it's not included in this patch. Reviewed by David Steele and Amit Kapila	2016-04-05 20:03:49 +02:00
Alvaro Herrera	890614d2b3	Display WAL pointer in rm_redo error callback This makes it easier to identify the source of a recovery problem in case of a bug or data corruption.	2016-04-04 18:12:12 -03:00
Alvaro Herrera	c9ff752a85	Silence compiler warning Reported by Peter Eisentraut to occur on 32bit systems	2016-04-04 17:07:23 -03:00
Teodor Sigaev	65578341af	Add Generic WAL interface This interface is designed to give an access to WAL for extensions which could implement new access method, for example. Previously it was impossible because restoring from custom WAL would need to access system catalog to find a redo custom function. This patch suggests generic way to describe changes on page with standart layout. Bump XLOG_PAGE_MAGIC because of new record type. Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and minor editorization by my Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby, Michael Paquier	2016-04-01 12:21:48 +03:00
Alvaro Herrera	24c5f1a103	Enable logical slots to follow timeline switches When decoding from a logical slot, it's necessary for xlog reading to be able to read xlog from historical (i.e. not current) timelines; otherwise, decoding fails after failover, because the archives are in the historical timeline. This is required to make "failover logical slots" possible; it currently has no other use, although theoretically it could be used by an extension that creates a slot on a standby and continues to replay from the slot when the standby is promoted. This commit includes a module in src/test/modules with functions to manipulate the slots (which is not otherwise possible in SQL code) in order to enable testing, and a new test in src/test/recovery to ensure that the behavior is as expected. Author: Craig Ringer Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek	2016-03-30 20:07:05 -03:00
Alvaro Herrera	3b02ea4f07	XLogReader general code cleanup Some minor tweaks and comment additions, for cleanliness sake and to avoid having the upcoming timeline-following patch be polluted with unrelated cleanup. Extracted from a larger patch by Craig Ringer, reviewed by Andres Freund, with some additions by myself.	2016-03-30 18:56:13 -03:00
Robert Haas	314cbfc5da	Add new replication mode synchronous_commit = 'remote_apply'. In this mode, the master waits for the transaction to be applied on the remote side, not just written to disk. That means that you can count on a transaction started on the standby to see all commits previously acknowledged by the master. To make this work, the standby sends a reply after replaying each commit record generated with synchronous_commit >= 'remote_apply'. This introduces a small inefficiency: the extra replies will be sent even by standbys that aren't the current synchronous standby. But previously-existing synchronous_commit levels make no attempt at all to optimize which replies are sent based on what the primary cares about, so this is no worse, and at least avoids any extra replies for people not using the feature at all. Thomas Munro, reviewed by Michael Paquier and by me. Some additional tweaks by me.	2016-03-29 21:29:49 -04:00
Andres Freund	1a7a43672b	Don't use !! but != 0/NULL to force boolean evaluation. I introduced several uses of !! to force bit arithmetic to be boolean, but per discussion the project prefers != 0/NULL. Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com	2016-03-27 18:10:19 +02:00
Peter Eisentraut	b555ed8102	Merge wal_level "archive" and "hot_standby" into new name "replica" The distinction between "archive" and "hot_standby" existed only because at the time "hot_standby" was added, there was some uncertainty about stability. This is now a long time ago. We would like to move forward with simplifying the replication configuration, but this distinction is in the way, because a primary server cannot tell (without asking a standby or predicting the future) which one of these would be the appropriate level. Pick a new name for the combined setting to make it clearer that it covers all (non-logical) backup and replication uses. The old values are still accepted but are converted internally. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reviewed-by: David Steele <david@pgmasters.net>	2016-03-18 23:56:03 +01:00
Robert Haas	3aff33aa68	Fix typos. Oskari Saarenmaa	2016-03-15 18:06:11 -04:00
Alvaro Herrera	5bcc413f80	Fix typos in comments	2016-03-15 17:57:17 -03:00
Robert Haas	481c76abf4	Fix a typo, and remove unnecessary pgstat_report_wait_end(). Per Amit Kapila.	2016-03-11 07:34:00 -05:00
Robert Haas	53be0b1add	Provide much better wait information in pg_stat_activity. When a process is waiting for a heavyweight lock, we will now indicate the type of heavyweight lock for which it is waiting. Also, you can now see when a process is waiting for a lightweight lock - in which case we will indicate the individual lock name or the tranche, as appropriate - or for a buffer pin. Amit Kapila, Ildus Kurbangaliev, reviewed by me. Lots of helpful discussion and suggestions by many others, including Alexander Korotkov, Vladimir Borodin, and many others.	2016-03-10 12:44:09 -05:00
Simon Riggs	e0694cf9c7	Reduce size of two phase file header Previously 2PC header was fixed at 200 bytes, which in most cases wasted WAL space for a workload using 2PC heavily. Pavan Deolasee, reviewed by Petr Jelinek	2016-03-10 12:51:46 +00:00
Andres Freund	1d4a0ab19a	Avoid unlikely data-loss scenarios due to rename() without fsync. Renaming a file using rename(2) is not guaranteed to be durable in face of crashes. Use the previously added durable_rename()/durable_link_or_rename() in various places where we previously just renamed files. Most of the changed call sites are arguably not critical, but it seems better to err on the side of too much durability. The most prominent known case where the previously missing fsyncs could cause data loss is crashes at the end of a checkpoint. After the actual checkpoint has been performed, old WAL files are recycled. When they're filled, their contents are fdatasynced, but we did not fsync the containing directory. An OS/hardware crash in an unfortunate moment could then end up leaving that file with its old name, but new content; WAL replay would thus not replay it. Reported-By: Tomas Vondra Author: Michael Paquier, Tomas Vondra, Andres Freund Discussion: 56583BDD.9060302@2ndquadrant.com Backpatch: All supported branches	2016-03-09 18:53:53 -08:00
Robert Haas	b6fb6471f6	Add a generic command progress reporting facility. Using this facility, any utility command can report the target relation upon which it is operating, if there is one, and up to 10 64-bit counters; the intent of this is that users should be able to figure out what a utility command is doing without having to resort to ugly hacks like attaching strace to a backend. As a demonstration, this adds very crude reporting to lazy vacuum; we just report the target relation and nothing else. A forthcoming patch will make VACUUM report a bunch of additional data that will make this much more interesting. But this gets the basic framework in place. Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao, and Masanori Oyama.	2016-03-09 12:08:58 -05:00
Fujii Masao	d34794f7d5	Ignore recovery_min_apply_delay until recovery has reached consistent state Previously recovery_min_apply_delay was applied even before recovery had reached consistency. This could cause us to wait a long time unexpectedly for read-only connections to be allowed. It's problematic because the standby was useless during that wait time. This patch changes recovery_min_apply_delay so that it's applied once the database has reached the consistent state. That is, even if the delay is set, the standby tries to replay WAL records as fast as possible until it has reached consistency. Author: Michael Paquier Reviewed-By: Julien Rouhaud Reported-By: Greg Clough Backpatch: 9.4, where recovery_min_apply_delay was added Bug: #13770 Discussion: http://www.postgresql.org/message-id/20151111155006.2644.84564@wrigleys.postgresql.org	2016-03-06 02:29:04 +09:00
Robert Haas	6fcde8a5c8	Minor improvements to transaction manager README. A simple SELECT is handled by PortalRunSelect, not ProcessQuery. Also, the previous indentation was unclear: change it so that a deeper level of indentation indicates that the outer function calls the inner one. Stas Kelvich	2016-03-04 14:12:28 -05:00
Robert Haas	df4685fb0c	Minor optimizations based on ParallelContext having nworkers_launched. Originally, we didn't have nworkers_launched, so code that used parallel contexts had to be preprared for the possibility that not all of the workers requested actually got launched. But now we can count on knowing the number of workers that were successfully launched, which can shave off a few cycles and simplify some code slightly. Amit Kapila, reviewed by Haribabu Kommi, per a suggestion from Peter Geoghegan.	2016-03-04 12:59:10 -05:00
Simon Riggs	481725c0ba	Correct StartupSUBTRANS for page wraparound StartupSUBTRANS() incorrectly handled cases near the max pageid in the subtrans data structure, which in some cases could lead to errors in startup for Hot Standby. This patch wraps the pageids correctly, avoiding any such errors. Identified by exhaustive crash testing by Jeff Janes. Jeff Janes	2016-02-19 08:31:12 +00:00
Andres Freund	7975c5e0a9	Allow the WAL writer to flush WAL at a reduced rate. Commit `4de82f7d7` increased the WAL flush rate, mainly to increase the likelihood that hint bits can be set quickly. More quickly set hint bits can reduce contention around the clog et al. But unfortunately the increased flush rate can have a significant negative performance impact, I have measured up to a factor of ~4. The reason for this slowdown is that if there are independent writes to the underlying devices, for example because shared buffers is a lot smaller than the hot data set, or because a checkpoint is ongoing, the fdatasync() calls force cache flushes to be emitted to the storage. This is achieved by flushing WAL only if the last flush was longer than wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC) unflushed blocks are pending. Based on some tests the default for wal_writer_delay is 1MB, which seems to work well both on SSD and rotational media. To avoid negative performance impact due to `4de82f7d7` an earlier commit (`db76b1e`) made SetHintBits() more likely to succeed; preventing performance regressions in the pgbench tests I performed. Discussion: 20160118163908.GW10941@awork2.anarazel.de	2016-02-16 00:56:34 +01:00
Joe Conway	59a884e985	Change delimiter used for display of NextXID NextXID has been rendered in the form of a pg_lsn even though it really is not. This can cause confusion, so change the format from %u/%u to %u:%u, per discussion on hackers. Complaint by me, patch by me and Bruce, reviewed by Michael Paquier and Alvaro. Applied to HEAD only. Author: Joe Conway, Bruce Momjian Reviewed-by: Michael Paquier, Alvaro Herrera Backpatch-through: master	2016-02-12 14:23:59 -08:00
Robert Haas	63461a63f9	Make builtin lwlock tranche names consistent. Previously, we had a mix of styles. Amit Kapila	2016-02-12 08:07:11 -05:00
Tom Lane	d18643c4a6	Shift the responsibility for emitting "database system is shut down". Historically this message has been emitted at the end of ShutdownXLOG(). That's not an insane place for it in a standalone backend, but in the postmaster environment we've grown a fair amount of stuff that happens later, including archiver/walsender shutdown, stats collector shutdown, etc. Recent buildfarm experimentation showed that on slower machines there could be many seconds' delay between finishing ShutdownXLOG() and actual postmaster exit. That's fairly confusing, both for testing purposes and for DBAs. Hence, move the code that prints this message into UnlinkLockFiles(), so that it comes out just after we remove the postmaster's pidfile. That is a more appropriate definition of "is shut down" from the point of view of "pg_ctl stop", for example. In general, removing the pidfile should be the last externally-visible action of either a postmaster or a standalone backend; compare commit `d73d14c271` for instance. So this seems like a reasonably future-proof approach.	2016-02-11 14:14:22 -05:00
Tom Lane	c5e9b77127	Revert "Temporarily make pg_ctl and server shutdown a whole lot chattier." This reverts commit `3971f64843` and a couple of followon debugging commits; I think we've learned what we can from them.	2016-02-10 16:01:04 -05:00
Tom Lane	7351e18286	Add more chattiness in server shutdown. Early returns from the buildfarm show that there's a bit of a gap in the logging I added in 3971f64843b02e4a: the portion of CreateCheckPoint() after CheckPointGuts() can take a fair amount of time. Add a few more log messages in that section of code. This too shall be reverted later.	2016-02-09 11:21:46 -05:00
Tom Lane	3971f64843	Temporarily make pg_ctl and server shutdown a whole lot chattier. This is a quick hack, due to be reverted when its purpose has been served, to try to gather information about why some of the buildfarm critters regularly fail with "postmaster does not shut down" complaints. Maybe they are just really overloaded, but maybe something else is going on. Hence, instrument pg_ctl to print the current time when it starts waiting for postmaster shutdown and when it gives up, and add a lot of logging of the current time in the server's checkpoint and shutdown code paths. No attempt has been made to make this pretty. I'm not even totally sure if it will build on Windows, but we'll soon find out.	2016-02-08 18:43:11 -05:00
Robert Haas	7c944bd903	Introduce a new GUC force_parallel_mode for testing purposes. When force_parallel_mode = true, we enable the parallel mode restrictions for all queries for which this is believed to be safe. For the subset of those queries believed to be safe to run entirely within a worker, we spin up a worker and run the query there instead of running it in the original process. When force_parallel_mode = regress, make additional changes to allow the regression tests to run cleanly even though parallel workers have been injected under the hood. Taken together, this facilitates both better user testing and better regression testing of the parallelism code. Robert Haas, with help from Amit Kapila and Rushabh Lathia.	2016-02-07 11:41:33 -05:00
Robert Haas	a1c1af2a1f	Introduce group locking to prevent parallel processes from deadlocking. For locking purposes, we now regard heavyweight locks as mutually non-conflicting between cooperating parallel processes. There are some possible pitfalls to this approach that are not to be taken lightly, but it works OK for now and can be changed later if we find a better approach. Without this, it's very easy for parallel queries to silently self-deadlock if the user backend holds strong relation locks. Robert Haas, with help from Amit Kapila. Thanks to Noah Misch and Andres Freund for extensive discussion of possible issues with this approach.	2016-02-07 10:16:13 -05:00
Robert Haas	7191ce8bea	Make all built-in lwlock tranche IDs fixed. This makes the values more stable, which seems like a good thing for anybody who needs to look at at them. Alexander Korotkov and Amit Kapila	2016-02-02 06:45:55 -05:00
Heikki Linnakangas	61ce1e8f15	Fix misspelled function name in comment.	2016-02-01 10:10:24 +02:00
Simon Riggs	978b2f65aa	Speedup 2PC by skipping two phase state files in normal path 2PC state info is written only to WAL at PREPARE, then read back from WAL at COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr checkpoint cycle will be written to disk in the same form as previously. Crash recovery path is not altered. Measured performance gains of 50-100% for short 2PC transactions by completely avoiding writing files and fsyncing. Other optimizations still available, further patches in related areas expected. Stas Kelvich and heavily edited by Simon Riggs Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas, a concrete example of how Postgres-XC has fed back ideas into PostgreSQL. Reviewed by Michael Paquier, Jeff Janes and Andres Freund Performance testing by Jesper Pedersen	2016-01-20 18:40:44 -08:00
Simon Riggs	422a55a687	Refactor to create generic WAL page read callback Previously we didn’t have a generic WAL page read callback function, surprisingly. Logical decoding has logical_read_local_xlog_page(), which was actually generic, so move that to xlogfunc.c and rename to read_local_xlog_page(). Maintain logical_read_local_xlog_page() so existing callers still work. As requested by Michael Paquier, Alvaro Herrera and Andres Freund	2016-01-20 17:18:58 -08:00
Simon Riggs	e63bb4549a	Add new user fn pg_current_xlog_flush_location() Tomas Vondra, reviewed by Michael Paquier and Amit Kapila Minor edits by me	2016-01-12 07:54:52 +00:00
Simon Riggs	1e29e6324c	Maintain local LogwrtResult consistently Teach GetFlushRecPtr() to update LogwrtResult cache as performed by all other functions in xlog.c	2016-01-12 07:33:20 +00:00
Bruce Momjian	ee94300446	Update copyright for 2016 Backpatch certain files through 9.1	2016-01-02 13:33:40 -05:00
Noah Misch	3cd1ba147e	Fix comments about WAL rule "write xlog before data" versus pg_multixact. Recovery does not achieve its goal of zeroing all pg_multixact entries whose accompanying WAL records never reached disk. Remove that claim and justify its expendability. Detail the need for TrimMultiXact(), which has little in common with the TrimCLOG() rationale. Merge two tightly-related comments. Stop presenting pg_multixact as specific to heap_lock_tuple(); PostgreSQL 9.3 extended its use to heap_update(). Noticed while investigating a report from Andres Freund.	2016-01-01 01:46:46 -05:00
Joe Conway	241448b23a	Rename (new\|old)estCommitTs to (new\|old)estCommitTsXid The variables newestCommitTs and oldestCommitTs sound as if they are timestamps, but in fact they are the transaction Ids that correspond to the newest and oldest timestamps rather than the actual timestamps. Rename these variables to reflect that they are actually xids: to wit newestCommitTsXid and oldestCommitTsXid respectively. Also modify related code in a similar fashion, particularly the user facing output emitted by pg_controldata and pg_resetxlog. Complaint and patch by me, review by Tom Lane and Alvaro Herrera. Backpatch to 9.5 where these variables were first introduced.	2015-12-28 12:34:11 -08:00
Robert Haas	3fed417452	Provide a way to predefine LWLock tranche IDs. It's a bit cumbersome to use LWLockNewTrancheId(), because the returned value needs to be shared between backends so that each backend can call LWLockRegisterTranche() with the correct ID. So, for built-in tranches, use a hard-coded value instead. This is motivated by an upcoming patch adding further built-in tranches. Andres Freund and Robert Haas	2015-12-15 11:48:19 -05:00
Andres Freund	cca705a5d9	Fix bug in SetOffsetVacuumLimit() triggered by find_multixact_start() failure. Previously, if find_multixact_start() failed, SetOffsetVacuumLimit() would install 0 into MultiXactState->offsetStopLimit if it previously succeeded. Luckily, there are no known cases where find_multixact_start() will return an error in 9.5 and above. But if it were to happen, for example due to filesystem permission issues, it'd be somewhat bad: GetNewMultiXactId() could continue allocating mxids even if close to a wraparound, or it could erroneously stop allocating mxids, even if no wraparound is looming. The wrong value would be corrected the next time SetOffsetVacuumLimit() is called, or by a restart. Reported-By: Noah Misch, although this is not his preferred fix Discussion: 20151210140450.GA22278@alap3.anarazel.de Backpatch: 9.5, where the bug was introduced as part of 4f627f	2015-12-14 11:34:16 +01:00
Alvaro Herrera	69e7235c93	Fix commit timestamp initialization This module needs explicit initialization in order to replay WAL records in recovery, but we had broken this recently following changes to make other (stranger) scenarios work correctly. To fix, rework the initialization sequence so that it always takes place before WAL replay commences for both master and standby. I could have gone for a more localized fix that just added a "startup" call for the master server, but it seemed better to restructure the existing callers as well so that the whole thing made more sense. As a drawback, there is more control logic in xlog.c now than previously, but doing otherwise meant passing down the ControlFile flag, which seemed uglier as a whole. This also meant adding a check to not re-execute ActivateCommitTs if it had already been called. Reported by Fujii Masao. Backpatch to 9.5.	2015-12-11 14:30:43 -03:00
Peter Eisentraut	a351705d8a	Improve some messages	2015-12-10 22:05:27 -05:00
Andres Freund	e3f4cfc7aa	Fix bug leading to restoring unlogged relations from empty files. At the end of crash recovery, unlogged relations are reset to the empty state, using their init fork as the template. The init fork is copied to the main fork without going through shared buffers. Unfortunately WAL replay so far has not necessarily flushed writes from shared buffers to disk at that point. In normal crash recovery, and before the introduction of 'fast promotions' in `fd4ced523` / 9.3, the END_OF_RECOVERY checkpoint flushes the buffers out in time. But with fast promotions that's not the case anymore. To fix, force WAL writes targeting the init fork to be flushed immediately (using the new FlushOneBuffer() function). In 9.5+ that flush can centrally be triggered from the code dealing with restoring full page writes (XLogReadBufferForRedoExtended), in earlier releases that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay function. Backpatch to 9.1, even if this currently is only known to trigger in 9.3+. Flushing earlier is more robust, and it is advantageous to keep the branches similar. Typical symptoms of this bug are errors like 'ERROR: index "..." contains unexpected zero page at block 0' shortly after promoting a node. Reported-By: Thom Brown Author: Andres Freund and Michael Paquier Discussion: 20150326175024.GJ451@alap3.anarazel.de Backpatch: 9.1-	2015-12-10 16:29:26 +01:00
Alvaro Herrera	820ddb2c2f	Further tweak commit_timestamp behavior As pointed out by Fujii Masao, we weren't quite there on a standby behaving sanely: first because we were failing to acquire the correct state in the case where no XLOG_PARAMETER_CHANGE message was sent (because a checkpoint had already happened after the setting was changed in the master, and then the standby was restarted); and second because promoting the standby with the feature enabled failed to activate it if the master had the feature disabled. This patch fixes both those misbehaviors hopefully without re-introducing any old problems. Also change the hint emitted in a standby together with the error message about the feature being disabled, to make it point out that the place to chance the setting is the master. Otherwise, if the setting is already enabled in the standby, it is very confusing to have it say that the setting must be enabled ... Authors: Álvaro Herrera, Petr Jelínek. Backpatch to 9.5.	2015-12-03 19:22:31 -03:00
Peter Eisentraut	5db837d3f2	Message improvements	2015-11-16 21:39:23 -05:00
Robert Haas	fe702a7b3f	Move each SLRU's lwlocks to a separate tranche. This makes it significantly easier to identify these lwlocks in LWLOCK_STATS or Trace_lwlocks output. It's also arguably better from a modularity standpoint, since lwlock.c no longer needs to know anything about the LWLock needs of the higher-level SLRU facility. Ildus Kurbangaliev, reviewd by Álvaro Herrera and by me.	2015-11-12 14:59:09 -05:00
Robert Haas	64b2e7ad91	Pass extra data to bgworkers, and use this to fix parallel contexts. Up until now, the total amount of data that could be passed to a background worker at startup was one datum, which can be a small as 4 bytes on some systems. That's enough to pass a dsm_handle or an array index, but not much else. Add a bgw_extra flag to the BackgroundWorker struct, allowing up to 128 bytes to be passed to a new worker on any platform. Use this to fix a problem I recently discovered with the parallel context machinery added in 9.5: the master assigns each worker an array index, and each worker subsequently assigns itself an array index, and there's nothing to guarantee that the two sets of indexes match, leading to chaos. Normally, I would not back-patch the change to add bgw_extra, since it is basically a feature addition. However, since 9.5 is still in beta and there seems to be no other sensible way to repair the broken parallel context machinery, back-patch to 9.5. Existing background worker code can ignore the bgw_extra field without a problem, but might need to be recompiled since the structure size has changed. Report and patch by me. Review by Amit Kapila.	2015-11-05 12:13:56 -05:00
Robert Haas	3a1f8611f2	Update parallel executor support to reuse the same DSM. Commit `b0b0d84b3d` purported to make it possible to relaunch workers using the same parallel context, but it had an unpleasant race condition: we might reinitialize after the workers have sent their last control message but before they have dettached the DSM, leaving to crashes. Repair by introducing a new ParallelContext operation, ReinitializeParallelDSM. Adjust execParallel.c to use this new support, so that we can rescan a Gather node by relaunching workers but without needing to recreate the DSM. Amit Kapila, with some adjustments by me. Extracted from latest parallel sequential scan patch.	2015-10-30 10:44:54 +01:00
Peter Eisentraut	a8d585c091	Message style improvements Message style, plurals, quoting, spelling, consistency with similar messages	2015-10-28 20:38:36 -04:00
Alvaro Herrera	531d21b75f	Cleanup commit timestamp module activaction, again Further tweak commit_ts.c so that on a standby the state is completely consistent with what that in the master, rather than behaving differently in the cases that the settings differ. Now in standby and master the module should always be active or inactive in lockstep. Author: Petr Jelínek, with some further tweaks by Álvaro Herrera. Backpatch to 9.5, where commit timestamps were introduced. Discussion: http://www.postgresql.org/message-id/5622BF9D.2010409@2ndquadrant.com	2015-10-27 15:06:50 -03:00
Robert Haas	31ba62ce32	Fix typos in comments. CharSyam	2015-10-22 14:52:23 -04:00
Robert Haas	b0b0d84b3d	Allow a parallel context to relaunch workers. This may allow some callers to avoid the overhead involved in tearing down a parallel context and then setting up a new one, which means releasing the DSM and then allocating and populating a new one. I suspect we'll want to revise the Gather node to make use of this new capability, but even if not it may be useful elsewhere and requires very little additional code.	2015-10-16 17:18:05 -04:00
Robert Haas	a53c06a13e	Prohibit parallel query when the isolation level is serializable. In order for this to be safe, the code which hands true serializability will need to taught that the SIRead locks taken by a parallel worker pertain to the same transaction as those taken by the parallel leader. Some further changes may be needed as well. Until the necessary adaptations are made, don't generate parallel plans in serializable mode, and if a previously-generated parallel plan is used after serializable mode has been activated, run it serially. This fixes a bug in commit `7aea8e4f2d`.	2015-10-16 11:58:27 -04:00
Robert Haas	82b37765c7	Fix a problem with parallel workers being unable to restore role. check_role() tries to verify that the user has permission to become the requested role, but this is inappropriate in a parallel worker, which needs to exactly recreate the master's authorization settings. So skip the check in that case. This fixes a bug in commit `924bcf4f16`.	2015-10-16 11:37:19 -04:00
Robert Haas	6de6d96d97	Invalidate caches after cranking up a parallel worker transaction. Starting a parallel worker transaction changes our notion of which XIDs are in-progress or committed, and our notion of the current command counter ID. Therefore, our view of these caches prior to starting this transaction may no longer valid. Defend against that by clearing them. This fixes a bug in commit `924bcf4f16`.	2015-10-16 11:31:23 -04:00
Robert Haas	94b4f7e2a6	Tighten up application of parallel mode checks. Commit `924bcf4f16` failed to enforce parallel mode checks during the commit of a parallel worker, because we exited parallel mode prior to ending the transaction so that we could pop the active snapshot. Re-establish parallel mode during parallel worker commit. Without this, it's far too easy for unsafe actions during the pre-commit sequence to crash the server instead of hitting the error checks as intended. Just to be extra paranoid, adjust a couple of the sanity checks in xact.c to check not only IsInParallelMode() but also IsParallelWorker().	2015-10-16 09:59:57 -04:00
Robert Haas	423ec0877f	Transfer current command counter ID to parallel workers. Commit `924bcf4f16` correctly forbade parallel workers to modify the command counter while in parallel mode, but it inexplicably neglected to actually transfer the current command counter from leader to workers. This can result in the workers seeing a different set of tuples from the leader, which is bad. Repair.	2015-10-16 09:58:42 -04:00
Robert Haas	2ad5c27bb5	Don't send protocol messages to a shm_mq that no longer exists. Commit `2bd9e412f9` introduced a mechanism for relaying protocol messages from a background worker to another backend via a shm_mq. However, there was no provision for shutting down the communication channel. Therefore, a protocol message sent late in the shutdown sequence, such as a DEBUG message resulting from cranking up log_min_messages, could crash the server. To fix, install an on_dsm_detach callback that disables sending messages to the shm_mq when the associated DSM is detached.	2015-10-16 09:42:33 -04:00
Alvaro Herrera	e06b2e1d2e	Don't disable commit_ts in standby if enabled locally Bug noticed by Fujii Masao	2015-10-02 12:49:01 -03:00
Peter Eisentraut	87c2b517ac	Fix message punctuation according to style guide	2015-10-01 21:39:56 -04:00
Alvaro Herrera	f12e814b88	Fix commit_ts for standby Module initialization was still not completely correct after commit `6b61955135`, per crash report from Takashi Ohnishi. To fix, instead of trying to monkey around with the value of the GUC setting directly, add a separate boolean flag that enables the feature on a standby, but only for the startup (recovery) process, when it sees that its master server has the feature enabled. Discussion: http://www.postgresql.org/message-id/ca44c6c7f9314868bdc521aea4f77cbf@MP-MSGSS-MBX004.msg.nttdata.co.jp Also change the deactivation routine to delete all segment files rather than leaving the last one around. (This doesn't need separate WAL-logging, because on recovery we execute the same deactivation routine anyway.) In passing, clean up the code structure somewhat, particularly so that xlog.c doesn't know so much about when to activate/deactivate the feature. Thanks to Fujii Masao for testing and Petr Jelínek for off-list discussion. Back-patch to 9.5, where commit_ts was introduced.	2015-10-01 15:06:55 -03:00
Robert Haas	227d57f358	Don't dump core when destroying an unused ParallelContext. If a transaction or subtransaction creates a ParallelContext but ends without calling InitializeParallelDSM, the previous code would seg fault. Fix that.	2015-09-30 18:36:31 -04:00
Alvaro Herrera	6b61955135	Code review for transaction commit timestamps There are three main changes here: 1. No longer cause a start failure in a standby if the feature is disabled in postgresql.conf but enabled in the master. This reverts one part of commit 4f3924d9cd43; what we keep is the ability of the standby to activate/deactivate the module (which includes creating and removing segments as appropriate) during replay of such actions in the master. 2. Replay WAL records affecting commitTS even if the feature is disabled. This means the standby will always have the same state as the master after replay. 3. Have COMMIT PREPARE record the transaction commit time as well. We were previously only applying it in the normal transaction commit path. Author: Petr Jelínek Discussion: http://www.postgresql.org/message-id/CAHGQGwHereDzzzmfxEBYcVQu3oZv6vZcgu1TPeERWbDc+gQ06g@mail.gmail.com Discussion: http://www.postgresql.org/message-id/CAHGQGwFuzfO4JscM9LCAmCDCxp_MfLvN4QdB+xWsS-FijbjTYQ@mail.gmail.com Additionally, I cleaned up nearby code related to replication origins, which I found a bit hard to follow, and fixed a couple of typos. Backpatch to 9.5, where this code was introduced. Per bug reports from Fujii Masao and subsequent discussion.	2015-09-29 14:40:56 -03:00
Alvaro Herrera	17f5831c81	Fix "sesssion" typo It was introduced alongside replication origins, by commit `5aa2350426`, so backpatch to 9.5. Pointed out by Fujii Masao	2015-09-28 19:13:42 -03:00
Andres Freund	aa29c1ccd9	Remove legacy multixact truncation support. In 9.5 and master there is no need to support legacy truncation. This is just committed separately to make it easier to backpatch the WAL logged multixact truncation to 9.3 and 9.4 if we later decide to do so. I bumped master's magic from 0xD086 to 0xD088 and 9.5's from 0xD085 to 0xD087 to avoid 9.5 reusing a value that has been in use on master while keeping the numbers increasing between major versions. Discussion: 20150621192409.GA4797@alap3.anarazel.de Backpatch: 9.5	2015-09-26 19:04:25 +02:00
Andres Freund	4f627f8973	Rework the way multixact truncations work. The fact that multixact truncations are not WAL logged has caused a fair share of problems. Amongst others it requires to do computations during recovery while the database is not in a consistent state, delaying truncations till checkpoints, and handling members being truncated, but offset not. We tried to put bandaids on lots of these issues over the last years, but it seems time to change course. Thus this patch introduces WAL logging for multixact truncations. This allows: 1) to perform the truncation directly during VACUUM, instead of delaying it to the checkpoint. 2) to avoid looking at the offsets SLRU for truncation during recovery, we can just use the master's values. 3) simplify a fair amount of logic to keep in memory limits straight, this has gotten much easier During the course of fixing this a bunch of additional bugs had to be fixed: 1) Data was not purged from memory the member's SLRU before deleting segments. This happened to be hard or impossible to hit due to the interlock between checkpoints and truncation. 2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but that doesn't work for offsets that haven't yet been flushed to disk. Add code to flush the SLRUs to fix. Not pretty, but it feels slightly safer to only make decisions based on actual on-disk state. 3) find_multixact_start() could be called concurrently with a truncation and thus fail. Via SetOffsetVacuumLimit() that could lead to a round of emergency vacuuming. The problem remains in pg_get_multixact_members(), but that's quite harmless. For now this is going to only get applied to 9.5+, leaving the issues in the older branches in place. It is quite possible that we need to backpatch at a later point though. For the case this gets backpatched we need to handle that an updated standby may be replaying WAL from a not-yet upgraded primary. We have to recognize that situation and use "old style" truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to before, this now happens in the startup process, when replaying a checkpoint record, instead of the checkpointer. Doing truncation in the restartpoint is incorrect, they can happen much later than the original checkpoint, thereby leading to wraparound. To avoid "multixact_redo: unknown op code 48" errors standbys would have to be upgraded before primaries. A later patch will bump the WAL page magic, and remove the legacy truncation codepaths. Legacy truncation support is just included to make a possible future backpatch easier. Discussion: 20150621192409.GA4797@alap3.anarazel.de Reviewed-By: Robert Haas, Alvaro Herrera, Thomas Munro Backpatch: 9.5 for now	2015-09-26 19:04:25 +02:00
Peter Eisentraut	4a1e15e4a9	Add missing serial comma	2015-09-18 22:41:42 -04:00
Fujii Masao	10fbb79f1a	Improve log messages related to tablespace_map file This patch changes the log message which is logged when the server successfully renames backup_label file to *.old but fails to rename tablespace_map file during the shutdown. Previously the WARNING message "online backup mode was not canceled" was logged in that case. However this message is confusing because the backup mode is treated as canceled whenever backup_label is successfully renamed. So this commit makes the server log the message "online backup mode canceled" in that case. Also this commit changes errdetail messages so that they follow the error message style guide. Back-patch to 9.5 where tablespace_map file is introduced. Original patch by Amit Kapila, heavily modified by me.	2015-09-15 23:21:51 +09:00
Fujii Masao	96f6a0cb41	Remove files signaling a standby promotion request at postmaster startup This commit makes postmaster forcibly remove the files signaling a standby promotion request. Otherwise, the existence of those files can trigger a promotion too early, whether a user wants that or not. This removal of files is usually unnecessary because they can exist only during a few moments during a standby promotion. However there is a race condition: if pg_ctl promote is executed and creates the files during a promotion, the files can stay around even after the server is brought up to new master. Then, if new standby starts by using the backup taken from that master, the files can exist at the server startup and should be removed in order to avoid an unexpected promotion. Back-patch to 9.1 where promote signal file was introduced. Problem reported by Feike Steenbergen. Original patch by Michael Paquier, modified by me. Discussion: 20150528100705.4686.91426@wrigleys.postgresql.org	2015-09-09 22:51:44 +09:00
Heikki Linnakangas	c80b5f66c6	Fix misc typos. Oskari Saarenmaa. Backpatch to stable branches where applicable.	2015-09-05 11:35:49 +03:00
Tom Lane	c5454f99c4	Fix subtransaction cleanup after an outer-subtransaction portal fails. Formerly, we treated only portals created in the current subtransaction as having failed during subtransaction abort. However, if the error occurred while running a portal created in an outer subtransaction (ie, a cursor declared before the last savepoint), that has to be considered broken too. To allow reliable detection of which ones those are, add a bookkeeping field to struct Portal that tracks the innermost subtransaction in which each portal has actually been executed. (Without this, we'd end up failing portals containing functions that had called the subtransaction, thereby breaking plpgsql exception blocks completely.) In addition, when we fail an outer-subtransaction Portal, transfer its resources into the subtransaction's resource owner, so that they're released early in cleanup of the subxact. This fixes a problem reported by Jim Nasby in which a function executed in an outer-subtransaction cursor could cause an Assert failure or crash by referencing a relation created within the inner subtransaction. The proximate cause of the Assert failure is that AtEOSubXact_RelationCache assumed it could blow away a relcache entry without first checking that the entry had zero refcount. That was a bad idea on its own terms, so add such a check there, and to the similar coding in AtEOXact_RelationCache. This provides an independent safety measure in case there are still ways to provoke the situation despite the Portal-level changes. This has been broken since subtransactions were invented, so back-patch to all supported branches. Tom Lane and Michael Paquier	2015-09-04 13:37:14 -04:00
Fujii Masao	1ea5ce5c5f	Document that max_worker_processes must be high enough in standby. The setting values of some parameters including max_worker_processes must be equal to or higher than the values on the master. However, previously max_worker_processes was not listed as such parameter in the document. So this commit adds it to that list. Back-patch to 9.4 where max_worker_processes was added.	2015-09-03 22:30:16 +09:00
Alvaro Herrera	e68be16b0d	Do not allow timestamp to be passed as NULL The code had bugs that would cause crashes if NULL was passed as that argument (originally intended to mean not to bother returning its value), and after inspection it turns out that nothing seems interested in the case that ts is NULL anyway. Therefore, remove the partial checks intended to support that case. Author: Michael Paquier though I didn't include a proposed Assert. Backpatch to 9.5.	2015-08-21 14:36:54 -03:00
Andres Freund	e95126cf04	Don't use function definitions looking like old-style ones. This fixes a bunch of somewhat pedantic warnings with new compilers. Since by far the majority of other functions definitions use the (void) style it just seems to be consistent to do so as well in the remaining few places.	2015-08-15 17:25:00 +02:00
Andres Freund	18e8613564	Address points made in post-commit review of replication origins. Amit reviewed the replication origins patch and made some good points. Address them. This fixes typos in error messages, docs and comments and adds a missing error check (although in a should-never-happen scenario). Discussion: CAA4eK1JqUBVeWWKwUmBPryFaje4190ug0y-OAUHWQ6tD83V4xg@mail.gmail.com Backpatch: 9.5, where replication origins were introduced.	2015-08-07 15:09:05 +02:00
Robert Haas	0e141c0fbb	Reduce ProcArrayLock contention by removing backends in batches. When a write transaction commits, it must clear its XID advertised via the ProcArray, which requires that we hold ProcArrayLock in exclusive mode in order to prevent concurrent processes running GetSnapshotData from seeing inconsistent results. When many processes try to commit at once, ProcArrayLock must change hands repeatedly, with each concurrent process trying to commit waking up to acquire the lock in turn. To make things more efficient, when more than one backend is trying to commit a write transaction at the same time, have just one of them acquire ProcArrayLock in exclusive mode and clear the XIDs of all processes in the group. Benchmarking reveals that this is much more efficient at very high client counts. Amit Kapila, heavily revised by me, with some review also from Pavan Deolasee.	2015-08-06 12:02:12 -04:00
Fujii Masao	dd85acf0c4	Make recovery rename tablespace_map to *.old if backup_label is not present. If tablespace_map file is present without backup_label file, there is no use of such file. There is no harm in retaining it, but it is better to get rid of the map file so that we don't have any redundant file in data directory and it will avoid any sort of confusion. It seems prudent though to just rename the file out of the way rather than delete it completely, also we ignore any error that occurs in rename operation as even if map file is present without backup_label file, it is harmless. Back-patch to 9.5 where tablespace_map file was introduced. Amit Kapila, reviewed by Robert Haas, Alvaro Herrera and me.	2015-08-03 23:04:41 +09:00
Heikki Linnakangas	358cde320b	Fix race condition that lead to WALInsertLock deadlock with commit_delay. If a call to WaitForXLogInsertionsToFinish() returned a value in the middle of a page, and another backend then started to insert a record to the same page, and then you called WaitXLogInsertionsToFinish() again, the second call might return a smaller value than the first call. The problem was in GetXLogBuffer(), which always updated the insertingAt value to the beginning of the requested page, not the actual requested location. Because of that, the second call might return a xlog pointer to the beginning of the page, while the first one returned a later position on the same page. XLogFlush() performs two calls to WaitXLogInsertionsToFinish() in succession, and holds WALWriteLock on the second call, which can deadlock if the second call to WaitXLogInsertionsToFinish() blocks. Reported by Spiros Ioannou. Backpatch to 9.4, where the more scalable WALInsertLock mechanism, and this bug, was introduced.	2015-08-02 20:08:10 +03:00
Andres Freund	7039760114	Fix issues around the "variable" support in the lwlock infrastructure. The lwlock scalability work introduced two race conditions into the lwlock variable support provided for xlog.c. First, and harmlessly on most platforms, it set/read the variable without the spinlock in some places. Secondly, due to the removal of the spinlock, it was possible that a backend missed changes to the variable's state if it changed in the wrong moment because checking the lock's state, the variable's state and the queuing are not protected by a single spinlock acquisition anymore. To fix first move resetting the variable's from LWLockAcquireWithVar to WALInsertLockRelease, via a new function LWLockReleaseClearVar. That prevents issues around waiting for a variable's value to change when a new locker has acquired the lock, but not yet set the value. Secondly re-check that the variable hasn't changed after enqueing, that prevents the issue that the lock has been released and already re-acquired by the time the woken up backend checks for the lock's state. Reported-By: Jeff Janes Analyzed-By: Heikki Linnakangas Reviewed-By: Heikki Linnakangas Discussion: 5592DB35.2060401@iki.fi Backpatch: 9.5, where the lwlock scalability went in	2015-08-02 18:41:23 +02:00
Heikki Linnakangas	5e65f45c6e	Another attempt at fixing memory leak in xlogreader. max_block_id is also reset between reading records. Michael Paquier	2015-07-28 09:09:36 +03:00
Heikki Linnakangas	61a65c53bd	Fix memory leak in xlogreader facility. XLogReaderFree failed to free the per-block data buffers, when they happened to not be used by the latest read WAL record. Michael Paquier. Backpatch to 9.5, where the per-block buffers were added.	2015-07-27 18:29:31 +03:00
Heikki Linnakangas	766dcfb16c	Fix off-by-one error in calculating subtrans/multixact truncation point. If there were no subtransactions (or multixacts) active, we would calculate the oldestxid == next xid. That's correct, but if next XID happens to be on the next pg_subtrans (pg_multixact) page, the page does not exist yet, and SimpleLruTruncate will produce an "apparent wraparound" warning. The warning is harmless in this case, but looks very alarming to users. Backpatch to all supported versions. Patch and analysis by Thomas Munro.	2015-07-23 01:29:59 +03:00
Heikki Linnakangas	eb11de8ff5	Sanity-check that a page zeroed by redo routine is marked with WILL_INIT. There was already a sanity-check in the other direction: if a page was marked with WILL_INIT, it had to be initialized by the redo routine. It's not strictly necessary for correctness that a page is marked with WILL_INIT if it's going to be initialized at redo, but it's a missed optimization if nothing else. Fix a few instances of this issue in SP-GiST, where a block in WAL record was not marked with WILL_INIT, but was in fact always initialized at redo. We were creating a full-page image of the page unnecessarily in those cases. Backpatch to 9.5, where the new WILL_INIT flag was added.	2015-07-20 22:34:01 +03:00
Heikki Linnakangas	f92d6a540a	Use appendStringInfoString/Char et al where appropriate. Patch by David Rowley. Backpatch to 9.5, as some of the calls were new in 9.5, and keeping the code in sync with master makes future backpatching easier.	2015-07-02 12:36:03 +03:00
Fujii Masao	8217370864	Make XLogFileCopy() look the same as in 9.4. XLogFileCopy() was changed heavily in commit `de76884`. However it was partially reverted in commit `7abc685` and most of those changes to XLogFileCopy() were no longer needed. Then commit `7cbee7c` removed those unnecessary code, but XLogFileCopy() looked different in master and 9.4 though the contents are almost the same. This patch makes XLogFileCopy() look the same in master and back-branches, which makes back-patching easier, per discussion on pgsql-hackers. Back-patch to 9.5. Discussion: 55760844.7090703@iki.fi Michael Paquier	2015-07-01 10:54:47 +09:00
Heikki Linnakangas	d661532e27	Also trigger restartpoints based on max_wal_size on standby. When archive recovery and restartpoints were initially introduced, checkpoint_segments was ignored on the grounds that the files restored from archive don't consume any space in the recovery server. That was changed in later releases, but even then it was arguably a feature rather than a bug, as performing restartpoints as often as checkpoints during normal operation might be excessive, but you might nevertheless not want to waste a lot of space for pre-allocated WAL by setting checkpoint_segments to a high value. But now that we have separate min_wal_size and max_wal_size settings, you can bound WAL usage with max_wal_size, and still avoid consuming excessive space usage by setting min_wal_size to a lower value, so that argument is moot. There are still some issues with actually limiting the space usage to max_wal_size: restartpoints in recovery can only start after seeing the checkpoint record, while a checkpoint starts flushing buffers as soon as the redo-pointer is set. Restartpoint is paced to happen at the same leisurily speed, determined by checkpoint_completion_target, as checkpoints, but because they are started later, max_wal_size can be exceeded by upto one checkpoint cycle's worth of WAL, depending on checkpoint_completion_target. But that seems better than not trying at all, and max_wal_size is a soft limit anyway. The documentation already claimed that max_wal_size is obeyed in recovery, so this just fixes the behaviour to match the docs. However, add some weasel-words there to mention that max_wal_size may well be exceeded by some amount in recovery.	2015-06-29 00:09:10 +03:00
Heikki Linnakangas	a32c3ec893	Promote the assertion that XLogBeginInsert() is not called twice into ERROR. Seems like cheap insurance for WAL bugs. A spurious call to XLogBeginInsert() in itself would be fairly harmless, but if there is any data registered and the insertion is not completed/cancelled properly, there is a risk that the data ends up in a wrong WAL record. Per Jeff Janes's suggestion.	2015-06-28 22:30:39 +03:00
Robert Haas	8f15f74a44	Be more conservative about removing tablespace "symlinks". Don't apply rmtree(), which will gleefully remove an entire subtree, and don't even apply unlink() unless it's symlink or a directory, the only things that we expect to find. Amit Kapila, with minor tweaks by me, per extensive discussions involving Andrew Dunstan, Fujii Masao, and Heikki Linnakangas, at least some of whom also reviewed the code.	2015-06-26 15:53:13 -04:00
Andres Freund	667912aee6	Improve multixact emergency autovacuum logic. Previously autovacuum was not necessarily triggered if space in the members slru got tight. The first problem was that the signalling was tied to values in the offsets slru, but members can advance much faster. Thats especially a problem if old sessions had been around that previously prevented the multixact horizon to increase. Secondly the skipping logic doesn't work if the database was restarted after autovacuum was triggered - that knowledge is not preserved across restart. This is especially a problem because it's a common panic-reaction to restart the database if it gets slow to anti-wraparound vacuums. Fix the first problem by separating the logic for members from offsets. Trigger autovacuum whenever a multixact crosses a segment boundary, as the current member offset increases in irregular values, so we can't use a simple modulo logic as for offsets. Add a stopgap for the second problem, by signalling autovacuum whenver ERRORing out because of boundaries. Discussion: 20150608163707.GD20772@alap3.anarazel.de Backpatch into 9.3, where it became more likely that multixacts wrap around.	2015-06-21 18:57:28 +02:00
Andres Freund	90231cd518	Add missing check for wal_debug GUC. `9a20a9b2` added a new elog(), enabled when WAL_DEBUG is defined. The other WAL_DEBUG dependant messages check for the wal_debug GUC, but this one did not. While at it replace 'upto' with 'up to'. Discussion: 20150610110253.GF3832@alap3.anarazel.de Backpatch to 9.4, the first release containing `9a20a9b2`.	2015-06-21 18:37:09 +02:00
Robert Haas	ed16f73c57	Fix corner case in autovacuum-forcing logic for multixact wraparound. Since find_multixact_start() relies on SimpleLruDoesPhysicalPageExist(), and that function looks only at the on-disk state, it's possible for it to fail to find a page that exists in the in-memory SLRU that has not been written yet. If that happens, SetOffsetVacuumLimit() will erroneously decide to force emergency autovacuuming immediately. We should probably fix find_multixact_start() to consider the data cached in memory as well as on the on-disk state, but that's no excuse for SetOffsetVacuumLimit() to be stupid about the case where it can no longer read the value after having previously succeeded in doing so. Report by Andres Freund.	2015-06-19 11:28:30 -04:00
Alvaro Herrera	94232c909d	Fix typos tablesapce -> tablespace there -> their These were introduced in `72d422a52`, so no need to backpatch.	2015-06-08 15:37:42 -03:00
Fujii Masao	7abc685974	Refactor WAL segment copying code. * Remove unused argument "dstfname" and related code from XLogFileCopy(). * Previously XLogFileCopy() returned a pstrdup'd string so that InstallXLogFileSegment() used it later. Since the pstrdup'd string was never free'd, there could be a risk of memory leak. It was almost harmless because the startup process exited just after calling XLogFileCopy(), it existed. This commit changes XLogFileCopy() so that it directly calls InstallXLogFileSegment() and doesn't call pstrdup() at all. Which fixes that memory leak problem. * Extend InstallXLogFileSegment() so that the caller can specify the log level. Which allows us to emit an error when InstallXLogFileSegment() fails a disk file access like link() and rename(). Previously it was always logged with LOG level and additionally needed to be logged with ERROR when we wanted to treat it as an error. Michael Paquier	2015-06-09 03:03:24 +09:00
Andres Freund	d1b958218a	Allow HotStandbyActiveInReplay() to be called in single user mode. HotStandbyActiveInReplay, introduced in `061b079f`, only allowed WAL replay to happen in the startup process, missing the single user case. This buglet is fairly harmless as it only causes problems when single user mode in an assertion enabled build is used to replay a btree vacuum record. Backpatch to 9.2. `061b079f` was backpatched further, but the assertion was not.	2015-06-08 14:09:27 +02:00
Robert Haas	068cfadf9e	Cope with possible failure of the oldest MultiXact to exist. Recent commits, mainly `b69bf30b9b` and `53bb309d2d`, introduced mechanisms to protect against wraparound of the MultiXact member space: the number of multixacts that can exist at one time is limited to 2^32, but the total number of members in those multixacts is also limited to 2^32, and older code did not take care to enforce the second limit, potentially allowing old data to be overwritten while it was still needed. Unfortunately, these new mechanisms failed to account for the fact that the code paths in which they run might be executed during recovery or while the cluster was in an inconsistent state. Also, they failed to account for the fact that users who used pg_upgrade to upgrade a PostgreSQL version between 9.3.0 and 9.3.4 might have might oldestMultiXid = 1 in the control file despite the true value being larger. To fix these problems, first, avoid unnecessarily examining the mmembers of MultiXacts when the cluster is not known to be consistent. TruncateMultiXact has done this for a long time, and this patch does not fix that. But the new calls used to prevent member wraparound are not needed until we reach normal running, so avoid calling them earlier. (SetMultiXactIdLimit is actually called before InRecovery is set, so we can't rely on that; we invent our own multixact-specific flag instead.) Second, make failure to look up the members of a MultiXact a non-fatal error. Instead, if we're unable to determine the member offset at which wraparound would occur, postpone arming the member wraparound defenses until we are able to do so. If we're unable to determine the member offset that should force autovacuum, force it continuously until we are able to do so. If we're unable to deterine the member offset at which we should truncate the members SLRU, log a message and skip truncation. An important consequence of these changes is that anyone who does have a bogus oldestMultiXid = 1 value in pg_control will experience immediate emergency autovacuuming when upgrading to a release that contains this fix. The release notes should highlight this fact. If a user has no pg_multixact/offsets/0000 file, but has oldestMultiXid = 1 in the control file, they may wish to vacuum any tables with relminmxid = 1 prior to upgrading in order to avoid an immediate emergency autovacuum after the upgrade. This must be done with a PostgreSQL version 9.3.5 or newer and with vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age set to 0. This patch also adds an additional log message at each database server startup, indicating either that protections against member wraparound have been engaged, or that they have not. In the latter case, once autovacuum has advanced oldestMultiXid to a sane value, the message indicating that the guards have been engaged will appear at the next checkpoint. A few additional messages have also been added at the DEBUG1 level so that the correct operation of this code can be properly audited. Along the way, this patch fixes another, related bug in TruncateMultiXact that has existed since PostgreSQL 9.3.0: when no MultiXacts exist at all, the truncation code looks up NextMultiXactId, which doesn't exist yet. This can lead to TruncateMultiXact removing every file in pg_multixact/offsets instead of keeping one around, as it should. This in turn will cause the database server to refuse to start afterwards. Patch by me. Review by Álvaro Herrera, Andres Freund, Noah Misch, and Thomas Munro.	2015-06-05 09:31:57 -04:00
Tom Lane	d8179b001a	Fix fsync-at-startup code to not treat errors as fatal. Commit `2ce439f337` introduced a rather serious regression, namely that if its scan of the data directory came across any un-fsync-able files, it would fail and thereby prevent database startup. Worse yet, symlinks to such files also caused the problem, which meant that crash restart was guaranteed to fail on certain common installations such as older Debian. After discussion, we agreed that (1) failure to start is worse than any consequence of not fsync'ing is likely to be, therefore treat all errors in this code as nonfatal; (2) we should not chase symlinks other than those that are expected to exist, namely pg_xlog/ and tablespace links under pg_tblspc/. The latter restriction avoids possibly fsync'ing a much larger part of the filesystem than intended, if the user has left random symlinks hanging about in the data directory. This commit takes care of that and also does some code beautification, mainly moving the relevant code into fd.c, which seems a much better place for it than xlog.c, and making sure that the conditional compilation for the pre_sync_fname pass has something to do with whether pg_flush_data works. I also relocated the call site in xlog.c down a few lines; it seems a bit silly to be doing this before ValidateXLOGDirectoryStructure(). The similar logic in initdb.c ought to be made to match this, but that change is noncritical and will be dealt with separately. Back-patch to all active branches, like the prior commit. Abhijit Menon-Sen and Tom Lane	2015-05-28 17:33:03 -04:00
Bruce Momjian	807b9e0dff	pgindent run for 9.5	2015-05-23 21:35:49 -04:00
Tom Lane	72809480d6	Fix incorrect snprintf() limit. Typo in commit `7cbee7c0a`. No practical effect since the buffer should never actually be overrun, but various compilers and static analyzers will whine about it. Petr Jelinek	2015-05-23 16:05:52 -04:00
Heikki Linnakangas	7cbee7c0a1	At promotion, don't leave behind a partial segment on the old timeline. With commit `de768844`, a copy of the partial segment was archived with the .partial suffix, but the original file was still left in pg_xlog, so it didn't actually solve the problems with archiving the partial segment that it was supposed to solve. With this patch, the partial segment is renamed rather than copied, so we only archive it with the .partial suffix. Also be more robust in detecting if the last segment is already being archived. Previously I used XLogArchiveIsBusy() for that, but that's not quite right. With archive_mode='always', there might be a .ready file for it, and we don't want to rename it to .partial in that case. The old segment is needed until we're fully committed to the new timeline, i.e. until we've written the end-of-recovery WAL record and updated the min recovery point and timeline in the control file. So move the renaming later in the startup sequence, after all that's been done.	2015-05-22 11:04:33 +03:00
Fujii Masao	85d0e661aa	Make recovery_target_action = pause work. Previously even if recovery_target_action was set to pause and the recovery target was reached, the recovery could never be paused. Because the setting of pause was always overridden with that of shutdown unexpectedly. This override is valid and intentional if hot_standby is not enabled because there is no way to resume the paused recovery in this case and the setting of pause is completely useless. But not if hot_standby is enabled. This patch changes the code so that the setting of pause is overridden with that of shutdown only when hot_standby is not enabled. Bug reported by Andres Freund	2015-05-21 13:56:17 +09:00
Heikki Linnakangas	fa60fb63e5	Fix more typos in comments. Patch by CharSyam, plus a few more I spotted with grep.	2015-05-20 19:45:43 +03:00
Heikki Linnakangas	4fc72cc7bb	Collection of typo fixes. Use "a" and "an" correctly, mostly in comments. Two error messages were also fixed (they were just elogs, so no translation work required). Two function comments in pg_proc.h were also fixed. Etsuro Fujita reported one of these, but I found a lot more with grep. Also fix a few other typos spotted while grepping for the a/an typos. For example, "consists out of ..." -> "consists of ...". Plus a "though"/ "through" mixup reported by Euler Taveira. Many of these typos were in old code, which would be nice to backpatch to make future backpatching easier. But much of the code was new, and I didn't feel like crafting separate patches for each branch. So no backpatching.	2015-05-20 16:56:22 +03:00
Simon Riggs	f6a54fefc2	Fix spelling in comment	2015-05-19 18:37:46 -04:00
Magnus Hagander	3b075e9d7b	Fix typos in comments Dmitriy Olshevskiy	2015-05-17 14:58:04 +02:00
Peter Eisentraut	e6dc503445	Fix whitespace	2015-05-16 20:43:32 -04:00
Heikki Linnakangas	ffd37740ee	Add archive_mode='always' option. In 'always' mode, the standby independently archives all files it receives from the primary. Original patch by Fujii Masao, docs and review by me.	2015-05-15 18:55:24 +03:00
Andrew Dunstan	72d422a522	Map basebackup tablespaces using a tablespace_map file Windows can't reliably restore symbolic links from a tar format, so instead during backup start we create a tablespace_map file, which is used by the restoring postgres to create the correct links in pg_tblspc. The backup protocol also now has an option to request this file to be included in the backup stream, and this is used by pg_basebackup when operating in tar mode. This is done on all platforms, not just Windows. This means that pg_basebackup will not not work in tar mode against 9.4 and older servers, as this protocol option isn't implemented there. Amit Kapila, reviewed by Dilip Kumar, with a little editing from me.	2015-05-12 09:29:10 -04:00
Robert Haas	b4d4ce1d50	Increase threshold for multixact member emergency autovac to 50%. Analysis by Noah Misch shows that the 25% threshold set by commit `53bb309d2d` is lower than any other, similar autovac threshold. While we don't know exactly what value will be optimal for all users, it is better to err a little on the high side than on the low side. A higher value increases the risk that users might exhaust the available space and start seeing errors before autovacuum can clean things up sufficiently, but a user who hits that problem can compensate for it by reducing autovacuum_multixact_freeze_max_age to a value dependent on their average multixact size. On the flip side, if the emergency cap imposed by that patch kicks in too early, the user will experience excessive wraparound scanning and will be unable to mitigate that problem by configuration. The new value will hopefully reduce the risk of such bad experiences while still providing enough headroom to avoid multixact member exhaustion for most users. Along the way, adjust the documentation to reflect the effects of commit `04e6d3b877`, which taught autovacuum to run for multixact wraparound even when autovacuum is configured off.	2015-05-11 12:15:50 -04:00
Robert Haas	04e6d3b877	Even when autovacuum=off, force it for members as we do in other cases. Thomas Munro, with some adjustments by me.	2015-05-11 10:51:14 -04:00
Robert Haas	f6a6c46d7f	Advance the stop point for multixact offset creation only at checkpoint. Commit `b69bf30b9b` advanced the stop point at vacuum time, but this has subsequently been shown to be unsafe as a result of analysis by myself and Thomas Munro and testing by Thomas Munro. The crux of the problem is that the SLRU deletion logic may get confused about what to remove if, at exactly the right time during the checkpoint process, the head of the SLRU crosses what used to be the tail. This patch, by me, fixes the problem by advancing the stop point only following a checkpoint. This has the additional advantage of making the removal logic work during recovery more like the way it works during normal running, which is probably good. At least one of the calls to DetermineSafeOldestOffset which this patch removes was already dead, because MultiXactAdvanceOldest is called only during recovery and DetermineSafeOldestOffset was set up to do nothing during recovery. That, however, is inconsistent with the principle that recovery and normal running should work similarly, and was confusing to boot. Along the way, fix some comments that previous patches in this area neglected to update. It's not clear to me whether there's any concrete basis for the decision to use only half of the multixact ID space, but it's neither necessary nor sufficient to prevent multixact member wraparound, so the comments should not say otherwise.	2015-05-10 22:21:20 -04:00
Robert Haas	312747c224	Fix DetermineSafeOldestOffset for the case where there are no mxacts. Commit `b69bf30b9b` failed to take into account the possibility that there might be no multixacts in existence at all. Report by Thomas Munro; patch by me.	2015-05-10 21:34:26 -04:00
Heikki Linnakangas	de7688442f	At promotion, archive last segment from old timeline with .partial suffix. Previously, we would archive the possible-incomplete WAL segment with its normal filename, but that causes trouble if the server owning that timeline is still running, and tries to archive the same segment later. It's not nice for the standby to trip up the master's archival like that. And it's pretty confusing, anyway, to have an incomplete segment in the archive that's indistinguishable from a normal, complete segment. To avoid such confusion, add a .partial suffix to the file. Or to be more precise, make a copy of the old segment under the .partial suffix, and archive that instead of the original file. pg_receivexlog also uses the .partial suffix for the same purpose, to tell apart incompletely streamed files from complete ones. There is no automatic mechanism to use the .partial files at recovery, so they will go unused, unless the administrator manually copies to them to the pg_xlog directory (and removes the .partial suffix). Recovery won't normally need the WAL - when recovering to the new timeline, it will find the same WAL on the first segment on the new timeline instead - but it nevertheless feels better to archive the file with the .partial suffix, for debugging purposes if nothing else.	2015-05-08 21:59:01 +03:00
Heikki Linnakangas	179cdd0981	Add macros to check if a filename is a WAL segment or other such file. We had many instances of the strlen + strspn combination to check for that. This makes the code a bit easier to read.	2015-05-08 21:58:57 +03:00
Peter Eisentraut	16c73e773b	Fix whitespace	2015-05-08 14:45:53 -04:00
Robert Haas	53bb309d2d	Teach autovacuum about multixact member wraparound. The logic introduced in commit `b69bf30b9b` and repaired in commits `669c7d20e6` and `7be47c56af` helps to ensure that we don't overwrite old multixact member information while it is still needed, but a user who creates many large multixacts can still exhaust the member space (and thus start getting errors) while autovacuum stands idly by. To fix this, progressively ramp down the effective value (but not the actual contents) of autovacuum_multixact_freeze_max_age as member space utilization increases. This makes autovacuum more aggressive and also reduces the threshold for a manual VACUUM to perform a full-table scan. This patch leaves unsolved the problem of ensuring that emergency autovacuums are triggered even when autovacuum=off. We'll need to fix that via a separate patch. Thomas Munro and Robert Haas	2015-05-08 12:53:00 -04:00
Robert Haas	7be47c56af	Fix incorrect math in DetermineSafeOldestOffset. The old formula didn't have enough parentheses, so it would do the wrong thing, and it used / rather than % to find a remainder. The effect of these oversights is that the stop point chosen by the logic introduced in commit `b69bf30b9b` might be rather meaningless. Thomas Munro, reviewed by Kevin Grittner, with a whitespace tweak by me.	2015-05-07 11:19:31 -04:00
Robert Haas	1998261034	Avoid using a C++ keyword as a structure member name. Per request from Peter Eisentraut.	2015-05-05 22:41:03 -04:00
Robert Haas	2ce439f337	Recursively fsync() the data directory after a crash. Otherwise, if there's another crash, some writes from after the first crash might make it to disk while writes from before the crash fail to make it to disk. This could lead to data corruption. Back-patch to all supported versions. Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised by me.	2015-05-04 14:13:53 -04:00
Heikki Linnakangas	ec3d976bce	Fix the same-rel optimization when creating WAL records. prev_regbuf was never set, and therefore the same-rel flag was never set on WAL records. Report and fix by Zhanq Zq	2015-05-04 21:03:36 +03:00
Robert Haas	924bcf4f16	Create an infrastructure for parallel computation in PostgreSQL. This does four basic things. First, it provides convenience routines to coordinate the startup and shutdown of parallel workers. Second, it synchronizes various pieces of state (e.g. GUCs, combo CID mappings, transaction snapshot) from the parallel group leader to the worker processes. Third, it prohibits various operations that would result in unsafe changes to that state while parallelism is active. Finally, it propagates events that would result in an ErrorResponse, NoticeResponse, or NotifyResponse message being sent to the client from the parallel workers back to the master, from which they can then be sent on to the client. Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke. Suggestions and review from Andres Freund, Heikki Linnakangas, Noah Misch, Simon Riggs, Euler Taveira, and Jim Nasby.	2015-04-30 15:02:14 -04:00
Alvaro Herrera	669c7d20e6	Fix pg_upgrade's multixact handling (again) We need to create the pg_multixact/offsets file deleted by pg_upgrade much earlier than we originally were: it was in TrimMultiXact(), which runs after we exit recovery, but it actually needs to run earlier than the first call to SetMultiXactIdLimit (before recovery), because that routine already wants to read the first offset segment. Per pg_upgrade trouble report from Jeff Janes. While at it, silence a compiler warning about a pointless assert that an unsigned variable was being tested non-negative. This was a signed constant in Thomas Munro's patch which I changed to unsigned before commit. Pointed out by Andres Freund.	2015-04-30 13:55:06 -03:00
Andres Freund	5aa2350426	Introduce replication progress tracking infrastructure. When implementing a replication solution ontop of logical decoding, two related problems exist: * How to safely keep track of replication progress * How to change replication behavior, based on the origin of a row; e.g. to avoid loops in bi-directional replication setups The solution to these problems, as implemented here, consist out of three parts: 1) 'replication origins', which identify nodes in a replication setup. 2) 'replication progress tracking', which remembers, for each replication origin, how far replay has progressed in a efficient and crash safe manner. 3) The ability to filter out changes performed on the behest of a replication origin during logical decoding; this allows complex replication topologies. E.g. by filtering all replayed changes out. Most of this could also be implemented in "userspace", e.g. by inserting additional rows contain origin information, but that ends up being much less efficient and more complicated. We don't want to require various replication solutions to reimplement logic for this independently. The infrastructure is intended to be generic enough to be reusable. This infrastructure also replaces the 'nodeid' infrastructure of commit timestamps. It is intended to provide all the former capabilities, except that there's only 2^16 different origins; but now they integrate with logical decoding. Additionally more functionality is accessible via SQL. Since the commit timestamp infrastructure has also been introduced in 9.5 (commit `73c986add`) changing the API is not a problem. For now the number of origins for which the replication progress can be tracked simultaneously is determined by the max_replication_slots GUC. That GUC is not a perfect match to configure this, but there doesn't seem to be sufficient reason to introduce a separate new one. Bumps both catversion and wal page magic. Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer Discussion: 20150216002155.GI15326@awork2.anarazel.de, 20140923182422.GA15776@alap3.anarazel.de, 20131114172632.GE7522@alap2.anarazel.de	2015-04-29 19:30:53 +02:00
Alvaro Herrera	d3821e70c9	Code review for multixact bugfix Reword messages, rename a confusingly named function. Per Robert Haas.	2015-04-28 14:52:29 -03:00
Alvaro Herrera	b69bf30b9b	Protect against multixact members wraparound Multixact member files are subject to early wraparound overflow and removal: if the average multixact size is above a certain threshold (see note below) the protections against offset overflow are not enough: during multixact truncation at checkpoint time, some pg_multixact/members files would be removed because the server considers them to be old and not needed anymore. This leads to loss of files that are critical to interpret existing tuples's Xmax values. To protect against this, since we don't have enough info in pg_control and we can't modify it in old branches, we maintain shared memory state about the oldest value that we need to keep; we use this during new multixact creation to abort if an old still-needed file would get overwritten. This value is kept up to date by checkpoints, which makes it not completely accurate but should be good enough. We start emitting warnings sometime earlier, so that the eventual multixact-shutdown doesn't take DBAs completely by surprise (more precisely: once 20 members SLRU segments are remaining before shutdown.) On troublesome average multixact size: The threshold size depends on the multixact freeze parameters. The oldest age is related to the greater of multixact_freeze_table_age and multixact_freeze_min_age: anything older than that should be removed promptly by autovacuum. If autovacuum is keeping up with multixact freezing, the troublesome multixact average size is (2^32-1) / Max(freeze table age, freeze min age) or around 28 members per multixact. Having an average multixact size larger than that will eventually cause new multixact data to overwrite the data area for older multixacts. (If autovacuum is not able to keep up, or there are errors in vacuuming, the actual maximum is multixact_freeeze_max_age instead, at which point multixact generation is stopped completely. The default value for this limit is 400 million, which means that the multixact size that would cause trouble is about 10 members). Initial bug report by Timothy Garnett, bug #12990 Backpatch to 9.3, where the problem was introduced. Authors: Álvaro Herrera, Thomas Munro Reviews: Thomas Munro, Amit Kapila, Robert Haas, Kevin Grittner	2015-04-28 11:32:53 -03:00
Andres Freund	6aab1f45ac	Fix various typos and grammar errors in comments. Author: Dmitriy Olshevskiy Discussion: 553D00A6.4090205@bk.ru	2015-04-26 18:42:31 +02:00
Heikki Linnakangas	2c47fe16a7	Fix deadlock at startup, if max_prepared_transactions is too small. When the startup process recovers transactions by scanning pg_twophase directory, it should clear MyLockedGxact after it's done processing each transaction. Like we do during normal operation, at PREPARE TRANSACTION. Otherwise, if the startup process exits due to an error, it will try to clear the locking_backend field of the last recovered transaction. That's usually harmless, but if the error happens in MarkAsPreparing, while holding TwoPhaseStateLock, the shmem-exit hook will try to acquire TwoPhaseStateLock again, and deadlock with itself. This fixes bug #13128 reported by Grant McAlister. The bug was introduced by commit `bb38fb0d`, so backpatch to all supported versions like that commit.	2015-04-23 21:39:35 +03:00
Heikki Linnakangas	3d80a1e0e3	Fix logic to skip checkpoint if no records have been inserted. After the WAL format changes, the calculation of the size of a checkpoint record became incorrect. Instead of trying to fix the math, check that the previous record, i.e. the xl_prev value that we'd write for the next record, matches the last checkpoint's redo pointer. That way it's not dependent on the size of the checkpoint record at all. The old logic was actually slightly wrong all along: if the previous checkpoint record crossed a page boundary, the page headers threw off the record size calculation, and the checkpoint was not skipped. The new checkpoint would not cross a page boundary, so this only resulted in at most one extra checkpoint after the system became idle. The new logic fixes that. (It's not worth fixing in backbranches). However, it makes some sense to try to keep the latest checkpoint contained fully in a page, or at least in a single WAL segment, just on general robustness grounds. If something goes awfully wrong, it's more likely that you can recover the latest WAL segment, than the last two WAL segments. So I added an extra check that the checkpoint is not skipped if the previous checkpoint crossed a WAL segment. Reported by Jeff Janes.	2015-04-15 17:21:04 +03:00
Alvaro Herrera	0a52fafce4	Fix typo in comment SLRU_SEGMENTS_PER_PAGE -> SLRU_PAGES_PER_SEGMENT I introduced this ancient typo in subtrans.c and later propagated it to multixact.c. I fixed the latter in `f741300c`, but only back to 9.3; backpatch to all supported branches for consistency.	2015-04-14 12:12:18 -03:00
Heikki Linnakangas	4f700bcd20	Reorganize our CRC source files again. Now that we use CRC-32C in WAL and the control file, the "traditional" and "legacy" CRC-32 variants are not used in any frontend programs anymore. Move the code for those back from src/common to src/backend/utils/hash. Also move the slicing-by-8 implementation (back) to src/port. This is in preparation for next patch that will add another implementation that uses Intel SSE 4.2 instructions to calculate CRC-32C, where available.	2015-04-14 17:03:42 +03:00
Heikki Linnakangas	b2a5545bd6	Don't archive bogus recycled or preallocated files after timeline switch. After a timeline switch, we would leave behind recycled WAL segments that are in the future, but on the old timeline. After promotion, and after they become old enough to be recycled again, we would notice that they don't have a .ready or .done file, create a .ready file for them, and archive them. That's bogus, because the files contain garbage, recycled from an older timeline (or prealloced as zeros). We shouldn't archive such files. This could happen when we're following a timeline switch during replay, or when we switch to new timeline at end-of-recovery. To fix, whenever we switch to a new timeline, scan the data directory for WAL segments on the old timeline, but with a higher segment number, and remove them. Those don't belong to our timeline history, and are most likely bogus recycled or preallocated files. They could also be valid files that we streamed from the primary ahead of time, but in any case, they're not needed to recover to the new timeline.	2015-04-13 16:53:49 +03:00
Alvaro Herrera	27846f02c1	Optimize locking a tuple already locked by another subxact Locking and updating the same tuple repeatedly led to some strange multixacts being created which had several subtransactions of the same parent transaction holding locks of the same strength. However, once a subxact of the current transaction holds a lock of a given strength, it's not necessary to acquire the same lock again. This made some coding patterns much slower than required. The fix is twofold. First we change HeapTupleSatisfiesUpdate to return HeapTupleBeingUpdated for the case where the current transaction is already a single-xid locker for the given tuple; it used to return HeapTupleMayBeUpdated for that case. The new logic is simpler, and the change to pgrowlocks is a testament to that: previously we needed to check for the single-xid locker separately in a very ugly way. That test is simpler now. As fallout from the HTSU change, some of its callers need to be amended so that tuple-locked-by-own-transaction is taken into account in the BeingUpdated case rather than the MayBeUpdated case. For many of them there is no difference; but heap_delete() and heap_update now check explicitely and do not grab tuple lock in that case. The HTSU change also means that routine MultiXactHasRunningRemoteMembers introduced in commit `11ac4c73cb` is no longer necessary and can be removed; the case that used to require it is now handled naturally as result of the changes to heap_delete and heap_update. The second part of the fix to the performance issue is to adjust heap_lock_tuple to avoid the slowness: 1. Previously we checked for the case that our own transaction already held a strong enough lock and returned MayBeUpdated, but only in the multixact case. Now we do it for the plain Xid case as well, which saves having to LockTuple. 2. If the current transaction is the only locker of the tuple (but with a lock not as strong as what we need; otherwise it would have been caught in the check mentioned above), we can skip sleeping on the multixact, and instead go straight to create an updated multixact with the additional lock strength. 3. Most importantly, make sure that both the single-xid-locker case and the multixact-locker case optimization are applied always. We do this by checking both in a single place, rather than them appearing in two separate portions of the routine -- something that is made possible by the HeapTupleSatisfiesUpdate API change. Previously we would only check for the single-xid case when HTSU returned MayBeUpdated, and only checked for the multixact case when HTSU returned BeingUpdated. This was at odds with what HTSU actually returned in one case: if our own transaction was locker in a multixact, it returned MayBeUpdated, so the optimization never applied. This is what led to the large multixacts in the first place. Per bug report #8470 by Oskari Saarenmaa.	2015-04-10 13:47:15 -03:00
Fujii Masao	6e4bf4ecd3	Fix error handling of XLogReaderAllocate in case of OOM Similarly to previous fix `9b8d478`, commit `2c03216` has switched XLogReaderAllocate() to use a set of palloc calls instead of malloc, causing any callers of this function to fail with an error instead of receiving a NULL pointer in case of out-of-memory error. Fix this by using palloc_extended with MCXT_ALLOC_NO_OOM that will safely return NULL in case of an OOM. Michael Paquier, slightly modified by me.	2015-04-03 21:55:37 +09:00
Fujii Masao	9b8d4782ba	Rework handling of OOM when allocating record buffer in XLOG reader. Commit `2c03216` changed allocate_recordbuf() so that it uses a palloc to allocate the read buffer and fails immediately when an out-of-memory error shows up, even though its callers still expect that NULL is returned in that case. This bug is fixed making allocate_recordbuf() use a palloc_extended with MCXT_ALLOC_NO_OOM flag and return NULL in OOM case. Michael Paquier	2015-04-03 18:29:38 +09:00
Andres Freund	62e2a8dc2c	Define integer limits independently from the system definitions. In `83ff1618` we defined integer limits iff they're not provided by the system. That turns out not to be the greatest idea because there's different ways some datatypes can be represented. E.g. on OSX PG's 64bit datatype will be a 'long int', but OSX unconditionally uses 'long long'. That disparity then can lead to warnings, e.g. around printf formats. One way to fix that would be to back int64 using stdint.h's int64_t. While a good idea it's not that easy to implement. We would e.g. need to include stdint.h in our external headers, which we don't today. Also computing the correct int64 printf formats in that case is nontrivial. Instead simply prefix the integer limits with PG_ and define them unconditionally. I've adjusted all the references to them in code, but not the ones in comments; the latter seems unnecessary to me. Discussion: 20150331141423.GK4878@alap3.anarazel.de	2015-04-02 17:43:35 +02:00
Tom Lane	785941cdc3	Tweak __attribute__-wrapping macros for better pgindent results. This improves on commit `bbfd7edae5` by making two simple changes: * pg_attribute_noreturn now takes parentheses, ie pg_attribute_noreturn(). Likewise pg_attribute_unused(), pg_attribute_packed(). This reduces pgindent's tendency to misformat declarations involving them. * attributes are now always attached to function declarations, not definitions. Previously some places were taking creative shortcuts, which were not merely candidates for bad misformatting by pgindent but often were outright wrong anyway. (It does little good to put a noreturn annotation where callers can't see it.) In any case, if we would like to believe that these macros can be used with non-gcc compilers, we should avoid gratuitous variance in usage patterns. I also went through and manually improved the formatting of a lot of declarations, and got rid of excessively repetitive (and now obsolete anyway) comments informing the reader what pg_attribute_printf is for.	2015-03-26 14:03:25 -04:00
Andres Freund	83ff1618bc	Centralize definition of integer limits. Several submitted and even committed patches have run into the problem that C89, our baseline, does not provide minimum/maximum values for various integer datatypes. C99's stdint.h does, but we can't rely on it. Several parts of the code defined limits locally, so instead centralize the definitions to c.h. This patch also changes the more obvious usages of literal limit values; there's more places that could be changed, but it's less clear whether it's beneficial to change those. Author: Andrew Gierth Discussion: 87619tc5wc.fsf@news-spur.riddles.org.uk	2015-03-25 22:39:42 +01:00
Andres Freund	87cec51d3a	Don't delay replication for less than recovery_min_apply_delay's resolution. Recovery delays are implemented by waiting on a latch, and latches take milliseconds as a parameter. The required amount of waiting was computed using microsecond resolution though and the wait loop's abort condition was checking the delay in microseconds as well. This could lead to short spurts of busy looping when the overall wait time was below a millisecond, but above 0 microseconds. Instead just formulate the wait loop's abort condition in millisecond granularity as well. Given that that's recovery_min_apply_delay resolution, it seems harmless to not wait for less than a millisecond. Backpatch to 9.4 where recovery_min_apply_delay was introduced. Discussion: 20150323141819.GH26995@alap3.anarazel.de	2015-03-23 16:51:11 +01:00
Andres Freund	a1105c3dd4	Fix copy & paste error in `4f1b890b13`. Due to the bug delayed standbys would not delay when applying prepared transactions. Discussion: CAB7nPqT6BO1cCn+sAyDByBxA4EKZNAiPi2mFJ=ANeZmnmewRyg@mail.gmail.com Michael Paquier via Coverity.	2015-03-23 15:53:40 +01:00
Andres Freund	4f1b890b13	Merge the various forms of transaction commit & abort records. Since `465883b0a` two versions of commit records have existed. A compact version that was used when no cache invalidations, smgr unlinks and similar were needed, and a full version that could deal with all that. Additionally the full version was embedded into twophase commit records. That resulted in a measurable reduction in the size of the logged WAL in some workloads. But more recently additions like logical decoding, which e.g. needs information about the database something was executed on, made it applicable in fewer situations. The static split generally made it hard to expand the commit record, because concerns over the size made it hard to add anything to the compact version. Additionally it's not particularly pretty to have twophase.c insert RM_XACT records. Rejigger things so that the commit and abort records only have one form each, including the twophase equivalents. The presence of the various optional (in the sense of not being in every record) pieces is indicated by a bits in the 'xinfo' flag. That flag previously was not included in compact commit records. To prevent an increase in size due to its presence, it's only included if necessary; signalled by a bit in the xl_info bits available for xact.c, similar to heapam.c's XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE. Twophase commit/aborts are now the same as their normal counterparts. The original transaction's xid is included in an optional data field. This means that commit records generally are smaller, except in the case of a transaction with subtransactions, but no other special cases; the increase there is four bytes, which seems acceptable given that the more common case of not having subtransactions shrank. The savings are especially measurable for twophase commits, which previously always used the full version; but will in practice only infrequently have required that. The motivation for this work are not the space savings and and deduplication though; it's that it makes it easier to extend commit records with additional information. That's just a few lines of code now; without impacting the common case where that information is not needed. Discussion: 20150220152150.GD4149@awork2.anarazel.de, 235610.92468.qm%40web29004.mail.ird.yahoo.com Reviewed-By: Heikki Linnakangas, Simon Riggs	2015-03-15 17:37:07 +01:00
Andres Freund	a0f5954af1	Increase max_wal_size's default from 128MB to 1GB. The introduction of min_wal_size & max_wal_size in `88e9823026` makes it feasible to increase the default upper bound in checkpoint size. Previously raising the default would lead to a increased disk footprint, even if more segments weren't beneficial. The low default of checkpoint size is one of common performance problem users have thus increasing the default makes sense. Setups where the increase in maximum disk usage is a problem will very likely have to run with a modified configuration anyway. Discussion: 54F4EFB8.40202@agliodbs.com, CA+TgmoZEAgX5oMGJOHVj8L7XOkAe05Gnf45rP40m-K3FhZRVKg@mail.gmail.com Author: Josh Berkus, after a discussion involving lots of people.	2015-03-15 17:37:07 +01:00
Andres Freund	51c11a7025	Remove pause_at_recovery_target recovery.conf setting. The new recovery_target_action (introduced in aedccb1f6/b8e33a85d4) replaces it's functionality. Having both seems likely to cause more confusion than it saves worry due to the incompatibility. Discussion: 5484FC53.2060903@2ndquadrant.com Author: Petr Jelinek	2015-03-15 17:37:07 +01:00
Fujii Masao	cd6c45cbee	Suppress maybe-uninitialized compiler warnings. Previously some compilers were thinking that the variables that `57aa5b2` added maybe-uninitialized. Spotted by Andres Freund	2015-03-15 10:40:43 +09:00
Andres Freund	bbfd7edae5	Add macros wrapping all usage of gcc's __attribute__. Until now __attribute__() was defined to be empty for all compilers but gcc. That's problematic because it prevents using it in other compilers; which is necessary e.g. for atomics portability. It's also just generally dubious to do so in a header as widely included as c.h. Instead add pg_attribute_format_arg, pg_attribute_printf, pg_attribute_noreturn macros which are implemented in the compilers that understand them. Also add pg_attribute_noreturn and pg_attribute_packed, but don't provide fallbacks, since they can affect functionality. This means that external code that, possibly unwittingly, relied on __attribute__ defined to be empty on !gcc compilers may now run into warnings or errors on those compilers. But there shouldn't be many occurances of that and it's hard to work around... Discussion: 54B58BA3.8040302@ohmu.fi Author: Oskari Saarenmaa, with some minor changes by me.	2015-03-11 14:30:01 +01:00
Fujii Masao	57aa5b2bb1	Add GUC to enable compression of full page images stored in WAL. When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server compresses a full page image written to WAL when full_page_writes is on or during a base backup. A compressed page image will be decompressed during WAL replay. Turning this parameter on can reduce the WAL volume without increasing the risk of unrecoverable data corruption, but at the cost of some extra CPU spent on the compression during WAL logging and on the decompression during WAL replay. This commit changes the WAL format (so bumping WAL version number) so that the one-byte flag indicating whether a full page image is compressed or not is included in its header information. This means that the commit increases the WAL volume one-byte per a full page image even if WAL compression is not used at all. We can save that one-byte by borrowing one-bit from the existing field like hole_offset in the header and using it as the flag, for example. But which would reduce the code readability and the extensibility of the feature. Per discussion, it's not worth paying those prices to save only one-byte, so we decided to add the one-byte flag to the header. This commit doesn't introduce any new compression algorithm like lz4. Currently a full page image is compressed using the existing PGLZ algorithm. Per discussion, we decided to use it at least in the first version of the feature because there were no performance reports showing that its compression ratio is unacceptably lower than that of other algorithm. Of course, in the future, it's worth considering the support of other compression algorithm for the better compression. Rahila Syed and Michael Paquier, reviewed in various versions by myself, Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.	2015-03-11 15:52:24 +09:00
Alvaro Herrera	4f3924d9cd	Keep CommitTs module in sync in standby and master We allow this module to be turned off on restarts, so a restart time check is enough to activate or deactivate the module; however, if there is a standby replaying WAL emitted from a master which is restarted, but the standby isn't, the state in the standby becomes inconsistent and can easily be crashed. Fix by activating and deactivating the module during WAL replay on parameter change as well as on system start. Problem reported by Fujii Masao in http://www.postgresql.org/message-id/CAHGQGwFhJ3CnHo1CELEfay18yg_RA-XZT-7D8NuWUoYSZ90r4Q@mail.gmail.com Author: Petr Jelínek	2015-03-09 17:44:00 -03:00
Heikki Linnakangas	f1fd515b39	Move WAL-related definitions from dbcommands.h to separate header file. This makes it easier to write frontend programs that needs to understand the WAL record format of CREATE/DROP DATABASE. dbcommands.h cannot easily be #included in a frontend program, because it pulls in other header files that need backend stuff, but the new dbcommands_xlog.h header file has fewer dependencies.	2015-03-09 15:50:49 +02:00
Fujii Masao	c74c04b8aa	Add missing "goto err" statements in xlogreader.c. Spotted by Andres Freund.	2015-03-09 14:31:10 +09:00
Fujii Masao	934d122685	Fix typo in comment.	2015-03-05 20:15:16 +09:00
Andres Freund	fd6a3f3ad4	Reconsider when to wait for WAL flushes/syncrep during commit. Up to now RecordTransactionCommit() waited for WAL to be flushed (if synchronous_commit != off) and to be synchronously replicated (if enabled), even if a transaction did not have a xid assigned. The primary reason for that is that sequence's nextval() did not assign a xid, but are worthwhile to wait for on commit. This can be problematic because sometimes read only transactions do write WAL, e.g. HOT page prune records. That then could lead to read only transactions having to wait during commit. Not something people expect in a read only transaction. This lead to such strange symptoms as backends being seemingly stuck during connection establishment when all synchronous replicas are down. Especially annoying when said stuck connection is the standby trying to reconnect to allow syncrep again... This behavior also is involved in a rather complicated <= 9.4 bug where the transaction started by catchup interrupt processing waited for syncrep using latches, but didn't get the wakeup because it was already running inside the same overloaded signal handler. Fix the issue here doesn't properly solve that issue, merely papers over the problems. In 9.5 catchup interrupts aren't processed out of signal handlers anymore. To fix all this, make nextval() acquire a top level xid, and only wait for transaction commit if a transaction both acquired a xid and emitted WAL records. If only a xid has been assigned we don't uselessly want to wait just because of writes to temporary/unlogged tables; if only WAL has been written we don't want to wait just because of HOT prunes. The xid assignment in nextval() is unlikely to cause overhead in real-world workloads. For one it only happens SEQ_LOG_VALS/32 values anyway, for another only usage of nextval() without using the result in an insert or similar is affected. Discussion: 20150223165359.GF30784@awork2.anarazel.de, 369698E947874884A77849D8FE3680C2@maumau, 5CF4ABBA67674088B3941894E22A0D25@maumau Per complaint from maumau and Thom Brown Backpatch all the way back; 9.0 doesn't have syncrep, but it seems better to be consistent behavior across all maintained branches.	2015-02-26 12:50:07 +01:00
Heikki Linnakangas	88e9823026	Replace checkpoint_segments with min_wal_size and max_wal_size. Instead of having a single knob (checkpoint_segments) that both triggers checkpoints, and determines how many checkpoints to recycle, they are now separate concerns. There is still an internal variable called CheckpointSegments, which triggers checkpoints. But it no longer determines how many segments to recycle at a checkpoint. That is now auto-tuned by keeping a moving average of the distance between checkpoints (in bytes), and trying to keep that many segments in reserve. The advantage of this is that you can set max_wal_size very high, but the system won't actually consume that much space if there isn't any need for it. The min_wal_size sets a floor for that; you can effectively disable the auto-tuning behavior by setting min_wal_size equal to max_wal_size. The max_wal_size setting is now the actual target size of WAL at which a new checkpoint is triggered, instead of the distance between checkpoints. Previously, you could calculate the actual WAL usage with the formula "(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this patch, you set the desired WAL usage with max_wal_size, and the system calculates the appropriate CheckpointSegments with the reverse of that formula. That's a lot more intuitive for administrators to set. Reviewed by Amit Kapila and Venkata Balaji N.	2015-02-23 18:53:02 +02:00
Fujii Masao	5d2b45e3f7	Add GUC to control the time to wait before retrieving WAL after failed attempt. Previously when the standby server failed to retrieve WAL files from any sources (i.e., streaming replication, local pg_xlog directory or WAL archive), it always waited for five seconds (hard-coded) before the next attempt. For example, this is problematic in warm-standby because restore_command can fail every five seconds even while new WAL file is expected to be unavailable for a long time and flood the log files with its error messages. This commit adds new parameter, wal_retrieve_retry_interval, to control that wait time. Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.	2015-02-23 20:55:17 +09:00
Tom Lane	33a3b03d63	Use FLEXIBLE_ARRAY_MEMBER in some more places. Fix a batch of structs that are only visible within individual .c files. Michael Paquier	2015-02-20 17:32:01 -05:00
Heikki Linnakangas	49b04188f8	Fix thinko in re-setting wal_log_hints flag from a parameter-change record. The flag is supposed to be copied from the record. Same issue with track_commit_timestamps, but that's master-only. Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was added.	2015-01-15 20:52:41 +02:00
Heikki Linnakangas	1e78d81e88	Don't open a WAL segment for writing at end of recovery. Since commit `ba94518a`, we used XLogFileOpen to open the next segment for writing, but if the end-of-recovery happens exactly at a segment boundary, the new segment might not exist yet. (Before `ba94518a`, XLogFileOpen was correct, because we would open the previous segment if the switch happened at the boundary.) Instead of trying to create it if necessary, it's simpler to not bother opening the segment at all. XLogWrite() will open or create it soon anyway, after writing the checkpoint or end-of-recovery record. Reported by Andres Freund.	2015-01-07 16:20:20 +02:00
Bruce Momjian	4baaf863ec	Update copyright for 2015 Backpatch certain files through 9.0	2015-01-06 11:43:47 -05:00
Andres Freund	14570c2828	Remove superflous variable from xlogreader's XLogFindNextRecord(). Pointed out by Coverity. Since this is mere, and debatable, cosmetics I'm not backpatching this.	2015-01-04 15:35:46 +01:00
Tom Lane	d6657d2a10	Treat negative values of recovery_min_apply_delay as having no effect. At one point in the development of this feature, it was claimed that allowing negative values would be useful to compensate for timezone differences between master and slave servers. That was based on a mistaken assumption that commit timestamps are recorded in local time; but of course they're in UTC. Nor is a negative apply delay likely to be a sane way of coping with server clock skew. However, the committed patch still treated negative delays as doing something, and the timezone misapprehension survived in the user documentation as well. If recovery_min_apply_delay were a proper GUC we'd just set the minimum allowed value to be zero; but for the moment it seems better to treat negative settings as if they were zero. In passing do some extra wordsmithing on the parameter's documentation, including correcting a second misstatement that the parameter affects processing of Restore Point records. Issue noted by Michael Paquier, who also provided the code patch; doc changes by me. Back-patch to 9.4 where the feature was introduced.	2015-01-03 13:14:03 -05:00
Andres Freund	7882c3b0b9	Convert the PGPROC->lwWaitLink list into a dlist instead of open coding it. Besides being shorter and much easier to read it changes the logic in LWLockRelease() to release all shared lockers when waking up any. This can yield some significant performance improvements - and the fairness isn't really much worse than before, as we always allowed new shared lockers to jump the queue.	2014-12-25 17:24:30 +01:00
Alvaro Herrera	a609d96778	Revert "Use a bitmask to represent role attributes" This reverts commit `1826987a46`. The overall design was deemed unacceptable, in discussion following the previous commit message; we might find some parts of it still salvageable, but I don't want to be on the hook for fixing it, so let's wait until we have a new patch.	2014-12-23 15:35:49 -03:00
Alvaro Herrera	1826987a46	Use a bitmask to represent role attributes The previous representation using a boolean column for each attribute would not scale as well as we want to add further attributes. Extra auxilliary functions are added to go along with this change, to make up for the lost convenience of access of the old representation. Catalog version bumped due to change in catalogs and the new functions. Author: Adam Brightwell, minor tweaks by Álvaro Reviewed by: Stephen Frost, Andres Freund, Álvaro Herrera	2014-12-23 10:22:09 -03:00
Heikki Linnakangas	2ef6c66a2b	Fix file descriptor leak at end of recovery. XLogFileInit() returns a file descriptor, which needs to be closed. The leak was short-lived, since the startup process exits shortly afterwards, but it was clearly a bug, nevertheless. Per Coverity report.	2014-12-21 21:51:59 +02:00
Heikki Linnakangas	5c805d0a81	Fix timestamp in end-of-recovery WAL records. We used time(null) to set a TimestampTz field, which gave bogus results. Noticed while looking at pg_xlogdump output. Backpatch to 9.3 and above, where the fast promotion was introduced.	2014-12-19 17:04:20 +02:00
Tom Lane	4a14f13a0a	Improve hash_create's API for selecting simple-binary-key hash functions. Previously, if you wanted anything besides C-string hash keys, you had to specify a custom hashing function to hash_create(). Nearly all such callers were specifying tag_hash or oid_hash; which is tedious, and rather error-prone, since a caller could easily miss the opportunity to optimize by using hash_uint32 when appropriate. Replace this with a design whereby callers using simple binary-data keys just specify HASH_BLOBS and don't need to mess with specific support functions. hash_create() itself will take care of optimizing when the key size is four bytes. This nets out saving a few hundred bytes of code space, and offers a measurable performance improvement in tidbitmap.c (which was not exploiting the opportunity to use hash_uint32 for its 4-byte keys). There might be some wins elsewhere too, I didn't analyze closely. In future we could look into offering a similar optimized hashing function for 8-byte keys. Under this design that could be done in a centralized and machine-independent fashion, whereas getting it right for keys of platform-dependent sizes would've been notationally painful before. For the moment, the old way still works fine, so as not to break source code compatibility for loadable modules. Eventually we might want to remove tag_hash and friends from the exported API altogether, since there's no real need for them to be explicitly referenced from outside dynahash.c. Teodor Sigaev and Tom Lane	2014-12-18 13:36:36 -05:00
Heikki Linnakangas	ba94518aad	Change how first WAL segment on new timeline after promotion is created. Two changes: 1. When copying a WAL segment from old timeline to create the first segment on the new timeline, only copy up to the point where the timeline switch happens, and zero-fill the rest. This avoids corner cases where we might think that the copied WAL from the previous timeline belong to the new timeline. 2. If the timeline switch happens at a segment boundary, don't copy the whole old segment to the new timeline. It's pointless, because it's 100% identical to the old segment.	2014-12-18 20:23:03 +02:00
Andres Freund	c303e9e7e5	Fix (re-)starting from a basebackup taken off a standby after a failure. When starting up from a basebackup taken off a standby extra logic has to be applied to compute the point where the data directory is consistent. Normal base backups use a WAL record for that purpose, but that isn't possible on a standby. That logic had a error check ensuring that the cluster's control file indicates being in recovery. Unfortunately that check was too strict, disregarding the fact that the control file could also indicate that the cluster was shut down while in recovery. That's possible when the a cluster starting from a basebackup is shut down before the backup label has been removed. When everything goes well that's a short window, but when either restore_command or primary_conninfo isn't configured correctly the window can get much wider. That's because inbetween reading and unlinking the label we restore the last checkpoint from WAL which can need additional WAL. To fix simply also allow starting when the control file indicates "shutdown in recovery". There's nicer fixes imaginable, but they'd be more invasive. Backpatch to 9.2 where support for taking basebackups from standbys was added.	2014-12-18 08:47:27 +01:00
Simon Riggs	b8e33a85d4	Tweaks for recovery_target_action Rename parameter action_at_recovery_target to recovery_target_action suggested by Christoph Berg. Place into recovery.conf suggested by Fujii Masao, replacing (deprecating) earlier parameters, per Michael Paquier.	2014-12-07 21:55:29 +09:00
Alvaro Herrera	73c986adde	Keep track of transaction commit timestamps Transactions can now set their commit timestamp directly as they commit, or an external transaction commit timestamp can be fed from an outside system using the new function TransactionTreeSetCommitTsData(). This data is crash-safe, and truncated at Xid freeze point, same as pg_clog. This module is disabled by default because it causes a performance hit, but can be enabled in postgresql.conf requiring only a server restart. A new test in src/test/modules is included. Catalog version bumped due to the new subdirectory within PGDATA and a couple of new SQL functions. Authors: Álvaro Herrera and Petr Jelínek Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven Singer, Peter Eisentraut	2014-12-03 11:53:02 -03:00
Alvaro Herrera	6597ec9be6	Fix typos	2014-12-03 11:52:15 -03:00
Alvaro Herrera	ae04bf5027	Update transaction README for persistent multixacts Multixacts are now maintained during recovery, but the README didn't get the memo. Backpatch to 9.3, where the divergence was introduced.	2014-11-28 18:06:18 -03:00
Heikki Linnakangas	afeacd2748	Fix assertion failure at end of PITR. InitXLogInsert() cannot be called in a critical section, because it allocates memory. But CreateCheckPoint() did that, when called for the end-of-recovery checkpoint by the startup process. In the passing, fix the scratch space allocation in InitXLogInsert to go to the right memory context. Also update the comment at InitXLOGAccess, which hasn't been totally accurate since hot standby was introduced (in a hot standby backend, InitXLOGAccess isn't called at backend startup). Reported by Michael Paquier	2014-11-28 09:31:53 +02:00
Simon Riggs	aedccb1f6f	action_at_recovery_target recovery config option action_at_recovery_target = pause \| promote \| shutdown Petr Jelinek Reviewed by Muhammad Asif Naeem, Fujji Masao and Simon Riggs	2014-11-25 20:13:30 +00:00
Heikki Linnakangas	0bd624d63b	Distinguish XLOG_FPI records generated for hint-bit updates. Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images generated for hint bit updates, when checksums are enabled. The new record type is replayed exactly the same as XLOG_FPI, but allows them to be tallied separately e.g. in pg_xlogdump.	2014-11-24 11:09:08 +02:00
Heikki Linnakangas	8f5dcb56cb	Fix bogus comments in XLogRecordAssemble Pointed out by Michael Paquier	2014-11-21 12:15:27 +02:00
Heikki Linnakangas	2c03216d83	Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.	2014-11-20 18:46:41 +02:00
Andres Freund	d3586fc8aa	Ensure unlogged tables are reset even if crash recovery errors out. Unlogged relations are reset at the end of crash recovery as they're only synced to disk during a proper shutdown. Unfortunately that and later steps can fail, e.g. due to running out of space. This reset was, up to now performed after marking the database as having finished crash recovery successfully. As out of space errors trigger a crash restart that could lead to the situation that not all unlogged relations are reset. Once that happend usage of unlogged relations could yield errors like "could not open file "...": No such file or directory". Luckily clusters that show the problem can be fixed by performing a immediate shutdown, and starting the database again. To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier, before marking the database as having successfully recovered. Discussion: 20140912112246.GA4984@alap3.anarazel.de Backpatch to 9.1 where unlogged tables were introduced. Abhijit Menon-Sen and Andres Freund	2014-11-15 01:19:26 +01:00
Alvaro Herrera	51f9ea25dc	Allow interrupting GetMultiXactIdMembers This function has a loop which can lead to uninterruptible process "stalls" (actually infinite loops) when some bugs are triggered. Avoid that unpleasant situation by adding a check for interrupts in a place that shouldn't degrade performance in the normal case. Backpatch to 9.3. Older branches have an identical loop here, but the aforementioned bugs are only a problem starting in 9.3 so there doesn't seem to be any point in backpatching any further.	2014-11-14 15:14:01 -03:00
Heikki Linnakangas	81c4508196	Fix race condition between hot standby and restoring a full-page image. There was a window in RestoreBackupBlock where a page would be zeroed out, but not yet locked. If a backend pinned and locked the page in that window, it saw the zeroed page instead of the old page or new page contents, which could lead to missing rows in a result set, or errors. To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins, zeroes, and locks the page, if it's not in the buffer cache already. In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE, to avoid breaking any 3rd party extensions that might use RBM_ZERO. More importantly, this avoids renumbering the other enum values, which would cause even bigger confusion in extensions that use ReadBufferExtended, but haven't been recompiled. Backpatch to all supported versions; this has been racy since hot standby was introduced.	2014-11-13 20:02:37 +02:00
Heikki Linnakangas	34402ae351	Fix XLogReadBufferForRedoExtended to get cleanup lock when asked to do so.	2014-11-13 17:54:20 +02:00
Heikki Linnakangas	7250d8535b	Fix building with WAL_DEBUG. Now that the backup blocks are appended to the WAL record in xloginsert.c, XLogInsert doesn't see them anymore and cannot remove them from the version reconstructed for xlog_outdesc. This makes running with wal_debug=on more expensive, as we now make (unnecessary) temporary copies of the backup blocks, but it doesn't seem worth convoluting the code to keep that optimization. Reported by Alvaro Herrera.	2014-11-07 23:09:31 +02:00
Alvaro Herrera	7516f52594	BRIN: Block Range Indexes BRIN is a new index access method intended to accelerate scans of very large tables, without the maintenance overhead of btrees or other traditional indexes. They work by maintaining "summary" data about block ranges. Bitmap index scans work by reading each summary tuple and comparing them with the query quals; all pages in the range are returned in a lossy TID bitmap if the quals are consistent with the values in the summary tuple, otherwise not. Normal index scans are not supported because these indexes do not store TIDs. As new tuples are added into the index, the summary information is updated (if the block range in which the tuple is added is already summarized) or not; in the latter case, a subsequent pass of VACUUM or the brin_summarize_new_values() function will create the summary information. For data types with natural 1-D sort orders, the summary info consists of the maximum and the minimum values of each indexed column within each page range. This type of operator class we call "Minmax", and we supply a bunch of them for most data types with B-tree opclasses. Since the BRIN code is generalized, other approaches are possible for things such as arrays, geometric types, ranges, etc; even for things such as enum types we could do something different than minmax with better results. In this commit I only include minmax. Catalog version bumped due to new builtin catalog entries. There's more that could be done here, but this is a good step forwards. Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera, with contribution by Heikki Linnakangas. Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas. Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo. PS: The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.	2014-11-07 16:38:14 -03:00
Fujii Masao	5332b8cec5	Prevent the unnecessary creation of .ready file for the timeline history file. Previously .ready file was created for the timeline history file at the end of an archive recovery even when WAL archiving was not enabled. This creation is unnecessary and causes .ready file to remain infinitely. This commit changes an archive recovery so that it creates .ready file for the timeline history file only when WAL archiving is enabled. Backpatch to all supported versions.	2014-11-06 21:24:40 +09:00
Heikki Linnakangas	2076db2aea	Move the backup-block logic from XLogInsert to a new file, xloginsert.c. xlog.c is huge, this makes it a little bit smaller, which is nice. Functions related to putting together the WAL record are in xloginsert.c, and the lower level stuff for managing WAL buffers and such are in xlog.c. Also move the definition of XLogRecord to a separate header file. This causes churn in the #includes of all the files that write WAL records, and redo routines, but it avoids pulling in xlog.h into most places. Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.	2014-11-06 13:55:36 +02:00
Heikki Linnakangas	5028f22f6e	Switch to CRC-32C in WAL and other places. The old algorithm was found to not be the usual CRC-32 algorithm, used by Ethernet et al. We were using a non-reflected lookup table with code meant for a reflected lookup table. That's a strange combination that AFAICS does not correspond to any bit-wise CRC calculation, which makes it difficult to reason about its properties. Although it has worked well in practice, seems safer to use a well-known algorithm. Since we're changing the algorithm anyway, we might as well choose a different polynomial. The Castagnoli polynomial has better error-correcting properties than the traditional CRC-32 polynomial, even if we had implemented it correctly. Another reason for picking that is that some new CPUs have hardware support for calculating CRC-32C, but not CRC-32, let alone our strange variant of it. This patch doesn't add any support for such hardware, but a future patch could now do that. The old algorithm is kept around for tsquery and pg_trgm, which use the values in indexes that need to remain compatible so that pg_upgrade works. While we're at it, share the old lookup table for CRC-32 calculation between hstore, ltree and core. They all use the same table, so might as well.	2014-11-04 11:39:48 +02:00
Robert Haas	6cb4afff33	Avoid setup work for invalidation messages at start-of-(sub)xact. Instead of initializing a new TransInvalidationInfo for every transaction or subtransaction, we can just do it for those transactions or subtransactions that actually need to queue invalidation messages. That also avoids needing to free those entries at the end of a transaction or subtransaction that does not generate any invalidation messages, which is by far the common case. Patch by me. Review by Simon Riggs and Andres Freund.	2014-10-29 12:35:19 -04:00
Fujii Masao	c7371c4a60	Prevent the already-archived WAL file from being archived again. Previously the archive recovery always created .ready file for the last WAL file of the old timeline at the end of recovery even when it's restored from the archive and has .done file. That is, there was the case where the WAL file had both .ready and .done files. This caused the already-archived WAL file to be archived again. This commit prevents the archive recovery from creating .ready file for the last WAL file if it has .done file, in order to prevent it from being archived again. This bug was added when cascading replication feature was introduced, i.e., the commit `5286105800`. So, back-patch to 9.2, where cascading replication was added. Reviewed by Michael Paquier	2014-10-23 16:21:27 +09:00
Andres Freund	5e5b65f359	Don't duplicate log_checkpoint messages for both of restart and checkpoints. The duplication originated in `cdd46c765`, where restartpoints were introduced. In LogCheckpointStart's case the duplication actually lead to the compiler's format string checking not to be effective because the format string wasn't constant. Arguably these messages shouldn't be elog(), but ereport() style messages. That'd even allow to translate the messages... But as there's more mistakes of that kind in surrounding code, it seems better to change that separately.	2014-10-21 01:01:56 +02:00
Andres Freund	11abd6c90f	Renumber CHECKPOINT_* flags. Commit `7dbb606938` added a new CHECKPOINT_FLUSH_ALL flag. As that commit needed to be backpatched I didn't change the numeric values of the existing flags as that could lead to nastly problems if any external code issued checkpoints. That's not a concern on master, so renumber them there. Also add a comment about CHECKPOINT_FLUSH_ALL above CreateCheckPoint().	2014-10-21 00:20:08 +02:00
Andres Freund	7dbb606938	Flush unlogged table's buffers when copying or moving databases. CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source database directory on the filesystem level. To ensure the on disk state is consistent they block out users of the affected database and force a checkpoint to flush out all data to disk. Unfortunately, up to now, that checkpoint didn't flush out dirty buffers from unlogged relations. That bug means there could be leftover dirty buffers in either the template database, or the database in its old location. Leading to problems when accessing relations in an inconsistent state; and to possible problems during shutdown in the SET TABLESPACE case because buffers belonging files that don't exist anymore are flushed. This was reported in bug #10675 by Maxim Boguk. Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and Fujii Masao. Backpatch to 9.1 where unlogged tables were introduced.	2014-10-20 23:43:46 +02:00
Peter Eisentraut	b7a08c8028	Message improvements	2014-10-12 01:06:35 -04:00
Heikki Linnakangas	5fa6c81a43	Remove num_xloginsert_locks GUC, replace with a #define I left the GUC in place for the beta period, so that people could experiment with different values. No-one's come up with any data that a different value would be better under some circumstances, so rather than try to document to users what the GUC, let's just hard-code the current value, 8.	2014-10-01 16:42:26 +03:00
Andres Freund	ef8863844b	Rename CACHE_LINE_SIZE to PG_CACHE_LINE_SIZE. As noted in http://bugs.debian.org/763098 there is a conflict between postgres' definition of CACHE_LINE_SIZE and the definition by various *bsd platforms. It's debatable who has the right to define such a name, but postgres' use was only introduced in `375d8526f2` (9.4), so it seems like a good idea to rename it. Discussion: 20140930195756.GC27407@msg.df7cb.de Per complaint of Christoph Berg in the above email, although he's not the original bug reporter. Backpatch to 9.4 where the define was introduced.	2014-10-01 12:17:03 +02:00
Andres Freund	6ba4ecbf47	Remove most volatile qualifiers from xlog.c For the reason outlined in `df4077cda2` also remove volatile qualifiers from xlog.c. Some of these uses of volatile have been added after noticing problems back when spinlocks didn't imply compiler barriers. So they are a good test - in fact removing the volatiles breaks when done without the barriers in spinlocks present. Several uses of volatile remain where they are explicitly used to access shared memory without locks. These locations are ok with slightly out of date data, but removing the volatile might lead to the variables never being reread from memory. These uses could also be replaced by barriers, but that's a separate change of doubtful value.	2014-09-22 23:35:08 +02:00
Andres Freund	604f7956b9	Improve code around the recently added rm_identify rmgr callback. There are four weaknesses in728f152e07f998d2cb4fe5f24ec8da2c3bda98f2: * append_init() in heapdesc.c was ugly and required that rm_identify return values are only valid till the next call. Instead just add a couple more switch() cases for the INIT_PAGE cases. Now the returned value will always be valid. * a couple rm_identify() callbacks missed masking xl_info with ~XLR_INFO_MASK. * pg_xlogdump didn't map a NULL rm_identify to UNKNOWN or a similar string. * append_init() was called when id=NULL - which should never actually happen. But it's better to be careful.	2014-09-22 17:49:34 +02:00

... 3 4 5 6 7 ...

1654 Commits