postgresql

mirror of https://git.postgresql.org/git/postgresql.git synced 2024-08-10 02:23:23 +02:00

Author	SHA1	Message	Date
Tom Lane	be44ed27b8	Improve index AMs' opclass validation procedures. The amvalidate functions added in commit `65c5fcd353` were on the crude side. Improve them in a few ways: * Perform signature checking for operators and support functions. * Apply more thorough checks for missing operators and functions, where possible. * Instead of reporting problems as ERRORs, report most problems as INFO messages and make the amvalidate function return FALSE. This allows more than one problem to be discovered per run. * Report object names rather than OIDs, and work a bit harder on making the messages understandable. Also, remove a few more opr_sanity regression test queries that are now superseded by the amvalidate checks.	2016-01-21 19:47:15 -05:00
Tom Lane	b99551832e	Add defenses against putting expanded objects into Const nodes. Putting a reference to an expanded-format value into a Const node would be a bad idea for a couple of reasons. It'd be possible for the supposedly immutable Const to change value, if something modified the referenced variable ... in fact, if the Const's reference were R/W, any function that has the Const as argument might itself change it at runtime. Also, because datumIsEqual() is pretty simplistic, the Const might fail to compare equal to other Consts that it should compare equal to, notably including copies of itself. This could lead to unexpected planner behavior, such as "could not find pathkey item to sort" errors or inferior plans. I have not been able to find any way to get an expanded value into a Const within the existing core code; but Paul Ramsey was able to trigger the problem by writing a datatype input function that returns an expanded value. The best fix seems to be to establish a rule that varlena values being placed into Const nodes should be passed through pg_detoast_datum(). That will do nothing (and cost little) in normal cases, but it will flatten expanded values and thereby avoid the above problems. Also, it will convert short-header or compressed values into canonical format, which will avoid possible unexpected lack-of-equality issues for those cases too. And it provides a last-ditch defense against putting a toasted value into a Const, which we already knew was dangerous, cf commit `2b0c86b665`. (In the light of this discussion, I'm no longer sure that that commit provided 100% protection against such cases, but this fix should do it.) The test added in commit `65c3d05e18` to catch datatype input functions with unstable results would fail for functions that returned expanded values; but it seems a bit uncharitable to deem a result unstable just because it's expressed in expanded form, so revise the coding so that we check for bitwise equality only after applying pg_detoast_datum(). That's a sufficient condition anyway given the new rule about detoasting when forming a Const. Back-patch to 9.5 where the expanded-object facility was added. It's possible that this should go back further; but in the absence of clear evidence that there's any live bug in older branches, I'll refrain for now.	2016-01-21 12:56:08 -05:00
Fujii Masao	38710a374e	Remove unused argument from ginInsertCleanup() It's an oversight in commit `dc943ad`.	2016-01-22 01:22:56 +09:00
Simon Riggs	c80b31d557	Refactor headers to split out standby defs Jeff Janes	2016-01-20 18:51:34 -08:00
Simon Riggs	978b2f65aa	Speedup 2PC by skipping two phase state files in normal path 2PC state info is written only to WAL at PREPARE, then read back from WAL at COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr checkpoint cycle will be written to disk in the same form as previously. Crash recovery path is not altered. Measured performance gains of 50-100% for short 2PC transactions by completely avoiding writing files and fsyncing. Other optimizations still available, further patches in related areas expected. Stas Kelvich and heavily edited by Simon Riggs Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas, a concrete example of how Postgres-XC has fed back ideas into PostgreSQL. Reviewed by Michael Paquier, Jeff Janes and Andres Freund Performance testing by Jesper Pedersen	2016-01-20 18:40:44 -08:00
Simon Riggs	422a55a687	Refactor to create generic WAL page read callback Previously we didn’t have a generic WAL page read callback function, surprisingly. Logical decoding has logical_read_local_xlog_page(), which was actually generic, so move that to xlogfunc.c and rename to read_local_xlog_page(). Maintain logical_read_local_xlog_page() so existing callers still work. As requested by Michael Paquier, Alvaro Herrera and Andres Freund	2016-01-20 17:18:58 -08:00
Robert Haas	45be99f8cd	Support parallel joins, and make related improvements. The core innovation of this patch is the introduction of the concept of a partial path; that is, a path which if executed in parallel will generate a subset of the output rows in each process. Gathering a partial path produces an ordinary (complete) path. This allows us to generate paths for parallel joins by joining a partial path for one side (which at the baserel level is currently always a Partial Seq Scan) to an ordinary path on the other side. This is subject to various restrictions at present, especially that this strategy seems unlikely to be sensible for merge joins, so only nested loops and hash joins paths are generated. This also allows an Append node to be pushed below a Gather node in the case of a partitioned table. Testing revealed that early versions of this patch made poor decisions in some cases, which turned out to be caused by the fact that the original cost model for Parallel Seq Scan wasn't very good. So this patch tries to make some modest improvements in that area. There is much more to be done in the area of generating good parallel plans in all cases, but this seems like a useful step forward. Patch by me, reviewed by Dilip Kumar and Amit Kapila.	2016-01-20 14:40:26 -05:00
Robert Haas	a7de3dc5c3	Support multi-stage aggregation. Aggregate nodes now have two new modes: a "partial" mode where they output the unfinalized transition state, and a "finalize" mode where they accept unfinalized transition states rather than individual values as input. These new modes are not used anywhere yet, but they will be necessary for parallel aggregation. The infrastructure also figures to be useful for cases where we want to aggregate local data and remote data via the FDW interface, and want to bring back partial aggregates from the remote side that can then be combined with locally generated partial aggregates to produce the final value. It may also be useful even when neither FDWs nor parallelism are in play, as explained in the comments in nodeAgg.c. David Rowley and Simon Riggs, reviewed by KaiGai Kohei, Heikki Linnakangas, Haribabu Kommi, and me.	2016-01-20 13:46:50 -05:00
Tom Lane	dbe2328959	Fix assorted inconsistencies in GIN opclass support function declarations. GIN had some minor issues too, mostly using "internal" where something else would be more appropriate. I went with the same approach as in `9ff60273e3`, namely preferring the opclass' indexed datatype for arguments that receive an operator RHS value, even if that's not necessarily what they really are. Again, this is with an eye to having a uniform rule for ginvalidate() to check support function signatures.	2016-01-19 22:32:22 -05:00
Alvaro Herrera	948c97958b	Add two HyperLogLog functions New functions initHyperLogLogError() and freeHyperLogLog() simplify using this module from elsewhere. Author: Tomáš Vondra Review: Peter Geoghegan	2016-01-19 17:40:15 -03:00
Tom Lane	9ff60273e3	Fix assorted inconsistencies in GiST opclass support function declarations. The conventions specified by the GiST SGML documentation were widely ignored. For example, the strategy-number argument for "consistent" and "distance" functions is specified to be a smallint, but most of the built-in support functions declared it as an integer, and for that matter the core code passed it using Int32GetDatum not Int16GetDatum. None of that makes any real difference at runtime, but it's quite confusing for newcomers to the code, and it makes it very hard to write an amvalidate() function that checks support function signatures. So let's try to instill some consistency here. Another similar issue is that the "query" argument is not of a single well-defined type, but could have different types depending on the strategy (corresponding to search operators with different righthand-side argument types). Some of the functions threw up their hands and declared the query argument as being of "internal" type, which surely isn't right ("any" would have been more appropriate); but the majority position seemed to be to declare it as being of the indexed data type, corresponding to a search operator with both input types the same. So I've specified a convention that that's what to do always. Also, the result of the "union" support function actually must be of the index's storage type, but the documentation suggested declaring it to return "internal", and some of the functions followed that. Standardize on telling the truth, instead. Similarly, standardize on declaring the "same" function's inputs as being of the storage type, not "internal". Also, somebody had forgotten to add the "recheck" argument to both the documentation of the "distance" support function and all of their SQL declarations, even though the C code was happily using that argument. Clean that up too. Fix up some other omissions in the docs too, such as documenting that union's second input argument is vestigial. So far as the errors in core function declarations go, we can just fix pg_proc.h and bump catversion. Adjusting the erroneous declarations in contrib modules is more debatable: in principle any change in those scripts should involve an extension version bump, which is a pain. However, since these changes are purely cosmetic and make no functional difference, I think we can get away without doing that.	2016-01-19 12:04:36 -05:00
Tom Lane	65c5fcd353	Restructure index access method API to hide most of it at the C level. This patch reduces pg_am to just two columns, a name and a handler function. All the data formerly obtained from pg_am is now provided in a C struct returned by the handler function. This is similar to the designs we've adopted for FDWs and tablesample methods. There are multiple advantages. For one, the index AM's support functions are now simple C functions, making them faster to call and much less error-prone, since the C compiler can now check function signatures. For another, this will make it far more practical to define index access methods in installable extensions. A disadvantage is that SQL-level code can no longer see attributes of index AMs; in particular, some of the crosschecks in the opr_sanity regression test are no longer possible from SQL. We've addressed that by adding a facility for the index AM to perform such checks instead. (Much more could be done in that line, but for now we're content if the amvalidate functions more or less replace what opr_sanity used to do.) We might also want to expose some sort of reporting functionality, but this patch doesn't do that. Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily editorialized on by me.	2016-01-17 19:36:59 -05:00
Tom Lane	8d290c8ec6	Re-pgindent a few files. In preparation for landing index AM interface changes.	2016-01-17 19:13:18 -05:00
Magnus Hagander	cf7dfbf2d6	Fix minor typo in comment Tatsuro Yamada	2016-01-15 10:24:37 +01:00
Simon Riggs	e63bb4549a	Add new user fn pg_current_xlog_flush_location() Tomas Vondra, reviewed by Michael Paquier and Amit Kapila Minor edits by me	2016-01-12 07:54:52 +00:00
Tom Lane	26d538dc93	Clean up some lack-of-STRICT issues in the core code, too. A scan for missed proisstrict markings in the core code turned up these functions: brin_summarize_new_values pg_stat_reset_single_table_counters pg_stat_reset_single_function_counters pg_create_logical_replication_slot pg_create_physical_replication_slot pg_drop_replication_slot The first three of these take OID, so a null argument will normally look like a zero to them, resulting in "ERROR: could not open relation with OID 0" for brin_summarize_new_values, and no action for the pg_stat_reset_XXX functions. The other three will dump core on a null argument, though this is mitigated by the fact that they won't do so until after checking that the caller is superuser or has rolreplication privilege. In addition, the pg_logical_slot_get/peek[_binary]_changes family was intentionally marked nonstrict, but failed to make nullness checks on all the arguments; so again a null-pointer-dereference crash is possible but only for superusers and rolreplication users. Add the missing ARGISNULL checks to the latter functions, and mark the former functions as strict in pg_proc. Make that change in the back branches too, even though we can't force initdb there, just so that installations initdb'd in future won't have the issue. Since none of these bugs rise to the level of security issues (and indeed the pg_stat_reset_XXX functions hardly misbehave at all), it seems sufficient to do this. In addition, fix some order-of-operations oddities in the slot_get_changes family, mostly cosmetic, but not the part that moves the function's last few operations into the PG_TRY block. As it stood, there was significant risk for an error to exit without clearing historical information from the system caches. The slot_get_changes bugs go back to 9.4 where that code was introduced. Back-patch appropriate subsets of the pg_proc changes into all active branches, as well.	2016-01-09 16:58:32 -05:00
Simon Riggs	687f2cd7a0	Avoid pin scan for replay of XLOG_BTREE_VACUUM Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require complex interlocking that matched the requirements on the master. This required an O(N) operation that became a significant problem with large indexes, causing replication delays of seconds or in some cases minutes while the XLOG_BTREE_VACUUM was replayed. This commit skips the “pin scan” that was previously required, by observing in detail when and how it is safe to do so, with full documentation. The pin scan is skipped only in replay; the VACUUM code path on master is not touched here. The current commit still performs the pin scan for toast indexes, though this can also be avoided if we recheck scans on toast indexes. Later patch will address this. No tests included. Manual tests using an additional patch to view WAL records and their timing have shown the change in WAL records and their handling has successfully reduced replication delay.	2016-01-09 10:10:08 +00:00
Magnus Hagander	2650486ebc	Fix typo in comment Tatsuro Yamada	2016-01-08 08:54:40 +01:00
Alvaro Herrera	b1a9bad9e7	pgstat: add WAL receiver status view & SRF This new view provides insight into the state of a running WAL receiver in a hot standby node. The information returned includes the PID of the WAL receiver process, its status (stopped, starting, streaming, etc), start LSN and TLI, last received LSN and TLI, timestamp of last message send and receipt, latest end-of-WAL LSN and time, and the name of the slot (if any). Access to the detailed data is only granted to superusers; others only get the PID. Author: Michael Paquier Reviewer: Haribabu Kommi	2016-01-07 16:21:19 -03:00
Alvaro Herrera	a967613911	Windows: Make pg_ctl reliably detect service status pg_ctl is using isatty() to verify whether the process is running in a terminal, and if not it sends its output to Windows' Event Log ... which does the wrong thing when the output has been redirected to a pipe, as reported in bug #13592. To fix, make pg_ctl use the code we already have to detect service-ness: in the master branch, move src/backend/port/win32/security.c to src/port (with suitable tweaks so that it runs properly in backend and frontend environments); pg_ctl already has access to pgport so it Just Works. In older branches, that's likely to cause trouble, so instead duplicate the required code in pg_ctl.c. Author: Michael Paquier Bug report and diagnosis: Egon Kocjan Backpatch: all supported branches	2016-01-07 11:59:08 -03:00
Alvaro Herrera	abb1733922	Add scale(numeric) Author: Marko Tiikkaja	2016-01-05 19:02:13 -03:00
Tom Lane	ea0d494dae	Make the to_reg*() functions accept text not cstring. Using cstring as the input type was a poor decision, because that's not really a full-fledged type. In particular, it lacks implicit coercions from text or varchar, meaning that usages like to_regproc('foo'||'bar') wouldn't work; basically the only case that did work without explicit casting was a simple literal constant argument. The lack of field complaints about this suggests that hardly anyone is using these functions, so hopefully fixing it won't cause much of a compatibility problem. They've only been there since 9.4, anyway. Petr Korobeinikov	2016-01-05 13:02:43 -05:00
Alvaro Herrera	efa318bcfa	Make pg_shseclabel available in early backend startup While the in-core authentication mechanism doesn't need to access pg_shseclabel at all, it's reasonable to think that an authentication hook will want to look at the label for the role logging in, or for rows in other catalogs used during the authentication phase of startup. Catalog version bumped, because this changes the "is nailed" status for pg_shseclabel. Author: Adam Brightwell	2016-01-05 14:50:53 -03:00
Bruce Momjian	ee94300446	Update copyright for 2016 Backpatch certain files through 9.1	2016-01-02 13:33:40 -05:00
Tom Lane	0dab5ef39b	Fix ALTER OPERATOR to update dependencies properly. Fix an oversight in commit `321eed5f0f`: replacing an operator's selectivity functions needs to result in a corresponding update in pg_depend. We have a function that can handle that, but it was not called by AlterOperator(). To fix this without enlarging pg_operator.h's #include list beyond what clients can safely include, split off the function definitions into a new file pg_operator_fn.h, similarly to what we've done for some other catalog header files. It's not entirely clear whether any client-side code needs to include pg_operator.h, but it seems prudent to assume that there is some such code somewhere.	2015-12-31 17:37:31 -05:00
Joe Conway	241448b23a	Rename (new|old)estCommitTs to (new|old)estCommitTsXid The variables newestCommitTs and oldestCommitTs sound as if they are timestamps, but in fact they are the transaction Ids that correspond to the newest and oldest timestamps rather than the actual timestamps. Rename these variables to reflect that they are actually xids: to wit newestCommitTsXid and oldestCommitTsXid respectively. Also modify related code in a similar fashion, particularly the user facing output emitted by pg_controldata and pg_resetxlog. Complaint and patch by me, review by Tom Lane and Alvaro Herrera. Backpatch to 9.5 where these variables were first introduced.	2015-12-28 12:34:11 -08:00
Tom Lane	6efbded6e4	Allow omitting one or both boundaries in an array slice specifier. Omitted boundaries represent the upper or lower limit of the corresponding array subscript. This allows simpler specification of many common use-cases. (Revised version of commit `9246af6799`) YUriy Zhuravlev	2015-12-22 21:05:29 -05:00
Robert Haas	0ba3f3bc65	Comment improvements for abbreviated keys. Peter Geoghegan and Robert Haas	2015-12-22 13:57:18 -05:00
Robert Haas	ccd8f97922	postgres_fdw: Consider requesting sorted data so we can do a merge join. When use_remote_estimate is enabled, consider adding ORDER BY to the query we are sending to the remote server so that we can use that ordered data for a merge join. Commit `f18c944b61` arranges to push down the query pathkeys, which seems like the case most likely to be a win, but testing shows this can sometimes win, too. For a regular table, we know which indexes are present and therefore test whether the ordering provided by each such index is useful. Here, we take the opposite approach: guess what orderings would be useful if they could be generated cheaply, and then ask the remote side what those will cost. Ashutosh Bapat, with very substantial cosmetic revisions by me. Also reviewed by Rushabh Lathia.	2015-12-22 13:46:40 -05:00
Teodor Sigaev	bbbd807097	Revert `9246af6799` because I miss too much. Patch is returned to commitfest process.	2015-12-18 21:35:22 +03:00
Teodor Sigaev	9246af6799	Allow to omit boundaries in array subscript Allow to omit lower or upper or both boundaries in array subscript for selecting slice of array. Author: YUriy Zhuravlev	2015-12-18 15:18:58 +03:00
Tom Lane	66d947b9d3	Adjust behavior of single-user -j mode for better initdb error reporting. Previously, -j caused the entire input file to be read in and executed as a single command string. That's undesirable, not least because any error causes the entire file to be regurgitated as the "failing query". Some experimentation suggests a better rule: end the command string when we see a semicolon immediately followed by two newlines, ie, an empty line after a query. This serves nicely to break up the existing examples such as information_schema.sql and system_views.sql. A limitation is that it's no longer possible to write such a sequence within a string literal or multiline comment in a file meant to be read with -j; but there are no instances of such a problem within the data currently used by initdb. (If someone does make such a mistake in future, it'll be obvious because they'll get an unterminated-literal or unterminated-comment syntax error.) Other than that, there shouldn't be any negative consequences; you're not forced to end statements that way, it's just a better idea in most cases. In passing, remove src/include/tcop/tcopdebug.h, which is dead code because it's not included anywhere, and hasn't been for more than ten years. One of the debug-support symbols it purported to describe has been unreferenced for at least the same amount of time, and the other is removed by this commit on the grounds that it was useless: forcing -j mode all the time would have broken initdb. The lack of complaints about that, or about the missing inclusion, shows that no one has tried to use TCOP_DONTUSENEWLINE in many years.	2015-12-17 19:34:15 -05:00
Alvaro Herrera	756e7b4c9d	Rework internals of changing a type's ownership This is necessary so that REASSIGN OWNED does the right thing with composite types, to wit, that it also alters ownership of the type's pg_class entry -- previously, the pg_class entry remained owned by the original user, which later caused other failures such as the new owner's inability to use ALTER TYPE to rename an attribute of the affected composite. Also, if the original owner is later dropped, the pg_class entry becomes owned by a nonexistent user which is bogus. To fix, create a new routine AlterTypeOwner_oid which knows whether to pass the request to ATExecChangeOwner or deal with it directly, and use that in shdepReassignOwner rather than calling AlterTypeOwnerInternal directly. AlterTypeOwnerInternal is now simpler in that it only modifies the pg_type entry and recurses to handle a possible array type; higher-level tasks are handled by either AlterTypeOwner directly or AlterTypeOwner_oid. I took the opportunity to add a few more objects to the test rig for REASSIGN OWNED, so that more cases are exercised. Additional ones could be added for superuser-only-ownable objects (such as FDWs and event triggers) but I didn't want to push my luck by adding a new superuser to the tests on a backpatchable bug fix. Per bug #13666 reported by Chris Pacejo. Backpatch to 9.5. (I would back-patch this all the way back, except that it doesn't apply cleanly in 9.4 and earlier because `59367fdf9` wasn't backpatched. If we decide that we need this in earlier branches too, we should backpatch both.)	2015-12-17 14:25:41 -03:00
Tom Lane	2ec477dc81	Cope with Readline's failure to track SIGWINCH events outside of input. It emerges that libreadline doesn't notice terminal window size change events unless they occur while collecting input. This is easy to stumble over if you resize the window while using a pager to look at query output, but it can be demonstrated without any pager involvement. The symptom is that queries exceeding one line are misdisplayed during subsequent input cycles, because libreadline has the wrong idea of the screen dimensions. The safest, simplest way to fix this is to call rl_reset_screen_size() just before calling readline(). That causes an extra ioctl(TIOCGWINSZ) for every command; but since it only happens when reading from a tty, the performance impact should be negligible. A more valid objection is that this still leaves a tiny window during entry to readline() wherein delivery of SIGWINCH will be missed; but the practical consequences of that are probably negligible. In any case, there doesn't seem to be any good way to avoid the race, since readline exposes no functions that seem safe to call from a generic signal handler --- rl_reset_screen_size() certainly isn't. It turns out that we also need an explicit rl_initialize() call, else rl_reset_screen_size() dumps core when called before the first readline() call. rl_reset_screen_size() is not present in old versions of libreadline, so we need a configure test for that. (rl_initialize() is present at least back to readline 4.0, so we won't bother with a test for it.) We would need a configure test anyway since libedit's emulation of libreadline doesn't currently include such a function. Fortunately, libedit seems not to have any corresponding bug. Merlin Moncure, adjusted a bit by me	2015-12-16 16:59:35 -05:00
Robert Haas	6150a1b08a	Move buffer I/O and content LWLocks out of the main tranche. Move the content lock directly into the BufferDesc, so that locking and pinning a buffer touches only one cache line rather than two. Adjust the definition of BufferDesc slightly so that this doesn't make the BufferDesc any larger than one cache line (at least on platforms where a spinlock is only 1 or 2 bytes). We can't fit the I/O locks into the BufferDesc and stay within one cache line, so move those to a completely separate tranche. This leaves a relatively limited number of LWLocks in the main tranche, so increase the padding of those remaining locks to a full cache line, rather than allowing adjacent locks to share a cache line, hopefully reducing false sharing. Performance testing shows that these changes make little difference on laptop-class machines, but help significantly on larger servers, especially those with more than 2 sockets. Andres Freund, originally based on an earlier patch by Simon Riggs. Review and cosmetic adjustments (including heavy rewriting of the comments) by me.	2015-12-15 13:32:54 -05:00
Robert Haas	3fed417452	Provide a way to predefine LWLock tranche IDs. It's a bit cumbersome to use LWLockNewTrancheId(), because the returned value needs to be shared between backends so that each backend can call LWLockRegisterTranche() with the correct ID. So, for built-in tranches, use a hard-coded value instead. This is motivated by an upcoming patch adding further built-in tranches. Andres Freund and Robert Haas	2015-12-15 11:48:19 -05:00
Stephen Frost	833728d4c8	Handle policies during DROP OWNED BY DROP OWNED BY handled GRANT-based ACLs but was not removing roles from policies. Fix that by having DROP OWNED BY remove the role specified from the list of roles the policy (or policies) apply to, or the entire policy (or policies) if it only applied to the role specified. As with ACLs, the DROP OWNED BY caller must have permission to modify the policy or a WARNING is thrown and no change is made to the policy.	2015-12-11 16:12:25 -05:00
Tom Lane	4fcf48450d	Get rid of the planner's LateralJoinInfo data structure. I originally modeled this data structure on SpecialJoinInfo, but after commit `acfcd45cac` that looks like a pretty poor decision. All we really need is relid sets identifying laterally-referenced rels; and most of the time, what we want to know about includes indirect lateral references, a case the LateralJoinInfo data was unsuited to compute with any efficiency. The previous commit redefined RelOptInfo.lateral_relids as the transitive closure of lateral references, so that it easily supports checking indirect references. For the places where we really do want just direct references, add a new RelOptInfo field direct_lateral_relids, which is easily set up as a copy of lateral_relids before we perform the transitive closure calculation. Then we can just drop lateral_info_list and LateralJoinInfo and the supporting code. This makes the planner's handling of lateral references noticeably more efficient, and shorter too. Such a change can't be back-patched into stable branches for fear of breaking extensions that might be looking at the planner's data structures; but it seems not too late to push it into 9.5, so I've done so.	2015-12-11 15:52:38 -05:00
Tom Lane	acfcd45cac	Still more fixes for planner's handling of LATERAL references. More fuzz testing by Andreas Seltenreich exposed that the planner did not cope well with chains of lateral references. If relation X references Y laterally, and Y references Z laterally, then we will have to scan X on the inside of a nestloop with Z, so for all intents and purposes X is laterally dependent on Z too. The planner did not understand this and would generate intermediate joins that could not be used. While that was usually harmless except for wasting some planning cycles, under the right circumstances it would lead to "failed to build any N-way joins" or "could not devise a query plan" planner failures. To fix that, convert the existing per-relation lateral_relids and lateral_referencers relid sets into their transitive closures; that is, they now show all relations on which a rel is directly or indirectly laterally dependent. This not only fixes the chained-reference problem but allows some of the relevant tests to be made substantially simpler and faster, since they can be reduced to simple bitmap manipulations instead of searches of the LateralJoinInfo list. Also, when a PlaceHolderVar that is due to be evaluated at a join contains lateral references, we should treat those references as indirect lateral dependencies of each of the join's base relations. This prevents us from trying to join any individual base relations to the lateral reference source before the join is formed, which again cannot work. Andreas' testing also exposed another oversight in the "dangerous PlaceHolderVar" test added in commit `85e5e222b1`. Simply rejecting unsafe join paths in joinpath.c is insufficient, because in some cases we will end up rejecting all possible paths for a particular join, again leading to "could not devise a query plan" failures. The restriction has to be known also to join_is_legal and its cohort functions, so that they will not select a join for which that will happen. I chose to move the supporting logic into joinrels.c where the latter functions are. Back-patch to 9.3 where LATERAL support was introduced.	2015-12-11 14:22:20 -05:00
Alvaro Herrera	69e7235c93	Fix commit timestamp initialization This module needs explicit initialization in order to replay WAL records in recovery, but we had broken this recently following changes to make other (stranger) scenarios work correctly. To fix, rework the initialization sequence so that it always takes place before WAL replay commences for both master and standby. I could have gone for a more localized fix that just added a "startup" call for the master server, but it seemed better to restructure the existing callers as well so that the whole thing made more sense. As a drawback, there is more control logic in xlog.c now than previously, but doing otherwise meant passing down the ControlFile flag, which seemed uglier as a whole. This also meant adding a check to not re-execute ActivateCommitTs if it had already been called. Reported by Fujii Masao. Backpatch to 9.5.	2015-12-11 14:30:43 -03:00
Andres Freund	e3f4cfc7aa	Fix bug leading to restoring unlogged relations from empty files. At the end of crash recovery, unlogged relations are reset to the empty state, using their init fork as the template. The init fork is copied to the main fork without going through shared buffers. Unfortunately WAL replay so far has not necessarily flushed writes from shared buffers to disk at that point. In normal crash recovery, and before the introduction of 'fast promotions' in `fd4ced523` / 9.3, the END_OF_RECOVERY checkpoint flushes the buffers out in time. But with fast promotions that's not the case anymore. To fix, force WAL writes targeting the init fork to be flushed immediately (using the new FlushOneBuffer() function). In 9.5+ that flush can centrally be triggered from the code dealing with restoring full page writes (XLogReadBufferForRedoExtended), in earlier releases that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay function. Backpatch to 9.1, even if this currently is only known to trigger in 9.3+. Flushing earlier is more robust, and it is advantageous to keep the branches similar. Typical symptoms of this bug are errors like 'ERROR: index "..." contains unexpected zero page at block 0' shortly after promoting a node. Reported-By: Thom Brown Author: Andres Freund and Michael Paquier Discussion: 20150326175024.GJ451@alap3.anarazel.de Backpatch: 9.1-	2015-12-10 16:29:26 +01:00
Robert Haas	b287df70e4	Allow EXPLAIN (ANALYZE, VERBOSE) to display per-worker statistics. The original parallel sequential scan commit included only very limited changes to the EXPLAIN output. Aggregated totals from all workers were displayed, but there was no way to see what each individual worker did or to distinguish the effort made by the workers from the effort made by the leader. Per a gripe by Thom Brown (and maybe others). Patch by me, reviewed by Amit Kapila.	2015-12-09 13:21:19 -05:00
Kevin Grittner	25c5392330	Improve performance in freeing memory contexts. The singly linked list of memory contexts could result in O(N^2) performance to free a set of contexts if they were not freed in reverse order of creation. In many cases the reverse order was used, but there were some significant exceptions that caused real-world performance problems. Rather than requiring all callers to care about the order in which contexts were freed, and hunting down and changing all existing cases where the wrong order was used, we add one pointer per memory context so that the implementation details are not so visible. Jan Wieck	2015-12-08 17:32:49 -06:00
Robert Haas	385f337c9f	Allow foreign and custom joins to handle EvalPlanQual rechecks. Commit `e7cb7ee145` provided basic infrastructure for allowing a foreign data wrapper or custom scan provider to replace a join of one or more tables with a scan. However, this infrastructure failed to take into account the need for possible EvalPlanQual rechecks, and ExecScanFetch would fail an assertion (or just overwrite memory) if such a check was attempted for a plan containing a pushed-down join. To fix, adjust the EPQ machinery to skip some processing steps when scanrelid == 0, making those the responsibility of scan's recheck method, which also has the responsibility in this case of correctly populating the relevant slot. To allow foreign scans to gain control in the right place to make use of this new facility, add a new, optional RecheckForeignScan method. Also, allow a foreign scan to have a child plan, which can be used to correctly populate the slot (or perhaps for something else, but this is the only use currently envisioned). KaiGai Kohei, reviewed by Robert Haas, Etsuro Fujita, and Kyotaro Horiguchi.	2015-12-08 12:31:03 -05:00
Tom Lane	edca44b152	Simplify LATERAL-related calculations within add_paths_to_joinrel(). While convincing myself that commit `7e19db0c09` would solve both of the problems recently reported by Andreas Seltenreich, I realized that add_paths_to_joinrel's handling of LATERAL restrictions could be made noticeably simpler and faster if we were to retain the minimum possible parameterization for each joinrel (that is, the set of relids supplying unsatisfied lateral references in it). We already retain that for baserels, in RelOptInfo.lateral_relids, so we can use that field for joinrels too. I re-pgindent'd the files touched here, which affects some unrelated comments. This is, I believe, just a minor optimization not a bug fix, so no back-patch.	2015-12-07 18:56:17 -05:00
Tom Lane	7e19db0c09	Fix another oversight in checking if a join with LATERAL refs is legal. It was possible for the planner to decide to join a LATERAL subquery to the outer side of an outer join before the outer join itself is completed. Normally that's fine because of the associativity rules, but it doesn't work if the subquery contains a lateral reference to the inner side of the outer join. In such a situation the outer join must be done first. join_is_legal() missed this consideration and would allow the join to be attempted, but the actual path-building code correctly decided that no valid join path could be made, sometimes leading to planner errors such as "failed to build any N-way joins". Per report from Andreas Seltenreich. Back-patch to 9.3 where LATERAL support was added.	2015-12-07 17:42:11 -05:00
Alvaro Herrera	820ddb2c2f	Further tweak commit_timestamp behavior. As pointed out by Fujii Masao, we weren't quite there on a standby behaving sanely: first because we were failing to acquire the correct state in the case where no XLOG_PARAMETER_CHANGE message was sent (because a checkpoint had already happened after the setting was changed in the master, and then the standby was restarted); and second because promoting the standby with the feature enabled failed to activate it if the master had the feature disabled. This patch fixes both those misbehaviors hopefully without re-introducing any old problems. Also change the hint emitted in a standby together with the error message about the feature being disabled, to make it point out that the place to change the setting is the master. Otherwise, if the setting is already enabled in the standby, it is very confusing to have it say that the setting must be enabled ... Authors: Álvaro Herrera, Petr Jelínek. Backpatch to 9.5.	2015-12-03 19:22:31 -03:00
Tom Lane	ec7eef6b11	Avoid caching expression state trees for domain constraints across queries. In commit `8abb3cda0d` I attempted to cache the expression state trees constructed for domain CHECK constraints for the life of the backend (assuming the domain's constraints don't get redefined). However, this turns out not to work very well, because execQual.c will run those state trees with ecxt_per_query_memory pointing to a query-lifespan context, and in some situations we'll end up with pointers into that context getting stored into the state trees. This happens in particular with SQL-language functions, as reported by Emre Hasegeli, but there are many other cases. To fix, keep only the expression plan trees for domain CHECK constraints in the typcache's data structure, and revert to performing ExecInitExpr (at least) once per query to set up expression state trees in the query's context. Eventually it'd be nice to undo this, but that will require some careful thought about memory management for expression state trees, and it seems far too late for any such redesign in 9.5. This way is still much more efficient than what happened before `8abb3cda0`.	2015-11-29 18:18:42 -05:00
Tom Lane	8d32717b6b	Avoid doing encoding conversions by double-conversion via MULE_INTERNAL. Previously, we did many conversions for Cyrillic and Central European single-byte encodings by converting to a related MULE_INTERNAL coding scheme before converting to the destination. This seems unnecessarily inefficient. Moreover, if the conversion encounters an untranslatable character, the error message will confusingly complain about failure to convert to or from MULE_INTERNAL, rather than the user-visible encodings. Worse still, this approach results in some completely unnecessary conversion failures; there are cases where the chosen MULE subset lacks characters that exist in both of the user-visible encodings, causing a conversion failure that need not occur. This patch fixes the first two of those deficiencies by introducing a new local2local() conversion support subroutine for direct conversion between any two single-byte character sets, and adding new conversion tables where needed. However, I generated the new conversion tables by testing PG 9.5's behavior, so that the actual conversion behavior is bug-compatible with previous releases; the only user-visible behavior change is that the error messages for conversion failures are saner. Changes in the conversion behavior will probably ensue after discussion. Interestingly, although this approach requires more tables, the .so files actually end up smaller (at least on my x86_64 machine); the tables are smaller than the management code needed for double conversion. Per a complaint from Albe Laurenz.	2015-11-28 13:42:27 -05:00
Teodor Sigaev	92e38182d7	COPY (INSERT/UPDATE/DELETE .. RETURNING ..) Attached is a patch for being able to do COPY (query) without a CTE. Author: Marko Tiikkaja Review: Michael Paquier	2015-11-27 19:11:22 +03:00
Tom Lane	00cdd83521	Adopt the GNU convention for handling tar-archive members exceeding 8GB. The POSIX standard for tar headers requires archive member sizes to be printed in octal with at most 11 digits, limiting the representable file size to 8GB. However, GNU tar and apparently most other modern tars support a convention in which oversized values can be stored in base-256, allowing any practical file to be a tar member. Adopt this convention to remove two limitations: * pg_dump with -Ft output format failed if the contents of any one table exceeded 8GB. * pg_basebackup failed if the data directory contained any file exceeding 8GB. (This would be a fatal problem for installations configured with a table segment size of 8GB or more, and it has also been seen to fail when large core dump files exist in the data directory.) File sizes under 8GB are still printed in octal, so that no compatibility issues are created except in cases that would have failed entirely before. In addition, this patch fixes several bugs in the same area: * In 9.3 and later, we'd defined tarCreateHeader's file-size argument as size_t, which meant that on 32-bit machines it would write a corrupt tar header for file sizes between 4GB and 8GB, even though no error was raised. This broke both "pg_dump -Ft" and pg_basebackup for such cases. * pg_restore from a tar archive would fail on tables of size between 4GB and 8GB, on machines where either "size_t" or "unsigned long" is 32 bits. This happened even with an archive file not affected by the previous bug. * pg_basebackup would fail if there were files of size between 4GB and 8GB, even on 64-bit machines. * In 9.3 and later, "pg_basebackup -Ft" failed entirely, for any file size, on 64-bit big-endian machines. In view of these potential data-loss bugs, back-patch to all supported branches, even though removal of the documented 8GB limit might otherwise be considered a new feature rather than a bug fix.	2015-11-21 20:21:31 -05:00
Tom Lane	074c5cfbfb	Fix handling of inherited check constraints in ALTER COLUMN TYPE (again). The previous way of reconstructing check constraints was to do a separate "ALTER TABLE ONLY tab ADD CONSTRAINT" for each table in an inheritance hierarchy. However, that way has no hope of reconstructing the check constraints' own inheritance properties correctly, as pointed out in bug #13779 from Jan Dirk Zijlstra. What we should do instead is to do a regular "ALTER TABLE", allowing recursion, at the topmost table that has a particular constraint, and then suppress the work queue entries for inherited instances of the constraint. Annoyingly, we'd tried to fix this behavior before, in commit `5ed6546cf`, but we failed to notice that it wasn't reconstructing the pg_constraint field values correctly. As long as I'm touching pg_get_constraintdef_worker anyway, tweak it to always schema-qualify the target table name; this seems like useful backup to the protections installed by commit `5f173040`. In HEAD/9.5, get rid of get_constraint_relation_oids, which is now unused. (I could alternatively have modified it to also return conislocal, but that seemed like a pretty single-purpose API, so let's not pretend it has some other use.) It's unused in the back branches as well, but I left it in place just in case some third-party code has decided to use it. In HEAD/9.5, also rename pg_get_constraintdef_string to pg_get_constraintdef_command, as the previous name did nothing to explain what that entry point did differently from others (and its comment was equally useless). Again, that change doesn't seem like material for back-patching. I did a bit of re-pgindenting in tablecmds.c in HEAD/9.5, as well. Otherwise, back-patch to all supported branches.	2015-11-20 14:55:47 -05:00
Robert Haas	bc4996e61b	Make ALTER .. SET SCHEMA do nothing, instead of throwing an ERROR. This was already true for CREATE EXTENSION, but historically has not been true for other object types. Therefore, this is a backward incompatibility. Per discussion on pgsql-hackers, everyone seems to agree that the new behavior is better. Marti Raudsepp, reviewed by Haribabu Kommi and myself	2015-11-19 10:49:25 -05:00
Robert Haas	166b61a88e	Avoid aggregating worker instrumentation multiple times. Amit Kapila, per design ideas from me.	2015-11-18 12:35:25 -05:00
Robert Haas	e93b62985f	Remove volatile qualifiers from bufmgr.c and freelist.c. Prior to commit `0709b7ee72`, access to variables within a spinlock-protected critical section had to be done through a volatile pointer, but that should no longer be necessary. Review by Andres Freund	2015-11-16 18:50:06 -05:00
Robert Haas	fe702a7b3f	Move each SLRU's lwlocks to a separate tranche. This makes it significantly easier to identify these lwlocks in LWLOCK_STATS or Trace_lwlocks output. It's also arguably better from a modularity standpoint, since lwlock.c no longer needs to know anything about the LWLock needs of the higher-level SLRU facility. Ildus Kurbangaliev, reviewed by Álvaro Herrera and by me.	2015-11-12 14:59:09 -05:00
Robert Haas	a05dc4d7fd	Provide readfuncs support for custom scans. Commit `a0d9f6e434` added this support for all other plan node types; this fills in the gap. Since TextOutCustomScan complicates this and is pretty well useless, remove it. KaiGai Kohei, with some modifications by me.	2015-11-12 07:40:31 -05:00
Robert Haas	80558c1f5a	Generate parallel sequential scan plans in simple cases. Add a new flag, consider_parallel, to each RelOptInfo, indicating whether a plan for that relation could conceivably be run inside of a parallel worker. Right now, we're pretty conservative: for example, it might be possible to defer applying a parallel-restricted qual in a worker, and later do it in the leader, but right now we just don't try to parallelize access to that relation. That's probably the right decision in most cases, anyway. Using the new flag, generate parallel sequential scan plans for plain baserels, meaning that we now have parallel sequential scan in PostgreSQL. The logic here is pretty unsophisticated right now: the costing model probably isn't right in detail, and we can't push joins beneath Gather nodes, so the number of plans that can actually benefit from this is pretty limited right now. Lots more work is needed. Nevertheless, it seems time to enable this functionality so that all this code can actually be tested easily by users and developers. Note that, if you wish to test this functionality, it will be necessary to set max_parallel_degree to a value greater than the default of 0. Once a few more loose ends have been tidied up here, we might want to consider changing the default value of this GUC, but I'm leaving it alone for now. Along the way, fix a bug in cost_gather: the previous coding thought that a Gather node's transfer overhead should be costed on the basis of the relation size rather than the number of tuples that actually need to be passed off to the leader. Patch by me, reviewed in earlier versions by Amit Kapila.	2015-11-11 09:02:52 -05:00
Robert Haas	f0661c4e8c	Make sequential scans parallel-aware. In addition, this patch fills in a number of missing bits and pieces in the parallel infrastructure. Paths and plans now have a parallel_aware flag indicating whether whatever parallel-aware logic they have should be engaged. It is believed that we will need this flag for a number of path/plan types, not just sequential scans, which is why the flag is generic rather than part of the SeqScan structures specifically. Also, execParallel.c now gives parallel nodes a chance to initialize their PlanState nodes from the DSM during parallel worker startup. Amit Kapila, with a fair amount of adjustment by me. Review of previous patch versions by Haribabu Kommi and others.	2015-11-11 08:57:52 -05:00
Tom Lane	c5e86ea932	Add "xid <> xid" and "xid <> int4" operators. The corresponding "=" operators have been there a long time, and not having their negators is a bit of a nuisance. Michael Paquier	2015-11-07 16:40:15 -05:00
Robert Haas	6e71dd7ce9	Modify tqueue infrastructure to support transient record types. Commit `4a4e6893aa`, which introduced this mechanism, failed to account for the fact that the RECORD pseudo-type uses transient typmods that are only meaningful within a single backend. Transferring such tuples without modification between two cooperating backends does not work. This commit installs a system for passing the tuple descriptors over the same shm_mq being used to send the tuples themselves. The two sides might not assign the same transient typmod to any given tuple descriptor, so we must also substitute the appropriate receiver-side typmod for the one used by the sender. That adds some CPU overhead, but still seems better than being unable to pass records between cooperating parallel processes. Along the way, move the logic for handling multiple tuple queues from tqueue.c to nodeGather.c; tqueue.c now provides a TupleQueueReader, which reads from a single queue, rather than a TupleQueueFunnel, which potentially reads from multiple queues. This change was suggested previously as a way to make sure that nodeGather.c rather than tqueue.c had policy control over the order in which to read from queues, but it wasn't clear to me until now how good an idea it was. typmod mapping needs to be performed separately for each queue, and it is much simpler if the tqueue.c code handles that and leaves multiplexing multiple queues to higher layers of the stack.	2015-11-06 16:58:45 -05:00
Robert Haas	a76ef15d9f	Add sort support routine for the UUID data type. This introduces a simple encoding scheme to produce abbreviated keys: pack as many bytes of each UUID as will fit into a Datum. On little-endian machines, a byteswap is also performed; the abbreviated comparator can therefore just consist of a simple 3-way unsigned integer comparison. The purpose of this change is to speed up sorting data on a column of type UUID. Peter Geoghegan	2015-11-06 12:14:35 -05:00
Robert Haas	64b2e7ad91	Pass extra data to bgworkers, and use this to fix parallel contexts. Up until now, the total amount of data that could be passed to a background worker at startup was one datum, which can be as small as 4 bytes on some systems. That's enough to pass a dsm_handle or an array index, but not much else. Add a bgw_extra field to the BackgroundWorker struct, allowing up to 128 bytes to be passed to a new worker on any platform. Use this to fix a problem I recently discovered with the parallel context machinery added in 9.5: the master assigns each worker an array index, and each worker subsequently assigns itself an array index, and there's nothing to guarantee that the two sets of indexes match, leading to chaos. Normally, I would not back-patch the change to add bgw_extra, since it is basically a feature addition. However, since 9.5 is still in beta and there seems to be no other sensible way to repair the broken parallel context machinery, back-patch to 9.5. Existing background worker code can ignore the bgw_extra field without a problem, but might need to be recompiled since the structure size has changed. Report and patch by me. Review by Amit Kapila.	2015-11-05 12:13:56 -05:00
Tom Lane	d894941663	Allow postgres_fdw to ship extension funcs/operators for remote execution. The user can whitelist specified extension(s) in the foreign server's options, whereupon we will treat immutable functions and operators of those extensions as candidates to be sent for remote execution. Whitelisting an extension in this way basically promises that the extension exists on the remote server and behaves compatibly with the local instance. We have no way to prove that formally, so we have to rely on the user to get it right. But this seems like something that people can usually get right in practice. We might in future allow functions and operators to be whitelisted individually, but extension granularity is a very convenient special case, so it got done first. The patch as-committed lacks any regression tests, which is unfortunate, but introducing dependencies on other extensions for testing purposes would break "make installcheck" scenarios, which is worse. I have some ideas about klugy ways around that, but it seems like material for a separate patch. For the moment, leave the problem open. Paul Ramsey, hacked up a bit more by me	2015-11-03 18:42:18 -05:00
Robert Haas	1efc7e5382	Fix problems with ParamListInfo serialization mechanism. Commit `d1b7c1ffe7` introduced a mechanism for serializing a ParamListInfo structure to be passed to a parallel worker. However, this mechanism failed to handle external expanded values, as pointed out by Noah Misch. Repair. Moreover, plpgsql_param_fetch requires adjustment because the serialization mechanism needs it to skip evaluating unused parameters just as we would do when it is called from copyParamList, but params == estate->paramLI in that case. To fix, make the bms_is_member test in that function unconditional. Finally, have setup_param_list set a new ParamListInfo field, paramMask, to the parameters actually used in the expression, so that we don't try to fetch those that are not needed when serializing a parameter list. This isn't necessary for correctness, but it makes the performance of the parallel executor code comparable to what we do for cases involving cursors. Design suggestions and extensive review by Noah Misch. Patch by me.	2015-11-02 18:11:29 -05:00
Tom Lane	12c9a04008	Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match ending at the current point in the string, rather than one starting at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. 
(In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)	2015-10-30 19:14:19 -04:00
Robert Haas	3a1f8611f2	Update parallel executor support to reuse the same DSM. Commit `b0b0d84b3d` purported to make it possible to relaunch workers using the same parallel context, but it had an unpleasant race condition: we might reinitialize after the workers have sent their last control message but before they have detached the DSM, leading to crashes. Repair by introducing a new ParallelContext operation, ReinitializeParallelDSM. Adjust execParallel.c to use this new support, so that we can rescan a Gather node by relaunching workers but without needing to recreate the DSM. Amit Kapila, with some adjustments by me. Extracted from latest parallel sequential scan patch.	2015-10-30 10:44:54 +01:00
Robert Haas	8538a63070	Make Gather node projection-capable. The original Gather code failed to mark a Gather node as not able to do projection, but it couldn't, even though it did initialize its projection info via ExecAssignProjectionInfo. There doesn't seem to be any good reason for this node not to have projection capability, so clean things up so that it does. Without this, plans using Gather nodes might need to carry extra Result nodes to do projection.	2015-10-28 00:27:58 +01:00
Alvaro Herrera	531d21b75f	Cleanup commit timestamp module activation, again. Further tweak commit_ts.c so that on a standby the state is completely consistent with that in the master, rather than behaving differently in the cases that the settings differ. Now in standby and master the module should always be active or inactive in lockstep. Author: Petr Jelínek, with some further tweaks by Álvaro Herrera. Backpatch to 9.5, where commit timestamps were introduced. Discussion: http://www.postgresql.org/message-id/5622BF9D.2010409@2ndquadrant.com	2015-10-27 15:06:50 -03:00
Tom Lane	d435542583	Fix incorrect translation of minus-infinity datetimes for json/jsonb. Commit `bda76c1c8c` caused both plus and minus infinity to be rendered as "infinity", which is not only wrong but inconsistent with the pre-9.4 behavior of to_json(). Fix that by duplicating the coding in date_out/timestamp_out/timestamptz_out more closely. Per bug #13687 from Stepan Perlov. Back-patch to 9.4, like the previous commit. In passing, also re-pgindent json.c, since it had gotten a bit messed up by recent patches (and I was already annoyed by indentation-related problems in back-patching this fix ...)	2015-10-20 11:07:04 -07:00
Robert Haas	a1c466c5dd	Fix incorrect comment in plannodes.h. Etsuro Fujita	2015-10-20 11:11:35 -04:00
Robert Haas	ee7ca559fc	Add a C API for parallel heap scans. Using this API, one backend can set up a ParallelHeapScanDesc to which multiple backends can then attach. Each tuple in the relation will be returned to exactly one of the scanning backends. Only forward scans are supported, and rescans must be carefully coordinated. This is not exposed to the planner or executor yet. The original version of this code was written by me. Amit Kapila reviewed it, tested it, and improved it, including adding support for synchronized scans, per review comments from Jeff Davis. Extensive testing of this and related patches was performed by Haribabu Kommi. Final cleanup of this patch by me.	2015-10-16 17:33:18 -04:00
Robert Haas	b0b0d84b3d	Allow a parallel context to relaunch workers. This may allow some callers to avoid the overhead involved in tearing down a parallel context and then setting up a new one, which means releasing the DSM and then allocating and populating a new one. I suspect we'll want to revise the Gather node to make use of this new capability, but even if not it may be useful elsewhere and requires very little additional code.	2015-10-16 17:18:05 -04:00
Tom Lane	538b3b8b35	Improve memory-usage accounting in regular-expression compiler. This code previously counted the number of NFA states it created, and complained if a limit was exceeded, so as to prevent bizarre regex patterns from consuming unreasonable time or memory. That's fine as far as it went, but the code paid no attention to how many arcs linked those states. Since regexes can be contrived that have O(N) states but will need O(N^2) arcs after fixempties() processing, it was still possible to blow out memory, and take a long time doing it too. To fix, modify the bookkeeping to count space used by both states and arcs. I did not bother with including the "color map" in the accounting; it can only grow to a few megabytes, which is not a lot in comparison to what we're allowing for states+arcs (about 150MB on 64-bit machines or half that on 32-bit machines). Looking at some of the larger real-world regexes captured in the Tcl regression test suite suggests that the most that is likely to be needed for regexes found in the wild is under 10MB, so I believe that the current limit has enough headroom to make it okay to keep it as a hard-wired limit. In connection with this, redefine REG_ETOOBIG as meaning "regular expression is too complex"; the previous wording of "nfa has too many states" was already somewhat inapropos because of the error code's use for stack depth overrun, and it was not very user-friendly either. Back-patch to all supported branches.	2015-10-16 15:55:59 -04:00
Tom Lane	579840ca05	Fix O(N^2) performance problems in regular-expression compiler. Change the singly-linked in-arc and out-arc lists to be doubly-linked, so that arc deletion is constant time rather than having worst-case time proportional to the number of other arcs on the connected states. Modify the bulk arc transfer operations copyins(), copyouts(), moveins(), moveouts() so that they use a sort-and-merge algorithm whenever there's more than a small number of arcs to be copied or moved. The previous method is O(N^2) in the number of arcs involved, because it performs duplicate checking independently for each copied arc. The new method may change the ordering of existing arcs for the destination state, but nothing really cares about that. Provide another bulk arc copying method mergeins(), which is unused as of this commit but is needed for the next one. It basically is like copyins(), but the source arcs might not all come from the same state. Replace the O(N^2) bubble-sort algorithm used in carcsort() with a qsort() call. These changes greatly improve the performance of regex compilation for large or complex regexes, at the cost of extra space for arc storage during compilation. The original tradeoff was probably fine when it was made, but now we care more about speed and less about memory consumption. Back-patch to all supported branches.	2015-10-16 15:55:59 -04:00
Robert Haas	78652a3332	Remove cautions about using volatile from spin.h. Commit `0709b7ee72` obsoleted this comment but neglected to update it. Thomas Munro	2015-10-16 14:06:22 -04:00
Robert Haas	bfc78d7196	Rewrite interaction of parallel mode with parallel executor support. In the previous coding, before returning from ExecutorRun, we'd shut down all parallel workers. This was dead wrong if ExecutorRun was called with a non-zero tuple count; it had the effect of truncating the query output. To fix, give ExecutePlan control over whether to enter parallel mode, and have it refuse to do so if the tuple count is non-zero. Rewrite the Gather logic so that it can cope with being called outside parallel mode. Commit `7aea8e4f2d` is largely to blame for this problem, though this patch modifies some subsequently-committed code which relied on the guarantees it purported to make.	2015-10-16 11:56:02 -04:00
Robert Haas	816e336f12	Mark more functions parallel-restricted or parallel-unsafe. Commit `7aea8e4f2d` was overoptimistic about the degree of safety associated with running various functions in parallel mode. Functions that take a table name or OID as an argument are at least parallel-restricted, because the table might be temporary, and we currently don't allow parallel workers to touch temporary tables. Functions that take a query as an argument are outright unsafe, because the query could be anything, including a parallel-unsafe query. Also, the queue of pending notifications is backend-private, so adding to it from a worker doesn't behave correctly. We could fix this by transferring the worker's queue of pending notifications to the master during worker cleanup, but that seems like more trouble than it's worth for now. In addition to adjusting the pg_proc.h markings, also add an explicit check for this in async.c.	2015-10-16 11:49:31 -04:00
Robert Haas	82b37765c7	Fix a problem with parallel workers being unable to restore role. check_role() tries to verify that the user has permission to become the requested role, but this is inappropriate in a parallel worker, which needs to exactly recreate the master's authorization settings. So skip the check in that case. This fixes a bug in commit `924bcf4f16`.	2015-10-16 11:37:19 -04:00
Robert Haas	2ad5c27bb5	Don't send protocol messages to a shm_mq that no longer exists. Commit `2bd9e412f9` introduced a mechanism for relaying protocol messages from a background worker to another backend via a shm_mq. However, there was no provision for shutting down the communication channel. Therefore, a protocol message sent late in the shutdown sequence, such as a DEBUG message resulting from cranking up log_min_messages, could crash the server. To fix, install an on_dsm_detach callback that disables sending messages to the shm_mq when the associated DSM is detached.	2015-10-16 09:42:33 -04:00
Robert Haas	5fc4c26db5	Allow FDWs to push down quals without breaking EvalPlanQual rechecks. This fixes a long-standing bug which was discovered while investigating the interaction between the new join pushdown code and the EvalPlanQual machinery: if a ForeignScan appears on the inner side of a parameterized nestloop, an EPQ recheck would re-return the original tuple even if it no longer satisfied the pushed-down quals due to changed parameter values. This fix adds a new member to ForeignScan and ForeignScanState and a new argument to make_foreignscan, and requires changes to FDWs which push down quals to populate that new argument with a list of quals they have chosen to push down. Therefore, I'm only back-patching to 9.5, even though the bug is not new in 9.5. Etsuro Fujita, reviewed by me and by Kyotaro Horiguchi.	2015-10-15 13:00:40 -04:00
Tom Lane	869f693a36	On Windows, ensure shared memory handle gets closed if not being used. Postmaster child processes that aren't supposed to be attached to shared memory were not bothering to close the shared memory mapping handle they inherit from the postmaster process. That's mostly harmless, since the handle vanishes anyway when the child process exits -- but the syslogger process, if used, doesn't get killed and restarted during recovery from a backend crash. That meant that Windows doesn't see the shared memory mapping as becoming free, so it doesn't delete it and the postmaster is unable to create a new one, resulting in failure to recover from crashes whenever logging_collector is turned on. Per report from Dmitry Vasilyev. It's a bit astonishing that we'd not figured this out long ago, since it's been broken from the very beginnings of our native Windows support; probably some previously-unexplained trouble reports trace to this. A secondary problem is that on Cygwin (perhaps only in older versions?), exec() may not detach from the shared memory segment after all, in which case these child processes did remain attached to shared memory, posing the risk of an unexpected shared memory clobber if they went off the rails somehow. That may be a long-gone bug, but we can deal with it now if it's still live, by detaching within the infrastructure introduced here to deal with closing the handle. Back-patch to all supported branches. Tom Lane and Amit Kapila	2015-10-13 11:21:33 -04:00
Robert Haas	bfb54ff15a	Make abbreviated key comparisons for text a bit cheaper. If we do some byte-swapping while abbreviating, we can do comparisons using integer arithmetic rather than memcmp. Peter Geoghegan, reviewed and slightly revised by me.	2015-10-09 15:06:06 -04:00
Robert Haas	db0f6cad48	Remove set_latch_on_sigusr1 flag. This flag has proven to be a recipe for bugs, and it doesn't seem like it can really buy anything in terms of performance. So let's just always set the process latch when we receive SIGUSR1 instead of trying to do it only when needed. Per my recent proposal on pgsql-hackers.	2015-10-09 14:31:04 -04:00
Robert Haas	c171818b27	Add BSWAP64 macro. This is like BSWAP32, but for 64-bit values. Since we've got two of them now and they have use cases (like sortsupport) beyond CRCs, move the definitions to their own header file. Peter Geoghegan	2015-10-08 13:01:36 -04:00
Robert Haas	fd5eaad715	Correct pg_indent to pgindent in various comments. David Christensen	2015-10-08 12:27:54 -04:00
Bruce Momjian	b852dc4cbd	docs: clarify JSONB operator descriptions No catalog bump as the catalog changes are for SQL operator comments. Backpatch through 9.5	2015-10-07 09:06:49 -04:00
Tom Lane	7e2a18a916	Perform an immediate shutdown if the postmaster.pid file is removed. The postmaster now checks every minute or so (worst case, at most two minutes) that postmaster.pid is still there and still contains its own PID. If not, it performs an immediate shutdown, as though it had received SIGQUIT. The original goal behind this change was to ensure that failed buildfarm runs would get fully cleaned up, even if the test scripts had left a postmaster running, which is not an infrequent occurrence. When the buildfarm script removes a test postmaster's $PGDATA directory, its next check on postmaster.pid will fail and cause it to exit. Previously, manual intervention was often needed to get rid of such orphaned postmasters, since they'd block new test postmasters from obtaining the expected socket address. However, by checking postmaster.pid and not something else, we can provide additional robustness: manual removal of postmaster.pid is a frequent DBA mistake, and now we can at least limit the damage that will ensue if a new postmaster is started while the old one is still alive. Back-patch to all supported branches, since we won't get the desired improvement in buildfarm reliability otherwise.	2015-10-06 17:15:52 -04:00
Stephen Frost	4158cc3793	Do not write out WCOs in Query The WithCheckOptions list in Query are only populated during rewrite and do not need to be written out or read in as part of a Query structure. Further, move WithCheckOptions to the bottom and add comments to clarify that it is only populated during rewrite. Back-patch to 9.5 with a catversion bump, as we are still in alpha.	2015-10-05 07:38:58 -04:00
Stephen Frost	088c83363a	ALTER TABLE .. FORCE ROW LEVEL SECURITY To allow users to force RLS to always be applied, even for table owners, add ALTER TABLE .. FORCE ROW LEVEL SECURITY. row_security=off overrides FORCE ROW LEVEL SECURITY, to ensure pg_dump output is complete (by default). Also add SECURITY_NOFORCE_RLS context to avoid data corruption when ALTER TABLE .. FORCE ROW SECURITY is being used. The SECURITY_NOFORCE_RLS security context is used only during referential integrity checks and is only considered in check_enable_rls() after we have already checked that the current user is the owner of the relation (which should always be the case during referential integrity checks). Back-patch to 9.5 where RLS was added.	2015-10-04 21:05:08 -04:00
Tom Lane	a31e64d065	Fix some issues in new hashtable size calculations in nodeHash.c. Limit the size of the hashtable pointer array to not more than MaxAllocSize, per reports from Kouhei Kaigai and others of "invalid memory alloc request size" failures. There was discussion of allowing the array to get larger than that by using the "huge" palloc API, but so far no proof that that is actually a good idea, and at this point in the 9.5 cycle major changes from old behavior don't seem like the way to go. Fix a rather serious secondary bug in the new code, which was that it didn't ensure nbuckets remained a power of 2 when recomputing it for the multiple-batch case. Clean up sloppy division of labor between ExecHashIncreaseNumBuckets and its sole call site.	2015-10-04 14:06:50 -04:00
Peter Eisentraut	6390c8c654	Group cluster_name and update_process_title settings together	2015-10-04 12:29:36 -04:00
Noah Misch	3cb0a7e75a	Make BYPASSRLS behave like superuser RLS bypass. Specifically, make its effect independent from the row_security GUC, and make it affect permission checks pertinent to views the BYPASSRLS role owns. The row_security GUC thereby ceases to change successful-query behavior; it can only make a query fail with an error. Back-patch to 9.5, where BYPASSRLS was introduced.	2015-10-03 20:19:57 -04:00
Tom Lane	b63fc28776	Add recursion depth protections to regular expression matching. Some of the functions in regex compilation and execution recurse, and therefore could in principle be driven to stack overflow. The Tcl crew has seen this happen in practice in duptraverse(), though their fix was to put in a hard-wired limit on the number of recursive levels, which is not too appetizing --- fortunately, we have enough infrastructure to check the actually available stack. Greg Stark has also seen it in other places while fuzz testing on a machine with limited stack space. Let's put guards in to prevent crashes in all these places. Since the regex code would leak memory if we simply threw elog(ERROR), we have to introduce an API that checks for stack depth without throwing such an error. Fortunately that's not difficult.	2015-10-02 14:51:58 -04:00
Alvaro Herrera	f12e814b88	Fix commit_ts for standby Module initialization was still not completely correct after commit `6b61955135`, per crash report from Takashi Ohnishi. To fix, instead of trying to monkey around with the value of the GUC setting directly, add a separate boolean flag that enables the feature on a standby, but only for the startup (recovery) process, when it sees that its master server has the feature enabled. Discussion: http://www.postgresql.org/message-id/ca44c6c7f9314868bdc521aea4f77cbf@MP-MSGSS-MBX004.msg.nttdata.co.jp Also change the deactivation routine to delete all segment files rather than leaving the last one around. (This doesn't need separate WAL-logging, because on recovery we execute the same deactivation routine anyway.) In passing, clean up the code structure somewhat, particularly so that xlog.c doesn't know so much about when to activate/deactivate the feature. Thanks to Fujii Masao for testing and Petr Jelínek for off-list discussion. Back-patch to 9.5, where commit_ts was introduced.	2015-10-01 15:06:55 -03:00
Robert Haas	3bd909b220	Add a Gather executor node. A Gather executor node runs any number of copies of a plan in an equal number of workers and merges all of the results into a single tuple stream. It can also run the plan itself, if the workers are unavailable or haven't started up yet. It is intended to work with the Partial Seq Scan node which will be added in future commits. It could also be used to implement parallel query of a different sort by itself, without help from Partial Seq Scan, if the single_copy mode is used. In that mode, a worker executes the plan, and the parallel leader does not, merely collecting the worker's results. So, a Gather node could be inserted into a plan to split the execution of that plan across two processes. Nested Gather nodes aren't currently supported, but we might want to add support for that in the future. There's nothing in the planner to actually generate Gather nodes yet, so it's not quite time to break out the champagne. But we're getting close. Amit Kapila. Some design suggestions were provided by me, and I also reviewed the patch. Single-copy mode, documentation, and other minor changes also by me.	2015-09-30 19:23:36 -04:00
Alvaro Herrera	6b61955135	Code review for transaction commit timestamps There are three main changes here: 1. No longer cause a start failure in a standby if the feature is disabled in postgresql.conf but enabled in the master. This reverts one part of commit 4f3924d9cd43; what we keep is the ability of the standby to activate/deactivate the module (which includes creating and removing segments as appropriate) during replay of such actions in the master. 2. Replay WAL records affecting commitTS even if the feature is disabled. This means the standby will always have the same state as the master after replay. 3. Have COMMIT PREPARE record the transaction commit time as well. We were previously only applying it in the normal transaction commit path. Author: Petr Jelínek Discussion: http://www.postgresql.org/message-id/CAHGQGwHereDzzzmfxEBYcVQu3oZv6vZcgu1TPeERWbDc+gQ06g@mail.gmail.com Discussion: http://www.postgresql.org/message-id/CAHGQGwFuzfO4JscM9LCAmCDCxp_MfLvN4QdB+xWsS-FijbjTYQ@mail.gmail.com Additionally, I cleaned up nearby code related to replication origins, which I found a bit hard to follow, and fixed a couple of typos. Backpatch to 9.5, where this code was introduced. Per bug reports from Fujii Masao and subsequent discussion.	2015-09-29 14:40:56 -03:00
Robert Haas	d1b7c1ffe7	Parallel executor support. This code provides infrastructure for a parallel leader to start up parallel workers to execute subtrees of the plan tree being executed in the master. User-supplied parameters from ParamListInfo are passed down, but PARAM_EXEC parameters are not. Various other constructs, such as initplans, subplans, and CTEs, are also not currently shared. Nevertheless, there's enough here to support a basic implementation of parallel query, and we can lift some of the current restrictions as needed. Amit Kapila and Robert Haas	2015-09-28 21:55:57 -04:00
Alvaro Herrera	17f5831c81	Fix "sesssion" typo It was introduced alongside replication origins, by commit `5aa2350426`, so backpatch to 9.5. Pointed out by Fujii Masao	2015-09-28 19:13:42 -03:00
Andres Freund	aa29c1ccd9	Remove legacy multixact truncation support. In 9.5 and master there is no need to support legacy truncation. This is just committed separately to make it easier to backpatch the WAL logged multixact truncation to 9.3 and 9.4 if we later decide to do so. I bumped master's magic from 0xD086 to 0xD088 and 9.5's from 0xD085 to 0xD087 to avoid 9.5 reusing a value that has been in use on master while keeping the numbers increasing between major versions. Discussion: 20150621192409.GA4797@alap3.anarazel.de Backpatch: 9.5	2015-09-26 19:04:25 +02:00
Andres Freund	4f627f8973	Rework the way multixact truncations work. The fact that multixact truncations are not WAL logged has caused a fair share of problems. Amongst others it requires to do computations during recovery while the database is not in a consistent state, delaying truncations till checkpoints, and handling members being truncated, but offset not. We tried to put bandaids on lots of these issues over the last years, but it seems time to change course. Thus this patch introduces WAL logging for multixact truncations. This allows: 1) to perform the truncation directly during VACUUM, instead of delaying it to the checkpoint. 2) to avoid looking at the offsets SLRU for truncation during recovery, we can just use the master's values. 3) simplify a fair amount of logic to keep in memory limits straight, this has gotten much easier During the course of fixing this a bunch of additional bugs had to be fixed: 1) Data was not purged from memory the member's SLRU before deleting segments. This happened to be hard or impossible to hit due to the interlock between checkpoints and truncation. 2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but that doesn't work for offsets that haven't yet been flushed to disk. Add code to flush the SLRUs to fix. Not pretty, but it feels slightly safer to only make decisions based on actual on-disk state. 3) find_multixact_start() could be called concurrently with a truncation and thus fail. Via SetOffsetVacuumLimit() that could lead to a round of emergency vacuuming. The problem remains in pg_get_multixact_members(), but that's quite harmless. For now this is going to only get applied to 9.5+, leaving the issues in the older branches in place. It is quite possible that we need to backpatch at a later point though. For the case this gets backpatched we need to handle that an updated standby may be replaying WAL from a not-yet upgraded primary. 
We have to recognize that situation and use "old style" truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to before, this now happens in the startup process, when replaying a checkpoint record, instead of the checkpointer. Doing truncation in the restartpoint is incorrect, they can happen much later than the original checkpoint, thereby leading to wraparound. To avoid "multixact_redo: unknown op code 48" errors standbys would have to be upgraded before primaries. A later patch will bump the WAL page magic, and remove the legacy truncation codepaths. Legacy truncation support is just included to make a possible future backpatch easier. Discussion: 20150621192409.GA4797@alap3.anarazel.de Reviewed-By: Robert Haas, Alvaro Herrera, Thomas Munro Backpatch: 9.5 for now	2015-09-26 19:04:25 +02:00
Tom Lane	39df0f150c	Allow planner to use expression-index stats for function calls in WHERE. Previously, a function call appearing at the top level of WHERE had a hard-wired selectivity estimate of 0.3333333, a kludge conveniently dated in the source code itself to July 1992. The expectation at the time was that somebody would soon implement estimator support functions analogous to those for operators; but no such code has appeared, nor does it seem likely to in the near future. We do have an alternative solution though, at least for immutable functions on single relations: creating an expression index on the function call will allow ANALYZE to gather stats about the function's selectivity. But the code in clause_selectivity() failed to make use of such data even if it exists. Refactor so that that will happen. I chose to make it try this technique for any clause type for which clause_selectivity() doesn't have a special case, not just functions. To avoid adding unnecessary overhead in the common case where we don't learn anything new, make selfuncs.c provide an API that hooks directly to examine_variable() and then var_eq_const(), rather than the previous coding which laboriously constructed an OpExpr only so that it could be expensively deconstructed again. I preserved the behavior that the default estimate for a function call is 0.3333333. (For any other expression node type, it's 0.5, as before.) I had originally thought to make the default be 0.5 across the board, but changing a default estimate that's survived for twenty-three years seems like something not to do without a lot more testing than I care to put into it right now. Per a complaint from Jehan-Guillaume de Rorthais. Back-patch into 9.5, but not further, at least for the moment.	2015-09-24 18:35:46 -04:00
Teodor Sigaev	dc943ad952	Allow autoanalyze to add pages deleted from pending list to FSM Commit `e956808328` introduces adding pages to FSM for ordinary insert, but autoanalyze was able just cleanup pending list without adding to FSM. Also fix double call of IndexFreeSpaceMapVacuum() during ginvacuumcleanup() Report from Fujii Masao Patch by me Review by Jeff Janes	2015-09-23 15:33:51 +03:00
Noah Misch	7f11724bd6	Remove the SECURITY_ROW_LEVEL_DISABLED security context bit. This commit's parent made superfluous the bit's sole usage. Referential integrity checks have long run as the subject table's owner, and that now implies RLS bypass. Safe use of the bit was tricky, requiring strict control over the SQL expressions evaluating therein. Back-patch to 9.5, where the bit was introduced. Based on a patch by Stephen Frost.	2015-09-20 20:47:17 -04:00
Noah Misch	537bd178c7	Remove the row_security=force GUC value. Every query of a single ENABLE ROW SECURITY table has two meanings, with the row_security GUC selecting between them. With row_security=force available, every function author would have been advised to either set the GUC locally or test both meanings. Non-compliance would have threatened reliability and, for SECURITY DEFINER functions, security. Authors already face an obligation to account for search_path, and we should not mimic that example. With this change, only BYPASSRLS roles need exercise the aforementioned care. Back-patch to 9.5, where the row_security GUC was introduced. Since this narrows the domain of pg_db_role_setting.setconfig and pg_proc.proconfig, one might bump catversion. A row_security=force setting in one of those columns will elicit a clear message, so don't.	2015-09-20 20:45:41 -04:00
Robert Haas	4a4e6893aa	Glue layer to connect the executor to the shm_mq mechanism. The shm_mq mechanism was built to send error (and notice) messages and tuples between backends. However, shm_mq itself only deals in raw bytes. Since commit `2bd9e412f9`, we have had infrastructure for one message to redirect protocol messages to a queue and for another backend to parse them and do useful things with them. This commit introduces a somewhat analogous facility for tuples by adding a new type of DestReceiver, DestTupleQueue, which writes each tuple generated by a query into a shm_mq, and a new TupleQueueFunnel facility which reads raw tuples out of the queue and reconstructs the HeapTuple format expected by the executor. The TupleQueueFunnel abstraction supports reading from multiple tuple streams at the same time, but only in round-robin fashion. Someone could imaginably want other policies, but this should be good enough to meet our short-term needs related to parallel query, and we can always extend it later. This also makes one minor addition to the shm_mq API that didn't seem worth breaking out as a separate patch. Extracted from Amit Kapila's parallel sequential scan patch. This code was originally written by me, and then it was revised by Amit, and then it was revised some more by me.	2015-09-18 21:56:58 -04:00
Tom Lane	d9c0c728af	Fix low-probability memory leak in regex execution. After an internal failure in shortest() or longest() while pinning down the exact location of a match, find() forgot to free the DFA structure before returning. This is pretty unlikely to occur, since we just successfully ran the "search" variant of the DFA; but it could happen, and it would result in a session-lifespan memory leak since this code uses malloc() directly. Problem seems to have been aboriginal in Spencer's library, so back-patch all the way. In passing, correct a thinko in a comment I added awhile back about the meaning of the "ntree" field. I happened across these issues while comparing our code to Tcl's version of the library.	2015-09-18 13:55:17 -04:00
Robert Haas	8dd401aa07	Add new function planstate_tree_walker. ExplainPreScanNode knows how to iterate over a generic tree of plan states; factor that logic out into a separate walker function so that other code, such as upcoming patches for parallel query, can also use it. Patch by me, reviewed by Tom Lane.	2015-09-17 11:27:06 -04:00
Teodor Sigaev	22f519c92a	Fix bug introduced by microvacuum for GiST Commit `013ebc0a7b` introduces microvacuum for GiST, deletion of tuple marked LP_DEAD uses IndexPageMultiDelete while recovery code uses IndexPageTupleDelete in loop. This causes a difference in offset numbers of tuples to delete. Patch introduces usage of IndexPageMultiDelete in GiST except gistplacetopage() where only one tuple is deleted at once. That also slightly improves performance, because IndexPageMultiDelete is more effective. Patch changes WAL format, so bump wal page magic. Bug report from Jeff Janes Diagnostic and patch by Anastasia Lubennikova and me	2015-09-17 14:22:37 +03:00
Robert Haas	7aea8e4f2d	Determine whether it's safe to attempt a parallel plan for a query. Commit `924bcf4f16` introduced a framework for parallel computation in PostgreSQL that makes most but not all built-in functions safe to execute in parallel mode. In order to have parallel query, we'll need to be able to determine whether that query contains functions (either built-in or user-defined) that cannot be safely executed in parallel mode. This requires those functions to be labeled, so this patch introduces an infrastructure for that. Some functions currently labeled as safe may need to be revised depending on how pending issues related to heavyweight locking under parallelism are resolved. Parallel plans can't be used except for the case where the query will run to completion. If portal execution were suspended, the parallel mode restrictions would need to remain in effect during that time, but that might make other queries fail. Therefore, this patch introduces a framework that enables consideration of parallel plans only when it is known that the plan will be run to completion. This probably needs some refinement; for example, at bind time, we do not know whether a query run via the extended protocol will be executed to completion or run with a limited fetch count. Having the client indicate its intentions at bind time would constitute a wire protocol break. Some contexts in which parallel mode would be safe are not adjusted by this patch; the default is not to try parallel plans except from call sites that have been updated to say that such plans are OK. This commit doesn't introduce any parallel paths or plans; it just provides a way to determine whether they could potentially be used. I'm committing it on the theory that the remaining parallel sequential scan patches will also get committed to this release, hopefully in the not-too-distant future. Robert Haas and Amit Kapila. Reviewed (in earlier versions) by Noah Misch.	2015-09-16 15:38:47 -04:00
Tom Lane	b44d92b67b	Sync regex code with Tcl 8.6.4. Sync our regex code with upstream changes since last time we did this, which was Tcl 8.5.11 (see commit `08fd6ff37f`). The only functional change here is to disbelieve that an octal escape is three digits long if it would exceed \377. That's a bug fix, but it's a minor one and could change the interpretation of working regexes, so don't back-patch. In addition to that, s/INFINITY/DUPINF/ to eliminate the risk of collisions with <math.h>'s macro, and s/LOCAL/NOPROP/ because that also seems like an unnecessarily collision-prone macro name. There were some other cosmetic changes in their copy that I did not adopt, notably a rather half-hearted attempt at renaming some of the C functions in a more verbose style. (I'm not necessarily against the concept, but renaming just a few functions in the package is not an improvement.)	2015-09-16 15:25:25 -04:00
Tom Lane	ad584a08c1	Remove no-longer-used T_PrivGrantee node tag. Oversight in commit `31eae6028e`, which replaced PrivGrantee nodes with RoleSpec nodes. Spotted by Yugo Nagata.	2015-09-16 10:48:11 -04:00
Stephen Frost	22eaf35c1d	RLS refactoring This refactors rewrite/rowsecurity.c to simplify the handling of the default deny case (reducing the number of places where we check for and add the default deny policy from three to one) by splitting up the retrieval of the policies from the application of them. This also allowed us to do away with the policy_id field. A policy_name field was added for WithCheckOption policies and is used in error reporting, when available. Patch by Dean Rasheed, with various mostly cosmetic changes by me. Back-patch to 9.5 where RLS was introduced to avoid unnecessary differences, since we're still in alpha, per discussion with Robert.	2015-09-15 15:49:31 -04:00
Fujii Masao	05ec71eea2	Fix comment regarding the meaning of infinity for timeline history entry Michael Paquier	2015-09-15 23:38:01 +09:00
Robert Haas	a7212a9997	Install lwlocknames.h even in vpath builds. Per buildfarm member crake.	2015-09-11 16:45:41 -04:00
Robert Haas	2ccc4e972e	Fix build problems in commit `aa65de042f`. The previous way didn't work for vpath builds, and make distprep was busted too. Reported off-list by Andres Freund.	2015-09-11 14:56:17 -04:00
Robert Haas	aa65de042f	When trace_lwlocks is used, identify individual lwlocks by name. Naming the individual lwlocks seems like something that may be useful for other types of debugging, monitoring, or instrumentation output, but this commit just implements it for the specific case of trace_lwlocks. Patch by me, reviewed by Amit Kapila and Kyotaro Horiguchi	2015-09-11 14:01:39 -04:00
Teodor Sigaev	013ebc0a7b	Microvacuum for GIST Mark index tuple as dead if it's pointed to by kill_prior_tuple during ordinary (search) scan and remove it during insert process if there is not enough space for the new tuple to insert. This improves select performance because index will not return tuple marked as dead and improves insert performance because it reduces the number of page splits. Anastasia Lubennikova <a.lubennikova@postgrespro.ru> with minor editorialization by me	2015-09-09 18:43:37 +03:00
Fujii Masao	96f6a0cb41	Remove files signaling a standby promotion request at postmaster startup This commit makes postmaster forcibly remove the files signaling a standby promotion request. Otherwise, the existence of those files can trigger a promotion too early, whether a user wants that or not. This removal of files is usually unnecessary because they can exist only during a few moments during a standby promotion. However there is a race condition: if pg_ctl promote is executed and creates the files during a promotion, the files can stay around even after the server is brought up to new master. Then, if new standby starts by using the backup taken from that master, the files can exist at the server startup and should be removed in order to avoid an unexpected promotion. Back-patch to 9.1 where promote signal file was introduced. Problem reported by Feike Steenbergen. Original patch by Michael Paquier, modified by me. Discussion: 20150528100705.4686.91426@wrigleys.postgresql.org	2015-09-09 22:51:44 +09:00
Alvaro Herrera	1aba62ec63	Allow per-tablespace effective_io_concurrency Per discussion, nowadays it is possible to have tablespaces that have wildly different I/O characteristics from others. Setting different effective_io_concurrency parameters for those has been measured to improve performance. Author: Julien Rouhaud Reviewed by: Andres Freund	2015-09-08 12:51:42 -03:00
Andres Freund	c314ead5be	Add ability to reserve WAL upon slot creation via replication protocol. Since `6fcd885` it is possible to immediately reserve WAL when creating a slot via pg_create_physical_replication_slot(). Extend the replication protocol to allow that as well. Although, in contrast to the SQL interface, it is possible to update the reserved location via the replication interface, it is still useful being able to reserve upon creation there. Otherwise the logic in ReplicationSlotReserveWal() has to be repeated in slot employing clients. Author: Michael Paquier Discussion: CAB7nPqT0Wc1W5mdYGeJ_wbutbwNN+3qgrFR64avXaQCiJMGaYA@mail.gmail.com	2015-09-06 13:30:57 +02:00
Heikki Linnakangas	c80b5f66c6	Fix misc typos. Oskari Saarenmaa. Backpatch to stable branches where applicable.	2015-09-05 11:35:49 +03:00
Tom Lane	c5454f99c4	Fix subtransaction cleanup after an outer-subtransaction portal fails. Formerly, we treated only portals created in the current subtransaction as having failed during subtransaction abort. However, if the error occurred while running a portal created in an outer subtransaction (ie, a cursor declared before the last savepoint), that has to be considered broken too. To allow reliable detection of which ones those are, add a bookkeeping field to struct Portal that tracks the innermost subtransaction in which each portal has actually been executed. (Without this, we'd end up failing portals containing functions that had called the subtransaction, thereby breaking plpgsql exception blocks completely.) In addition, when we fail an outer-subtransaction Portal, transfer its resources into the subtransaction's resource owner, so that they're released early in cleanup of the subxact. This fixes a problem reported by Jim Nasby in which a function executed in an outer-subtransaction cursor could cause an Assert failure or crash by referencing a relation created within the inner subtransaction. The proximate cause of the Assert failure is that AtEOSubXact_RelationCache assumed it could blow away a relcache entry without first checking that the entry had zero refcount. That was a bad idea on its own terms, so add such a check there, and to the similar coding in AtEOXact_RelationCache. This provides an independent safety measure in case there are still ways to provoke the situation despite the Portal-level changes. This has been broken since subtransactions were invented, so back-patch to all supported branches. Tom Lane and Michael Paquier	2015-09-04 13:37:14 -04:00
Robert Haas	4aec49899e	Assorted code review for recent ProcArrayLock patch. Post-commit review by Andres Freund discovered a couple of concurrency bugs in the original patch: specifically, if the leader cleared a follower's XID before it reached PGSemaphoreLock, the semaphore would be left in the wrong state; and if another process did PGSemaphoreUnlock for some unrelated reason, we might resume execution before the fact that our XID was cleared was globally visible. Also, improve the wording of some comments, rename nextClearXidElem to firstClearXidElem in PROC_HDR for clarity, and drop some volatile qualifiers that aren't necessary. Amit Kapila, reviewed and slightly revised by me.	2015-09-03 13:19:15 -04:00
Teodor Sigaev	30bb26b5e0	Allow usage of huge maintenance_work_mem for GIN build. Currently, the in-memory posting list during the GIN build process is limited to 1GB because it uses repalloc. The patch replaces the call to repalloc with repalloc_huge. This raises the posting list limit from 180 million (1GB / sizeof(ItemPointerData)) to 4 billion, bounded by the maxcount/count fields in GinEntryAccumulator and subsequent calls. Check added. Also, fix accounting of allocatedMemory during build to prevent integer overflow with maintenance_work_mem > 4GB. Robert Abraham <robert.abraham86@googlemail.com> with additions by me	2015-09-02 20:08:58 +03:00
Tom Lane	123c9d2fc1	Clean up icc + ia64 situation. Some googling turned up multiple sources saying that older versions of icc do not accept gcc-compatible asm blocks on IA64, though asm does work on x86[_64]. This is apparently fixed as of icc version 12.0 or so, but that doesn't help us much; if we have to carry the extra implementation anyway, we may as well just use it for icc rather than add a compiler version test. Hence, revert commit `2c713d6ea2` (though I separated the icc code from the gcc code completely, producing what seems cleaner code). Document the state of affairs more explicitly, both in s_lock.h and postgres.c, and make some cosmetic adjustments around the IA64 code in s_lock.h.	2015-08-31 18:10:04 -04:00
Tom Lane	cf25b2a2f9	Allow icc to use the same atomics infrastructure as gcc. The atomics headers were written under the impression that icc doesn't handle gcc-style asm blocks, but this is demonstrably false on x86_[64], because s_lock.h has done it that way for more than a decade. (The jury is still out on whether this also works on ia64, so I'm leaving ia64-related code alone for the moment.) Treat gcc and icc the same in these headers. This is less code and it should improve the results for icc, because we hadn't gotten around to providing icc-specific implementations for most of the atomics.	2015-08-31 16:30:12 -04:00
Tom Lane	f333204bbc	Actually, it's not that hard to merge the Windows pqsignal code ... ... just need to typedef sigset_t and provide sigemptyset/sigfillset, which are easy enough.	2015-08-31 15:52:56 -04:00
Tom Lane	a65e086453	Remove support for Unix systems without the POSIX signal APIs. Remove configure's checks for HAVE_POSIX_SIGNALS, HAVE_SIGPROCMASK, and HAVE_SIGSETJMP. These APIs are required by the Single Unix Spec v2 (POSIX 1997), which we generally consider to define our minimum required set of Unix APIs. Moreover, no buildfarm member has reported not having them since 2012 or before, which means that even if the code is still live somewhere, it's untested --- and we've made plenty of signal-handling changes of late. So just take these APIs as given and save the cycles for configure probes for them. However, we can't remove as much C code as I'd hoped, because the Windows port evidently still uses the non-POSIX code paths for signal masking. Since we're largely emulating these BSD-style APIs for Windows anyway, it might be a good thing to switch over to POSIX-like notation and thereby remove a few more #ifdefs. But I'm not in a position to code or test that. In the meantime, we can at least make things a bit more transparent by testing for WIN32 explicitly in these places.	2015-08-31 12:56:10 -04:00
Tom Lane	0f19d0f12f	Remove long-dead support for platforms without sig_atomic_t. C89 requires <signal.h> to define sig_atomic_t, and there is no evidence in the buildfarm that any supported platforms don't comply. Remove the configure test to stop wasting build cycles on a purely historical issue. (Once upon a time, we cared about supporting C89-compliant compilers on machines with pre-C89 system headers, but that use-case has been dead for quite a few years.) I have some other fixes planned in this area, but let's start with this to see if the buildfarm produces any surprising results.	2015-08-31 01:36:46 -04:00
Tom Lane	c41a1215f0	Fix s_lock.h PPC assembly code to be compatible with native AIX assembler. On recent AIX it's necessary to configure gcc to use the native assembler (because the GNU assembler hasn't been updated to handle AIX 6+). This caused PG builds to fail with assembler syntax errors, because we'd try to compile s_lock.h's gcc asm fragment for PPC, and that assembly code relied on GNU-style local labels. We can't substitute normal labels because it would fail in any file containing more than one inlined use of tas(). Fortunately, that code is stable enough, and the PPC ISA is simple enough, that it doesn't seem like too much of a maintenance burden to just hand-code the branch offsets, removing the need for any labels. Note that the AIX assembler only accepts "$" for the location counter pseudo-symbol. The usual GNU convention is "."; but it appears that all versions of gas for PPC also accept "$", so in theory this patch will not break any other PPC platforms. This has been reported by a few people, but Steve Underwood gets the credit for being the first to pursue the problem far enough to understand why it was failing. Thanks also to Noah Misch for additional testing.	2015-08-29 16:09:25 -04:00
Tom Lane	7b5ef8f2d0	Limit the verbosity of memory context statistics dumps. We had a report from Stefan Kaltenbrunner of a case in which postmaster log files overran available disk space because multiple backends spewed enormous context stats dumps upon hitting an out-of-memory condition. Given the lack of similar reports, this isn't a common problem, but it still seems worth doing something about. However, we don't want to just blindly truncate the output, because that might prevent diagnosis of OOM problems. What seems like a workable compromise is to limit the dump to 100 child contexts per parent, and summarize the space used within any additional child contexts. That should help because practical cases where the dump gets long will typically be huge numbers of siblings under the same parent context; while the additional debugging value from seeing details about individual siblings beyond 100 will not be large, we hope. Anyway it doesn't take much code or memory space to do this, so let's try it like this and see how things go. Since the summarization mechanism requires passing totals back up anyway, I took the opportunity to add a "grand total" line to the end of the printout.	2015-08-25 13:09:48 -04:00
Tom Lane	44ed65a545	Avoid use of float arithmetic in bipartite_match.c. Since the distances used in this algorithm are small integers (not more than the size of the U set, in fact), there is no good reason to use float arithmetic for them. Use short ints instead: they're smaller, faster, and require no special portability assumptions. Per testing by Greg Stark, which disclosed that the code got into an infinite loop on VAX for lack of IEEE-style float infinities. We don't really care all that much whether Postgres can run on a VAX anymore, but there seems sufficient reason to change this code anyway. In passing, make a few other small adjustments to make the code match usual Postgres coding style a bit better.	2015-08-23 13:02:18 -04:00
Alvaro Herrera	8c3d63c521	Remove ExecGetScanType function This became unused in `a191a169d6`.	2015-08-21 14:11:58 -03:00
Stephen Frost	3c99788797	Rename 'cmd' to 'cmd_name' in CreatePolicyStmt To avoid confusion, rename CreatePolicyStmt's 'cmd' to 'cmd_name', parse_policy_command's 'cmd' to 'polcmd', and AlterPolicy's 'cmd_datum' to 'polcmd_datum', per discussion with Noah and as a follow-up to his correction of copynodes/equalnodes handling of the CreatePolicyStmt 'cmd' field. Back-patch to 9.5 where the CreatePolicyStmt was introduced, as we are still only in alpha.	2015-08-21 08:22:22 -04:00
Simon Riggs	47167b7907	Reduce lock levels for ALTER TABLE SET autovacuum storage options Reduce lock levels down to ShareUpdateExclusiveLock for all autovacuum-related relation options when setting them using ALTER TABLE. Add infrastructure to allow varying lock levels for relation options in later patches. Setting multiple options together uses the highest lock level required for any option. Works for both main and toast tables. Fabrízio Mello, reviewed by Michael Paquier, mild edit and additional regression tests from myself	2015-08-14 14:19:28 +01:00
Heikki Linnakangas	36e863bbd4	Run autoheader to add a few missing #defines to pg_config.h.in. These are emitted by the new ax_pthread.m4 script version. They are not used for anything in PostgreSQL, but let's keep the generated header file up-to-date. Andres Freund	2015-08-13 14:37:46 +03:00
Alvaro Herrera	ccc4c07499	Close some holes in BRIN page assignment In some corner cases, it is possible for the BRIN index relation to be extended by brin_getinsertbuffer but the new page not be used immediately for anything by its callers; when this happens, the page is initialized and the FSM is updated (by brin_getinsertbuffer) with the info about that page, but these actions are not WAL-logged. A later index insert/update can use the page, but since the page is already initialized, the initialization itself is not WAL-logged then either. Replay of this sequence of events causes recovery to fail altogether. There is a related corner case within brin_getinsertbuffer itself, in which we extend the relation to put a new index tuple there, but later find out that we cannot do so, and do not return the buffer; the page obtained from extension is not even initialized. The resulting page is lost forever. To fix, shuffle the code so that initialization is not the responsibility of brin_getinsertbuffer anymore, in normal cases; instead, the initialization is done by its callers (brin_doinsert and brin_doupdate) once they're certain that the page is going to be used. When either those functions determine that the new page cannot be used, before bailing out they initialize the page as an empty regular page, enter it in FSM and WAL-log all this. This way, the page is usable for future index insertions, and WAL replay doesn't find trying to insert tuples in pages whose initialization didn't make it to the WAL. The same strategy is used in brin_getinsertbuffer when it cannot return the new page. Additionally, add a new step to vacuuming so that all pages of the index are scanned; whenever an uninitialized page is found, it is initialized as empty and WAL-logged. This closes the hole that the relation is extended but the system crashes before anything is WAL-logged about it. We also take this opportunity to update the FSM, in case it has gotten out of date. Thanks to Heikki Linnakangas for finding the problem that kicked some additional analysis of BRIN page assignment code. Backpatch to 9.5, where BRIN was introduced. Discussion: https://www.postgresql.org/message-id/20150723204810.GY5596@postgresql.org	2015-08-12 14:20:38 -03:00
Tom Lane	68fa28f771	Postpone extParam/allParam calculations until the very end of planning. Until now we computed these Param ID sets at the end of subquery_planner, but that approach depends on subquery_planner returning a concrete Plan tree. We would like to switch over to returning one or more Paths for a subquery, and in that representation the necessary details aren't fully fleshed out (not to mention that we don't really want to do this work for Paths that end up getting discarded). Hence, refactor so that we can compute the param ID sets at the end of planning, just before set_plan_references is run. The main change necessary to make this work is that we need to capture the set of outer-level Param IDs available to the current query level before exiting subquery_planner, since the outer levels' plan_params lists are transient. (That's not going to pose a problem for returning Paths, since all the work involved in producing that data is part of expression preprocessing, which will continue to happen before Paths are produced.) On the plus side, this change gets rid of several existing kluges. Eventually I'd like to get rid of SS_finalize_plan altogether in favor of doing this work during set_plan_references, but that will require some complex rejiggering because SS_finalize_plan needs to visit subplans and initplans before the main plan. So leave that idea for another day.	2015-08-11 23:48:37 -04:00
Alvaro Herrera	4901b2f495	Don't include rel.h when relcache.h is sufficient Trivial change to reduce exposure of rel.h.	2015-08-11 13:03:14 -03:00
Andres Freund	6fcd88511f	Allow pg_create_physical_replication_slot() to reserve WAL. When creating a physical slot it's often useful to immediately reserve the current WAL position instead of only doing after the first feedback message arrives. That e.g. allows slots to guarantee that all the WAL for a base backup will be available afterwards. Logical slots already have to reserve WAL during creation, so generalize that logic into being usable for both physical and logical slots. Catversion bump because of the new parameter. Author: Gurjeet Singh Reviewed-By: Andres Freund Discussion: CABwTF4Wh_dBCzTU=49pFXR6coR4NW1ynb+vBqT+Po=7fuq5iCw@mail.gmail.com	2015-08-11 12:34:31 +02:00
Andres Freund	093d0c83c1	Introduce macros determining if a replication slot is physical or logical. These make the code a bit easier to read, and make it easier to add a more explicit notion of a slot's type at some point in the future. Author: Gurjeet Singh Discussion: CABwTF4Wh_dBCzTU=49pFXR6coR4NW1ynb+vBqT+Po=7fuq5iCw@mail.gmail.com	2015-08-11 12:32:48 +02:00
Tom Lane	1f64ec6fd2	Accept alternate spellings of __sparcv7 and __sparcv8. Apparently some versions of gcc prefer __sparc_v7__ and __sparc_v8__. Per report from Waldemar Brodkorb.	2015-08-10 17:34:51 -04:00
Andres Freund	3f811c2d6f	Add confirmed_flush column to pg_replication_slots. There's no reason not to expose both restart_lsn and confirmed_flush since they have rather distinct meanings. The former is the oldest WAL still required and valid for both physical and logical slots, whereas the latter is the location up to which a logical slot's consumer has confirmed receiving data. Most of the time a slot will require older WAL (i.e. restart_lsn) than the confirmed position (i.e. confirmed_flush_lsn). Author: Marko Tiikkaja, editorialized by me Discussion: 559D110B.1020109@joh.to	2015-08-10 13:28:18 +02:00
Andres Freund	5a33650f24	Attempt to work around a 32bit xlc compiler bug from a different place. In `de6fd1c8` I moved the workaround from 53f73879 into the aix template. The previous location was removed in the former commit, and I thought that it would be nice to emit a warning when running configure. That didn't turn out to work because at the point the template is included we don't know whether we're compiling a 32/64 bit binary and it's possible to install compilers for both on a 64 bit kernel/OS. So go back to a less ambitious approach and define PG_FORCE_DISABLE_INLINE in port/aix.h, without emitting a warning. We could try a more fancy approach, but it doesn't seem worth it. This requires moving the check for PG_FORCE_DISABLE_INLINE in c.h to after including the system headers included from therein which isn't perfect, as it seems slightly more robust to include all system headers in a similar environment. Oh well. Discussion: 20150807132000.GC13310@awork2.anarazel.de	2015-08-08 01:19:02 +02:00
Andres Freund	4eda0a6470	Don't include low level locking code from frontend code. Some frontend code, e.g. pg_xlogdump or pg_resetxlog, has to use backend headers. Unfortunately until now that code includes most of the locking code. It's generally not nice to expose such low level details, but `de6fd1c898` made that a hard problem. We fall back to defining 'inline' away if the compiler doesn't support it - that can cause linker errors like on buildfarm animal pademelon if an inline function references backend only code. To fix that problem separate definitions from lock.h that are required from frontend code into lockdefs.h and use it in the relevant places. I've only removed the minimal amount of necessary definitions for now - it might turn out that we want more for other reasons. To avoid such details being exposed again put some checks against being included from frontend code into atomics.h, lock.h, lwlock.h and s_lock.h. It's otherwise fairly easy to indirectly include these headers. Discussion: 20150806070902.GE12214@awork2.anarazel.de	2015-08-07 15:10:56 +02:00
Tom Lane	cde35cf4ae	Fix eclass_useful_for_merging to give valid results for appendrel children. Formerly, this function would always return "true" for an appendrel child relation, because it would think that the appendrel parent was a potential join target for the child. In principle that should only lead to some inefficiency in planning, but fuzz testing by Andreas Seltenreich disclosed that it could lead to "could not find pathkey item to sort" planner errors in odd corner cases. Specifically, we would think that all columns of a child table's multicolumn index were interesting pathkeys, causing us to generate a MergeAppend path that sorts by all the columns. However, if any of those columns weren't actually used above the level of the appendrel, they would not get added to that rel's targetlist, which would result in being unable to resolve the MergeAppend's sort keys against its targetlist during createplan.c. Backpatch to 9.3. In older versions, columns of an appendrel get added to its targetlist even if they're not mentioned above the scan level, so that the failure doesn't occur. It might be worth back-patching this fix to older versions anyway, but I'll refrain for the moment.	2015-08-06 20:14:53 -04:00
Robert Haas	0e141c0fbb	Reduce ProcArrayLock contention by removing backends in batches. When a write transaction commits, it must clear its XID advertised via the ProcArray, which requires that we hold ProcArrayLock in exclusive mode in order to prevent concurrent processes running GetSnapshotData from seeing inconsistent results. When many processes try to commit at once, ProcArrayLock must change hands repeatedly, with each concurrent process trying to commit waking up to acquire the lock in turn. To make things more efficient, when more than one backend is trying to commit a write transaction at the same time, have just one of them acquire ProcArrayLock in exclusive mode and clear the XIDs of all processes in the group. Benchmarking reveals that this is much more efficient at very high client counts. Amit Kapila, heavily revised by me, with some review also from Pavan Deolasee.	2015-08-06 12:02:12 -04:00
Andres Freund	3a145757a0	Improve includes introduced in the replication origins patch. pg_resetxlog.h contained two superfluous includes, origin.h superfluously depended on logical.h, and pg_xlogdump's rmgrdesc.h only indirectly included origin.h. Backpatch: 9.5, where replication origins were introduced.	2015-08-06 12:41:46 +02:00
Noah Misch	b8fe12a836	Reconcile nodes/*funcs.c with recent work. A few of the discrepancies had semantic significance, but I did not track down the resulting user-visible bugs, if any. Back-patch to 9.5, where all but one discrepancy appeared. The _equalCreateEventTrigStmt() situation dates to 9.3 but does not affect semantics. catversion bump due to readfuncs.c field order changes.	2015-08-05 20:44:27 -04:00
Alvaro Herrera	2834855cb9	Fix BRIN to use SnapshotAny during summarization For correctness of summarization results, it is critical that the snapshot used during the summarization scan is able to see all tuples that are live to all transactions -- including tuples inserted or deleted by in-progress transactions. Otherwise, it would be possible for a transaction to insert a tuple, then idle for a long time while a concurrent transaction executes summarization of the range: this would result in the inserted value not being considered in the summary. Previously we were trying to use a MVCC snapshot in conjunction with adding a "placeholder" tuple in the index: the snapshot would see all committed tuples, and the placeholder tuple would catch insertions by any new inserters. The hole is that prior insertions by transactions that are still in progress by the time the MVCC snapshot was taken were ignored. Kevin Grittner reported this as a bogus error message during vacuum with default transaction isolation mode set to repeatable read (because the error report mentioned a function name not being invoked during), but the problem is larger than that. To fix, tweak IndexBuildHeapRangeScan to have a new mode that behaves the way we need using SnapshotAny visibility rules. This change simplifies the BRIN code a bit, mainly by removing large comments that were mistaken. Instead, rely on the SnapshotAny semantics to provide what it needs. (The business about a placeholder tuple needs to remain: that covers the case that a transaction inserts a tuple in a page that summarization already scanned.) Discussion: https://www.postgresql.org/message-id/20150731175700.GX2441@postgresql.org In passing, remove a couple of unused declarations from brin.h and reword a comment to be proper English. This part submitted by Kevin Grittner. Backpatch to 9.5, where BRIN was introduced.	2015-08-05 16:20:50 -03:00
Andres Freund	de6fd1c898	Rely on inline functions even if that causes warnings in older compilers. So far we have worked around the fact that some very old compilers do not support 'inline' functions by only using inline functions conditionally (or not at all). Since such compilers are very rare by now, we have decided to rely on inline functions from 9.6 onwards. To avoid breaking these old compilers inline is defined away when not supported. That'll cause "function x defined but not used" type of warnings, but since nobody develops on such compilers anymore that's ok. This change in policy will allow us to more easily employ inline functions. I chose to remove code previously conditional on PG_USE_INLINE as it seemed confusing to have code dependent on a define that's always defined. Blacklisting of compilers, like in `c53f73879f`, now has to be done differently. A platform template can define PG_FORCE_DISABLE_INLINE to force inline to be defined empty. Discussion: 20150701161447.GB30708@awork2.anarazel.de	2015-08-05 18:19:52 +02:00
Andres Freund	073082bbb1	Fix comment atomics.h. I appear to accidentally have switched the comments for pg_atomic_write_u32 and pg_atomic_read_u32 around. Also fix some minor typos I found while fixing. Noticed-By: Amit Kapila Backpatch: 9.5	2015-08-05 13:06:04 +02:00
Tom Lane	8ea3e7a75c	Fix bogus "out of memory" reports in tuplestore.c. The tuplesort/tuplestore memory management logic assumed that the chunk allocation overhead for its memtuples array could not increase when increasing the array size. This is and always was true for tuplesort, but we (I, I think) blindly copied that logic into tuplestore.c without noticing that the assumption failed to hold for the much smaller array elements used by tuplestore. Given rather small work_mem, this could result in an improper complaint about "unexpected out-of-memory situation", as reported by Brent DeSpain in bug #13530. The easiest way to fix this is just to increase tuplestore's initial array size so that the assumption holds. Rather than relying on magic constants, though, let's export a #define from aset.c that represents the safe allocation threshold, and make tuplestore's calculation depend on that. Do the same in tuplesort.c to keep the logic looking parallel, even though tuplesort.c isn't actually at risk at present. This will keep us from breaking it if we ever muck with the allocation parameters in aset.c. Back-patch to all supported versions. The error message doesn't occur pre-9.3, not so much because the problem can't happen as because the pre-9.3 tuplestore code neglected to check for it. (The chance of trouble is a great deal larger as of 9.3, though, due to changes in the array-size-increasing strategy.) However, allowing LACKMEM() to become true unexpectedly could still result in less-than-desirable behavior, so let's patch it all the way back.	2015-08-04 18:18:46 -04:00
Heikki Linnakangas	804163bc25	Share transition state between different aggregates when possible. If there are two different aggregates in the query with same inputs, and the aggregates have the same initial condition and transition function, only calculate the state value once, and only call the final functions separately. For example, AVG(x) and SUM(x) aggregates have the same transition function, which accumulates the sum and number of input tuples. For a query like "SELECT AVG(x), SUM(x) FROM x", we can therefore accumulate the state function only once, which gives a nice speedup. David Rowley, reviewed and edited by me.	2015-08-04 17:53:10 +03:00
Tom Lane	d73d14c271	Fix incorrect order of lock file removal and failure to close() sockets. Commit `c9b0cbe98b` accidentally broke the order of operations during postmaster shutdown: it resulted in removing the per-socket lockfiles after, not before, postmaster.pid. This creates a race-condition hazard for a new postmaster that's started immediately after observing that postmaster.pid has disappeared; if it sees the socket lockfile still present, it will quite properly refuse to start. This error appears to be the explanation for at least some of the intermittent buildfarm failures we've seen in the pg_upgrade test. Another problem, which has been there all along, is that the postmaster has never bothered to close() its listen sockets, but has just allowed them to close at process death. This creates a different race condition for an incoming postmaster: it might be unable to bind to the desired listen address because the old postmaster is still incumbent. This might explain some odd failures we've seen in the past, too. (Note: this is not related to the fact that individual backends don't close their client communication sockets. That behavior is intentional and is not changed by this patch.) Fix by adding an on_proc_exit function that closes the postmaster's ports explicitly, and (in 9.3 and up) reshuffling the responsibility for where to unlink the Unix socket files. Lock file unlinking can stay where it is, but teach it to unlink the lock files in reverse order of creation.	2015-08-02 14:55:03 -04:00
Andres Freund	7039760114	Fix issues around the "variable" support in the lwlock infrastructure. The lwlock scalability work introduced two race conditions into the lwlock variable support provided for xlog.c. First, and harmlessly on most platforms, it set/read the variable without the spinlock in some places. Secondly, due to the removal of the spinlock, it was possible that a backend missed changes to the variable's state if it changed at the wrong moment, because checking the lock's state, the variable's state and the queuing are not protected by a single spinlock acquisition anymore. To fix, first move resetting the variable from LWLockAcquireWithVar to WALInsertLockRelease, via a new function LWLockReleaseClearVar. That prevents issues around waiting for a variable's value to change when a new locker has acquired the lock, but not yet set the value. Secondly, re-check that the variable hasn't changed after enqueuing; that prevents the issue that the lock has been released and already re-acquired by the time the woken up backend checks for the lock's state. Reported-By: Jeff Janes Analyzed-By: Heikki Linnakangas Reviewed-By: Heikki Linnakangas Discussion: 5592DB35.2060401@iki.fi Backpatch: 9.5, where the lwlock scalability went in	2015-08-02 18:41:23 +02:00
Alvaro Herrera	e8e86fbc8b	Fix volatility marking of commit timestamp functions They are marked stable, but since they act on instantaneous state and it is possible to consult state of transactions as they commit, the results could change mid-query. They need to be marked volatile, and this commit does so. There would normally be a catversion bump here, but this is so much a niche feature and I don't believe there's real damage from the incorrect marking, that I refrained. Backpatch to 9.5, where commit timestamps were introduced. Per note from Fujii Masao.	2015-07-30 15:19:49 -03:00
Joe Conway	632cd9f892	Create new ParseExprKind for use by policy expressions. Policy USING and WITH CHECK expressions were using EXPR_KIND_WHERE for parse analysis, which results in inappropriate ERROR messages when the expression contains unsupported constructs such as aggregates. Create a new ParseExprKind called EXPR_KIND_POLICY and tailor the related messages to fit. Reported by Noah Misch. Reviewed by Dean Rasheed, Alvaro Herrera, and Robert Haas. Back-patch to 9.5 where RLS was introduced.	2015-07-29 15:40:24 -07:00
Joe Conway	d824e2800f	Disallow converting a table to a view if row security is present. When DefineQueryRewrite() is about to convert a table to a view, it checks the table for features unavailable to views. For example, it rejects tables having triggers. It omits to reject tables having relrowsecurity or a pg_policy record. Fix that. To facilitate the repair, invent relation_has_policies() which indicates the presence of policies on a relation even when row security is disabled for that relation. Reported by Noah Misch. Patch by me, review by Stephen Frost. Back-patch to 9.5 where RLS was introduced.	2015-07-28 16:24:01 -07:00
Joe Conway	f781a0f1d8	Create a pg_shdepend entry for each role in TO clause of policies. CreatePolicy() and AlterPolicy() omit to create a pg_shdepend entry for each role in the TO clause. Fix this by creating a new shared dependency type called SHARED_DEPENDENCY_POLICY and assigning it to each role. Reported by Noah Misch. Patch by me, reviewed by Alvaro Herrera. Back-patch to 9.5 where RLS was introduced.	2015-07-28 16:01:53 -07:00
Joe Conway	1e2bd43b31	Bump catversion so that HEAD is beyond 9.5 As pointed out by Tom, since HEAD has progressed beyond 9.5 in terms of its catalog, we need to be sure catversion of HEAD is advanced beyond that of 9.5. Corrects my mistake in the pg_stats view commit `cfa928ff`.	2015-07-28 13:59:23 -07:00
Joe Conway	7b4bfc87d5	Plug RLS related information leak in pg_stats view. The pg_stats view is supposed to be restricted to only show rows about tables the user can read. However, it sometimes can leak information which could not otherwise be seen when row level security is enabled. Fix that by not showing pg_stats rows to users that would be subject to RLS on the table the row is related to. This is done by creating/using the newly introduced SQL visible function, row_security_active(). Along the way, clean up three call sites of check_enable_rls(). The second argument of that function should only be specified as other than InvalidOid when we are checking as a different user than the current one, as in when querying through a view. These sites were passing GetUserId() instead of InvalidOid, which can cause the function to return incorrect results if the current user has the BYPASSRLS privilege and row_security has been set to OFF. Additionally fix a bug causing RI Trigger error messages to unintentionally leak information when RLS is enabled, and other minor cleanup and improvements. Also add WITH (security_barrier) to the definition of pg_stats. Bumped CATVERSION due to new SQL functions and pg_stats view definition. Back-patch to 9.5 where RLS was introduced. Reported by Yaroslav. Patch by Joe Conway and Dean Rasheed with review and input by Michael Paquier and Stephen Frost.	2015-07-28 13:21:22 -07:00
Andres Freund	426746b930	Remove ssl renegotiation support. While postgres' use of SSL renegotiation is a good idea in theory, it turned out to not work well in practice. The specification and openssl's implementation of it have led to several security issues. Postgres' use of renegotiation also had its share of bugs. Additionally OpenSSL has a bunch of bugs around renegotiation, reported and open for years, that regularly lead to connections breaking with obscure error messages. We tried increasingly complex workarounds to get around these bugs, but we didn't find anything complete. Since these connection breakages often lead to hard to debug problems, e.g. spuriously failing base backups and significant latency spikes when synchronous replication is used, we have decided to change the default setting for ssl renegotiation to 0 (disabled) in the released backbranches and remove it entirely in 9.5 and master. Author: Andres Freund Discussion: 20150624144148.GQ4797@alap3.anarazel.de Backpatch: 9.5 and master, 9.0-9.4 get a different patch	2015-07-28 22:06:31 +02:00
Robert Haas	6f2871f12e	Centralize decision-making about where to get a backend's PGPROC. This code was originally written as part of parallel query effort, but it seems to have independent value, because if we make one decision about where to get a PGPROC when we allocate and then put it back on a different list at backend-exit time, bad things happen. This isn't just a theoretical risk; we fixed an actual problem of this type in commit `e280c630a8`.	2015-07-28 14:51:57 -04:00
Heikki Linnakangas	5533a272dd	Don't assume that 'char' is signed. On some platforms, notably ARM and PowerPC, 'char' is unsigned by default. This fixes an assertion failure at WAL replay on such platforms. Reported by Noah Misch. Backpatch to 9.5, where this was broken.	2015-07-27 21:51:25 +03:00
Heikki Linnakangas	023430abf7	Fix handling of all-zero pages in SP-GiST vacuum. SP-GiST initialized an all-zeros page at vacuum, but that was not WAL-logged, which is not safe. You might get a torn page write, when it gets flushed to disk, and end up with a half-initialized index page. To fix, leave it in the all-zeros state, and add it to the FSM. It will be initialized when reused. Also don't set the page-deleted flag when recycling an empty page. That was also not WAL-logged, and a torn write of that would cause the page to have an invalid checksum. Backpatch to 9.2, where SP-GiST indexes were added.	2015-07-27 12:28:21 +03:00
Tom Lane	dd7a8f66ed	Redesign tablesample method API, and do extensive code review. The original implementation of TABLESAMPLE modeled the tablesample method API on index access methods, which wasn't a good choice because, without specialized DDL commands, there's no way to build an extension that can implement a TSM. (Raw inserts into system catalogs are not an acceptable thing to do, because we can't undo them during DROP EXTENSION, nor will pg_upgrade behave sanely.) Instead adopt an API more like procedural language handlers or foreign data wrappers, wherein the only SQL-level support object needed is a single handler function identified by having a special return type. This lets us get rid of the supporting catalog altogether, so that no custom DDL support is needed for the feature. Adjust the API so that it can support non-constant tablesample arguments (the original coding assumed we could evaluate the argument expressions at ExecInitSampleScan time, which is undesirable even if it weren't outright unsafe), and discourage sampling methods from looking at invisible tuples. Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable within and across queries, as required by the SQL standard, and deal more honestly with methods that can't support that requirement. Make a full code-review pass over the tablesample additions, and fix assorted bugs, omissions, infelicities, and cosmetic issues (such as failure to put the added code stanzas in a consistent ordering). Improve EXPLAIN's output of tablesample plans, too. Back-patch to 9.5 so that we don't have to support the original API in production.	2015-07-25 14:39:00 -04:00
Andres Freund	c1ca3a19df	Fix bug around assignment expressions containing indirections. Handling of assigned-to expressions with indirection (e.g. set f1[1] = 3) was broken for ON CONFLICT DO UPDATE. The problem was that ParseState was consulted to determine if an INSERT-appropriate or UPDATE-appropriate behavior should be used when transforming expressions with indirections. When the wrong path was taken the old row was substituted with NULL, leading to wrong results. To fix, remove p_is_update and only use p_is_insert to decide how to transform the assignment expression, and unset p_is_insert while parsing the on conflict statement. This isn't particularly pretty, but it's not any worse than before. Author: Peter Geoghegan, slightly edited by me Discussion: CAM3SWZS8RPvA=KFxADZWw3wAHnnbxMxDzkEC6fNaFc7zSm411w@mail.gmail.com Backpatch: 9.5, where the feature was introduced	2015-07-24 11:52:07 +02:00
Tom Lane	434873806a	Fix some oversights in BRIN patch. Remove HeapScanDescData.rs_initblock, which wasn't being used for anything in the final version of the patch. Fix IndexBuildHeapScan so that it supports syncscan again; the patch broke synchronous scanning for index builds by forcing rs_startblk to zero even when the caller did not care about that and had asked for syncscan. Add some commentary and usage defenses to heap_setscanlimits(). Fix heapam so that asking for rs_numblocks == 0 does what you would reasonably expect. As coded it amounted to requesting a whole-table scan, because those "--x <= 0" tests on an unsigned variable would behave surprisingly.	2015-07-21 13:38:24 -04:00
Alvaro Herrera	149b1dd840	Fix omission of OCLASS_TRANSFORM in object_classes[] This was forgotten in `cac7658205` (and its fixup `ad89a5d115`). Since it seems way too easy to miss this, this commit also introduces a mechanism to enforce that the array is consistent with the enum. Problem reported independently by Robert Haas and Jaimin Pan. Patches proposed by Jaimin Pan, Jim Nasby, Michael Paquier and myself, though I didn't use any of these and instead went with a cleaner approach suggested by Tom Lane. Backpatch to 9.5. Discussion: https://www.postgresql.org/message-id/CA+Tgmoa6SgDaxW_n_7SEhwBAc=mniYga+obUj5fmw4rU9_mLvA@mail.gmail.com https://www.postgresql.org/message-id/29788.1437411581@sss.pgh.pa.us	2015-07-21 13:20:53 +02:00
Heikki Linnakangas	13f2db2ffb	Handle AT_ReAddComment in test_ddl_deparse, and add a catch-all default. In passing, also move AT_ReAddComment to more logical position in the enum, after all the Constraint-related subcommands. This fixes a compiler warning, added by commit `e42375fc`. Backpatch to 9.5, like that patch.	2015-07-20 10:25:26 +03:00
Andrew Dunstan	e02d44b8a7	Support JSON negative array subscripts everywhere Previously, there was an inconsistency across json/jsonb operators that operate on datums containing JSON arrays -- only some operators supported negative array count-from-the-end subscripting. Specifically, only a new-to-9.5 jsonb deletion operator had support (the new "jsonb - integer" operator). This inconsistency seemed likely to be counter-intuitive to users. To fix, allow all places where the user can supply an integer subscript to accept a negative subscript value, including path-orientated operators and functions, as well as other extraction operators. This will need to be called out as an incompatibility in the 9.5 release notes, since it's possible that users are relying on certain established extraction operators changed here yielding NULL in the event of a negative subscript. For the json type, this requires adding a way of cheaply getting the total JSON array element count ahead of time when parsing arrays with a negative subscript involved, necessitating an ad-hoc lex and parse. This is followed by a "conversion" from a negative subscript to its equivalent positive-wise value using the count. From there on, it's as if a positive-wise value was originally provided. Note that there is still a minor inconsistency here across jsonb deletion operators. Unlike the aforementioned new "-" deletion operator that accepts an integer on its right hand side, the new "#-" path orientated deletion variant does not throw an error when it appears like an array subscript (input that could be recognized as an integer literal) is being used on an object, which is wrong-headed. The reason for not being stricter is that it could be the case that an object pair happens to have a key value that looks like an integer; in general, these two possibilities are impossible to differentiate with rhs path text[] argument elements. 
However, we still don't allow the "#-" path-orientated deletion operator to perform array-style subscripting. Rather, we just return the original left operand value in the event of a negative subscript (which seems analogous to how the established "jsonb/json #> text[]" path-orientated operator may yield NULL in the event of an invalid subscript). In passing, make SetArrayPath() stricter about not accepting cases where there are trailing non-numeric garbage bytes rather than a clean NUL byte. This means, for example, that strings like "10e10" are now not accepted as an array subscript of 10 by some new-to-9.5 path-orientated jsonb operators (e.g. the new #- operator). Finally, remove dead code for jsonb subscript deletion; arguably, this should have been done in commit `b81c7b409`. Peter Geoghegan and Andrew Dunstan	2015-07-17 21:13:47 -04:00
Robert Haas	a04bb65f70	Add new function pg_notification_queue_usage. This tells you what fraction of NOTIFY's queue is currently filled. Brendan Jurd, reviewed by Merlin Moncure and Gurjeet Singh. A few further tweaks by me.	2015-07-17 09:12:03 -04:00
Heikki Linnakangas	321eed5f0f	Add ALTER OPERATOR command, for changing selectivity estimator functions. Other options cannot be changed, as it's not totally clear if cached plans would need to be invalidated if one of the other options changes. Selectivity estimator functions only change plan costs, not correctness of plans, so those should be safe. Original patch by Uriy Zhuravlev, heavily edited by me.	2015-07-14 18:17:55 +03:00
Heikki Linnakangas	e42375fc81	Retain comments on indexes and constraints at ALTER TABLE ... TYPE ... When a column's datatype is changed, ATExecAlterColumnType() rebuilds all the affected indexes and constraints, and the comments from the old indexes/constraints were not carried over. To fix, create a synthetic COMMENT ON command in the work queue, to re-add any comments on constraints. For indexes, there's a comment field in IndexStmt that is used. This fixes bug #13126, reported by Kirill Simonov. Original patch by Michael Paquier, reviewed by Petr Jelinek and me. This bug is present in all versions, but only backpatch to 9.5. Given how minor the issue is, it doesn't seem worth the work and risk to backpatch further than that.	2015-07-14 11:40:22 +03:00
Fujii Masao	6ba365aa46	Fix obsolete comment regarding NOTICE message level. By default, NOTICE messages are not sent to the server log, because the default value of log_min_messages has been WARNING since 8.4. Pavel Stehule	2015-07-09 22:52:36 +09:00
Noah Misch	1e700e0fa0	Given a gcc-compatible xlc compiler, prefer xlc-style atomics. This evades a ppc64le "IBM XL C/C++ for Linux" compiler bug. Back-patch to 9.5, where the atomics facility was introduced.	2015-07-08 20:44:21 -04:00
Noah Misch	0d32d2e693	Finish generic-xlc.h draft atomics implementation. Back-patch to 9.5, where commit `b64d92f1a5` introduced this file.	2015-07-08 20:44:21 -04:00
Noah Misch	be8b06c364	Revoke support for strxfrm() that write past the specified array length. This formalizes a decision implicit in commit `4ea51cdfe8` and adds clean detection of affected systems. Vendor updates are available for each such known bug. Back-patch to 9.5, where the aforementioned commit first appeared.	2015-07-08 20:44:21 -04:00
Tom Lane	10fb48d66d	Add an optional missing_ok argument to SQL function current_setting(). This allows convenient checking for existence of a GUC from SQL, which is particularly useful when dealing with custom variables. David Christensen, reviewed by Jeevan Chalke	2015-07-02 16:41:07 -04:00
Heikki Linnakangas	7261172430	Remove obsolete heap_formtuple/modifytuple/deformtuple functions. These variants used the old-style 'n'/' ' NULL indicators. The new-style functions have been available since version 8.1. That should be long enough that if there is still any old external code using these functions, they can just switch to the new functions without worrying about backwards compatibility. Peter Geoghegan	2015-07-02 21:21:23 +03:00
Heikki Linnakangas	7931622d1d	Fix name of argument to pg_stat_file. It's called "missing_ok" in the docs and in the C code. I refrained from doing a catversion bump for this, because the name of an input argument is just documentation, it has no effect on any callers. Michael Paquier	2015-07-02 12:15:13 +03:00
Fujii Masao	fb174687f7	Make use of xlog_internal.h's macros in WAL-related utilities. Commit `179cdd09` added macros to check if a filename is a WAL segment or other such file. However there were still some instances of the strlen + strspn combination to check for that in WAL-related utilities like pg_archivecleanup. Those checks can be replaced with the macros. This patch makes use of the macros in those utilities, which makes the code a bit easier to read. Back-patch to 9.5. Michael Paquier	2015-07-02 10:35:38 +09:00
Tom Lane	cf8d65de10	Stamp HEAD as 9.6devel. Let the hacking begin ...	2015-06-30 14:01:15 -04:00
Heikki Linnakangas	302ac7f271	Add assertion to check the special size is sane before dereferencing it. This seems useful to catch errors of the sort I just fixed, where PageGetSpecialPointer is called before initializing the page.	2015-06-30 13:44:04 +03:00
Tom Lane	f78329d594	Stamp 9.5alpha1.	2015-06-29 15:42:18 -04:00
Tom Lane	cbc8d65639	Code + docs review for escaping of option values (commit `11a020eb6`). Avoid memory leak from incorrect choice of how to free a StringInfo (resetStringInfo doesn't do it). Now that pg_split_opts doesn't scribble on the optstr, mark that as "const" for clarity. Attach the commentary in protocol.sgml to the right place, and add documentation about the user-visible effects of this change on postgres' -o option and libpq's PGOPTIONS option.	2015-06-29 12:42:52 -04:00
Andres Freund	07cb8b02ab	Replace ia64 S_UNLOCK compiler barrier with a full memory barrier. _Asm_sched_fence() is just a compiler barrier, not a memory barrier. But spinlock release on IA64 needs, at the very least, release semantics. Use a full barrier instead. This might be the cause for the occasional failures on buildfarm member anole. Discussion: 20150629101108.GB17640@alap3.anarazel.de	2015-06-29 14:53:32 +02:00
Tom Lane	62d16c7fc5	Improve design and implementation of pg_file_settings view. As first committed, this view reported on the file contents as they were at the last SIGHUP event. That's not as useful as reporting on the current contents, and what's more, it didn't work right on Windows unless the current session had serviced at least one SIGHUP. Therefore, arrange to re-read the files when pg_show_all_settings() is called. This requires only minor refactoring so that we can pass changeVal = false to set_config_option() so that it won't actually apply any changes locally. In addition, add error reporting so that errors that would prevent the configuration files from being loaded, or would prevent individual settings from being applied, are visible directly in the view. This makes the view usable for pre-testing whether edits made in the config files will have the desired effect, before one actually issues a SIGHUP. I also added an "applied" column so that it's easy to identify entries that are superseded by later entries; this was the main use-case for the original design, but it seemed unnecessarily hard to use for that. Also fix a 9.4.1 regression that allowed multiple entries for a PGC_POSTMASTER variable to cause bogus complaints in the postmaster log. (The issue here was that commit `bf007a27ac` unintentionally reverted `3e3f65973a`, which suppressed any duplicate entries within ParseConfigFp. However, since the original coding of the pg_file_settings view depended on such suppression not happening, we couldn't have fixed this issue now without first doing something with pg_file_settings. Now we suppress duplicates by marking them "ignored" within ProcessConfigFileInternal, which doesn't hide them in the view.) Lesser changes include: Drive the view directly off the ConfigVariable list, instead of making a basically-equivalent second copy of the data. There's no longer any need to hang onto the data permanently, anyway. 
Convert show_all_file_settings() to do its work in one call and return a tuplestore; this avoids risks associated with assuming that the GUC state will hold still over the course of query execution. (I think there were probably latent bugs here, though you might need something like a cursor on the view to expose them.) Arrange to run SIGHUP processing in a short-lived memory context, to forestall process-lifespan memory leaks. (There is one known leak in this code, in ProcessConfigDirectory; it seems minor enough to not be worth back-patching a specific fix for.) Remove mistaken assignment to ConfigFileLineno that caused line counting after an include_dir directive to be completely wrong. Add missed failure check in AlterSystemSetConfigFile(). We don't really expect ParseConfigFp() to fail, but that's not an excuse for not checking.	2015-06-28 18:06:14 -04:00
Heikki Linnakangas	cb2acb1081	Add missing_ok option to the SQL functions for reading files. This makes it possible to use the functions without getting errors, if there is a chance that the file might be removed or renamed concurrently. pg_rewind needs to do just that, although this could be useful for other purposes too. (The changes to pg_rewind to use these functions will come in a separate commit.) The read_binary_file() function isn't very well-suited for extension.c's purposes anymore, if it ever was. So bite the bullet and make a copy of it in extension.c, tailored for that use case. This seems better than the accidental code reuse, even if it's a few more lines of code. Michael Paquier, with plenty of kibitzing by me.	2015-06-28 21:35:46 +03:00
Kevin Grittner	604e99396d	Add opaque declaration of HTAB to tqual.h. Commit `b89e151054` added the ResolveCminCmaxDuringDecoding declaration to tqual.h, which uses an HTAB parameter, without declaring HTAB. It accidentally fails to fail to build with current sources because a declaration happens to be included, directly or indirectly, in all source files that currently use tqual.h before tqual.h is first included, but we shouldn't count on that. Since an opaque declaration is enough here, just use that, as was done in snapmgr.h. Backpatch to 9.4, where the HTAB reference was added to tqual.h.	2015-06-27 09:55:06 -05:00
Alvaro Herrera	7d60b2af34	Fix DDL command collection for TRANSFORM Commit `b488c580ae`, which added the DDL command collection feature, neglected to update the code that commit `cac7658205` had previously added two weeks earlier for the TRANSFORM feature. Reported by Michael Paquier.	2015-06-26 18:17:54 -03:00
Robert Haas	8f15f74a44	Be more conservative about removing tablespace "symlinks". Don't apply rmtree(), which will gleefully remove an entire subtree, and don't even apply unlink() unless it's a symlink or a directory, the only things that we expect to find. Amit Kapila, with minor tweaks by me, per extensive discussions involving Andrew Dunstan, Fujii Masao, and Heikki Linnakangas, at least some of whom also reviewed the code.	2015-06-26 15:53:13 -04:00
Robert Haas	5ca611841b	Improve handling of CustomPath/CustomPlan(State) children. Allow CustomPath to have a list of paths, CustomPlan a list of plans, and CustomPlanState a list of planstates known to the core system, so that custom path/plan providers can more reasonably use this infrastructure for nodes with multiple children. KaiGai Kohei, per a design suggestion from Tom Lane, with some further kibitzing by me.	2015-06-26 09:40:47 -04:00
Tom Lane	5d1ff6bd55	Fix the logic for putting relations into the relcache init file. Commit `f3b5565dd4` was a couple of bricks shy of a load; specifically, it missed putting pg_trigger_tgrelid_tgname_index into the relcache init file, because that index is not used by any syscache. However, we have historically nailed that index into cache for performance reasons. The upshot was that load_relcache_init_file always decided that the init file was busted and silently ignored it, resulting in a significant hit to backend startup speed. To fix, reinstantiate RelationIdIsInInitFile() as a wrapper around RelationSupportsSysCache(), which can know about additional relations that should be in the init file despite being unknown to syscache.c. Also install some guards against future mistakes of this type: make write_relcache_init_file Assert that all nailed relations get written to the init file, and make load_relcache_init_file emit a WARNING if it takes the "wrong number of nailed relations" exit path. Now that we remove the init files during postmaster startup, that case should never occur in the field, even if we are starting a minor-version update that added or removed rels from the nailed set. So the warning shouldn't ever be seen by end users, but it will show up in the regression tests if somebody breaks this logic. Back-patch to all supported branches, like the previous commit.	2015-06-25 14:39:05 -04:00
Andrew Dunstan	41d798a139	Fix comment in fmgr.h to refer to actual function used. FunctionLookup() is long gone if it ever existed, and fmgr_info() is what's now used, so the comments now reflect that.	2015-06-15 23:21:03 -04:00
Fujii Masao	b5fe62038f	Make postmaster restart archiver soon after it dies, even during recovery. After the archiver dies, postmaster tries to start a new one immediately. But previously this could happen only while server was running normally even though archiving was enabled always (i.e., archive_mode was set to always). So the archiver running during recovery could not restart soon after it died. This is an oversight in commit `ffd3774`. This commit changes reaper(), postmaster's signal handler to clean up after a child process dies, so that it tries to start a new archiver even during recovery if necessary. Patch by me. Review by Alvaro Herrera.	2015-06-12 23:11:51 +09:00
Andrew Dunstan	908e234733	Rename jsonb - text[] operator to #- to avoid ambiguity. Following recent discussion on -hackers. The underlying function is also renamed to jsonb_delete_path. The regression tests now don't need ugly type casts to avoid the ambiguity, so they are also removed. Catalog version bumped.	2015-06-11 10:06:58 -04:00
Kevin Grittner	870681017a	Fix typo in comment. Backpatch to 9.4 to minimize possible conflicts.	2015-06-10 17:03:56 -05:00
Fujii Masao	ea9c4c1e4a	Fix typo in comment. David Rowley	2015-06-10 15:26:02 +09:00
Tom Lane	f3b5565dd4	Use a safer method for determining whether relcache init file is stale. When we invalidate the relcache entry for a system catalog or index, we must also delete the relcache "init file" if the init file contains a copy of that rel's entry. The old way of doing this relied on a specially maintained list of the OIDs of relations present in the init file: we made the list either when reading the file in, or when writing the file out. The problem is that when writing the file out, we included only rels present in our local relcache, which might have already suffered some deletions due to relcache inval events. In such cases we correctly decided not to overwrite the real init file with incomplete data --- but we still used the incomplete initFileRelationIds list for the rest of the current session. This could result in wrong decisions about whether the session's own actions require deletion of the init file, potentially allowing an init file created by some other concurrent session to be left around even though it's been made stale. Since we don't support changing the schema of a system catalog at runtime, the only likely scenario in which this would cause a problem in the field involves a "vacuum full" on a catalog concurrently with other activity, and even then it's far from easy to provoke. Remarkably, this has been broken since 2002 (in commit `7863404417`), but we had never seen a reproducible test case until recently. If it did happen in the field, the symptoms would probably involve unexpected "cache lookup failed" errors to begin with, then "could not open file" failures after the next checkpoint, as all accesses to the affected catalog stopped working. Recovery would require manually removing the stale "pg_internal.init" file. To fix, get rid of the initFileRelationIds list, and instead consult syscache.c's list of relations used in catalog caches to decide whether a relation is included in the init file. 
This should be a tad more efficient anyway, since we're replacing linear search of a list with ~100 entries with a binary search. It's a bit ugly that the init file contents are now so directly tied to the catalog caches, but in practice that won't make much difference. Back-patch to all supported branches.	2015-06-07 15:32:09 -04:00
Tom Lane	3f59be836c	Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans. When the inner side of a nestloop SEMI or ANTI join is an indexscan that uses all the join clauses as indexquals, it can be presumed that both matched and unmatched outer rows will be processed very quickly: for matched rows, we'll stop after fetching one row from the indexscan, while for unmatched rows we'll have an indexscan that finds no matching index entries, which should also be quick. The planner already knew about this, but it was nonetheless charging for at least one full run of the inner indexscan, as a consequence of concerns about the behavior of materialized inner scans --- but those concerns don't apply in the fast case. If the inner side has low cardinality (many matching rows) this could make an indexscan plan look far more expensive than it actually is. To fix, rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we don't add the inner scan cost until we've inspected the indexquals, and then we can add either the full-run cost or just the first tuple's cost as appropriate. Experimentation with this fix uncovered another problem: add_path and friends were coded to disregard cheap startup cost when considering parameterized paths. That's usually okay (and desirable, because it thins the path herd faster); but in this fast case for SEMI/ANTI joins, it could result in throwing away the desired plain indexscan path in favor of a bitmap scan path before we ever get to the join costing logic. In the many-matching-rows cases of interest here, a bitmap scan will do a lot more work than required, so this is a problem. To fix, add a per-relation flag consider_param_startup that works like the existing consider_startup flag, but applies to parameterized paths, and set it for relations that are the inside of a SEMI or ANTI join. 
To make this patch reasonably safe to back-patch, care has been taken to avoid changing the planner's behavior except in the very narrow case of SEMI/ANTI joins with inner indexscans. There are places in compare_path_costs_fuzzily and add_path_precheck that are not terribly consistent with the new approach, but changing them will affect planner decisions at the margins in other cases, so we'll leave that for a HEAD-only fix. Back-patch to 9.3; before that, the consider_startup flag didn't exist, meaning that the second aspect of the patch would be too invasive. Per a complaint from Peter Holzer and analysis by Tomas Vondra.	2015-06-03 11:59:10 -04:00
Andrew Dunstan	37def42245	Rename jsonb_replace to jsonb_set and allow it to add new values The function is given a fourth parameter, which defaults to true. When this parameter is true, if the last element of the path is missing in the original json, jsonb_set creates it in the result and assigns it the new value. If it is false then the function does nothing unless all elements of the path are present, including the last. Based on some original code from Dmitry Dolgov, heavily modified by me. Catalog version bumped.	2015-05-31 20:34:10 -04:00
Tom Lane	1c8c656b3c	Check that all aliases of a built-in function have same leakproof property. opr_sanity.sql has a test checking that relevant properties of built-in functions match when the same C function is referenced by multiple pg_proc entries. The test neglected to check proleakproof, though, and when I added that condition it exposed that xideqint4 hadn't been updated to match xideq. So fix that as well, and in consequence bump catversion. This isn't very critical, so no need to worry about fixing back branches.	2015-05-29 13:26:21 -04:00
Tom Lane	da33a3894e	Revert exporting of internal GUC variable "data_directory". This undoes a poorly-thought-out choice in commit `970a18687f`, namely to export guc.c's internal variable data_directory. The authoritative variable so far as C code is concerned is DataDir; there is no reason for anything except specific bits of GUC code to look at the GUC variable. After yesterday's commits fixing the fsync-on-restart patch, the only remaining misuse of data_directory was in AlterSystemSetConfigFile(), which would be much better off just using a relative path anyhow: it's less code and it doesn't break if the DBA moves the data directory of a running system, which is a case we've taken some pains over in the past. This is mostly cosmetic, so no need for a back-patch (and I'd be hesitant to remove a global variable in stable branches anyway).	2015-05-29 11:57:33 -04:00
Tom Lane	d8179b001a	Fix fsync-at-startup code to not treat errors as fatal. Commit `2ce439f337` introduced a rather serious regression, namely that if its scan of the data directory came across any un-fsync-able files, it would fail and thereby prevent database startup. Worse yet, symlinks to such files also caused the problem, which meant that crash restart was guaranteed to fail on certain common installations such as older Debian. After discussion, we agreed that (1) failure to start is worse than any consequence of not fsync'ing is likely to be, therefore treat all errors in this code as nonfatal; (2) we should not chase symlinks other than those that are expected to exist, namely pg_xlog/ and tablespace links under pg_tblspc/. The latter restriction avoids possibly fsync'ing a much larger part of the filesystem than intended, if the user has left random symlinks hanging about in the data directory. This commit takes care of that and also does some code beautification, mainly moving the relevant code into fd.c, which seems a much better place for it than xlog.c, and making sure that the conditional compilation for the pre_sync_fname pass has something to do with whether pg_flush_data works. I also relocated the call site in xlog.c down a few lines; it seems a bit silly to be doing this before ValidateXLOGDirectoryStructure(). The similar logic in initdb.c ought to be made to match this, but that change is noncritical and will be dealt with separately. Back-patch to all active branches, like the prior commit. Abhijit Menon-Sen and Tom Lane	2015-05-28 17:33:03 -04:00
Bruce Momjian	befa3e648c	Revert 9.5 pgindent changes to atomics directory files This is because there are many __asm__ blocks there that pgindent messes up. Also configure pgindent to skip that directory in the future.	2015-05-24 21:45:01 -04:00
Tom Lane	23116d5437	Add a bit more commentary about regex's colormap tree data structure. Per an off-list question from Piotr Stefaniak.	2015-05-24 12:40:38 -04:00
Tom Lane	91e79260f6	Remove no-longer-required function declarations. Remove a bunch of "extern Datum foo(PG_FUNCTION_ARGS);" declarations that are no longer needed now that PG_FUNCTION_INFO_V1(foo) provides that. Some of these were evidently missed in commit `e7128e8dbb`, but others were cargo-culted in in code added since then. Possibly that can be blamed in part on the fact that we'd not fixed relevant documentation examples, which I've now done.	2015-05-24 12:20:23 -04:00
Bruce Momjian	807b9e0dff	pgindent run for 9.5	2015-05-23 21:35:49 -04:00
Tom Lane	821b821a24	Still more fixes for lossy-GiST-distance-functions patch. Fix confusion in documentation, substantial memory leakage if float8 or float4 are pass-by-reference, and assorted comments that were obsoleted by commit `98edd617f3`.	2015-05-23 15:22:25 -04:00
Andres Freund	631d749007	Remove the new UPSERT command tag and use INSERT instead. Previously, INSERT with ON CONFLICT DO UPDATE specified used a new command tag -- UPSERT. It was introduced out of concern that INSERT as a command tag would be a misrepresentation for ON CONFLICT DO UPDATE, as some affected rows may actually have been updated. Alvaro Herrera noticed that the implementation of that new command tag was incomplete; in subsequent discussion we concluded that having it doesn't provide benefits that are in line with the compatibility breaks it requires. Catversion bump due to the removal of PlannedStmt->isUpsert. Author: Peter Geoghegan Discussion: 20150520215816.GI5885@postgresql.org	2015-05-23 00:58:45 +02:00
Andrew Dunstan	5302760a50	Unpack jbvBinary objects passed to pushJsonbValue. pushJsonbValue was accepting jbvBinary objects passed as WJB_ELEM or WJB_VALUE data. While this succeeded, when those objects were later encountered in attempting to convert the result to Jsonb, errors occurred. With this change we guarantee that a JsonbValue constructed from calls to pushJsonbValue does not contain any jbvBinary objects. This cures a problem observed with jsonb_delete. This means callers of pushJsonbValue no longer need to perform this unpacking themselves. A subsequent patch will perform some cleanup in that area. The error was not triggered by any 9.4 code, but this is a publicly visible routine, and so the error could be exercised by third party code, therefore backpatch to 9.4. Bug report from Peter Geoghegan, fix by me.	2015-05-22 10:21:41 -04:00
Heikki Linnakangas	7cbee7c0a1	At promotion, don't leave behind a partial segment on the old timeline. With commit `de768844`, a copy of the partial segment was archived with the .partial suffix, but the original file was still left in pg_xlog, so it didn't actually solve the problems with archiving the partial segment that it was supposed to solve. With this patch, the partial segment is renamed rather than copied, so we only archive it with the .partial suffix. Also be more robust in detecting if the last segment is already being archived. Previously I used XLogArchiveIsBusy() for that, but that's not quite right. With archive_mode='always', there might be a .ready file for it, and we don't want to rename it to .partial in that case. The old segment is needed until we're fully committed to the new timeline, i.e. until we've written the end-of-recovery WAL record and updated the min recovery point and timeline in the control file. So move the renaming later in the startup sequence, after all that's been done.	2015-05-22 11:04:33 +03:00
Tom Lane	c5dd8ead40	More fixes for lossy-GiST-distance-functions patch. Paul Ramsey reported that commit `35fcb1b3d0` induced a core dump on commuted ORDER BY expressions, because it was assuming that the indexorderby expression could be found verbatim in the relevant equivalence class, but it wasn't there. We really don't need anything that complicated anyway; for the data types likely to be used for index ORDER BY operators in the foreseeable future, the exprType() of the ORDER BY expression will serve fine. (The case where we'd have to work harder is where the ORDER BY expression's result is only binary-compatible with the declared input type of the ordering operator; long before worrying about that, one would need to get rid of GiST's hard-wired assumption that said datatype is float8.) Aside from fixing that crash and adding a regression test for the case, I did some desultory code review: nodeIndexscan.c was likewise overthinking how hard it ought to work to identify the datatype of the ORDER BY expressions. Add comments explaining how come nodeIndexscan.c can get away with simplifying assumptions about NULLS LAST ordering and no backward scan. Revert no-longer-needed changes of find_ec_member_for_tle(); while the new definition was no worse than the old, it wasn't better either, and it might cause back-patching pain. Revert entirely bogus additions to genam.h.	2015-05-21 19:47:48 -04:00
Tom Lane	d4b538ea36	Improve packing/alignment annotation for ItemPointerData. We want this struct to be exactly a series of 3 int16 words, no more and no less. Historically, at least, some ARM compilers preferred to pad it to 8 bytes unless coerced. Our old way of doing that was just to use __attribute__((packed)), but as pointed out by Piotr Stefaniak, that does too much: it also licenses the compiler to give the struct only byte-alignment. We don't want that because it adds access overhead, possibly quite significant overhead. According to the GCC manual, what we want requires also specifying __attribute__((aligned(2))). It's not entirely clear if all the relevant compilers accept this attribute as well, but we can hope the buildfarm will tell us if not. We can also add a static assertion that should fire if the compiler padded the struct. Since the combination of these attributes should define exactly what we want on any compiler that accepts them, let's try using them wherever we think they exist, not only for __arm__. (This is likely to expose that the conditional definitions in c.h are inadequate, but finding that out would be a good thing.) The immediate motivation for this is that the current definition of ExecRowMark allows its curCtid field to be misaligned. It is not clear whether there are any other uses of ItemPointerData with a similar hazard. We could change the definition of ExecRowMark if this doesn't work, but it would be far better to have a future-proof fix. Piotr Stefaniak, some further hacking by me	2015-05-21 17:21:46 -04:00
Heikki Linnakangas	fa60fb63e5	Fix more typos in comments. Patch by CharSyam, plus a few more I spotted with grep.	2015-05-20 19:45:43 +03:00
Heikki Linnakangas	4fc72cc7bb	Collection of typo fixes. Use "a" and "an" correctly, mostly in comments. Two error messages were also fixed (they were just elogs, so no translation work required). Two function comments in pg_proc.h were also fixed. Etsuro Fujita reported one of these, but I found a lot more with grep. Also fix a few other typos spotted while grepping for the a/an typos. For example, "consists out of ..." -> "consists of ...". Plus a "though"/ "through" mixup reported by Euler Taveira. Many of these typos were in old code, which would be nice to backpatch to make future backpatching easier. But much of the code was new, and I didn't feel like crafting separate patches for each branch. So no backpatching.	2015-05-20 16:56:22 +03:00
Tom Lane	0c071936e9	Revert error-throwing wrappers for the printf family of functions. This reverts commit `16304a0134`, except for its changes in src/port/snprintf.c; as well as commit `cac18a76bb` which is no longer needed. Fujii Masao reported that the previous commit caused failures in psql on OS X, since if one exits the pager program early while viewing a query result, psql sees an EPIPE error from fprintf --- and the wrapper function thought that was reason to panic. (It's a bit surprising that the same does not happen on Linux.) Further discussion among the security list concluded that the risk of other such failures was far too great, and that the one-size-fits-all approach to error handling embodied in the previous patch is unlikely to be workable. This leaves us again exposed to the possibility of the type of failure envisioned in CVE-2015-3166. However, that failure mode is strictly hypothetical at this point: there is no concrete reason to believe that an attacker could trigger information disclosure through the supposed mechanism. In the first place, the attack surface is fairly limited, since so much of what the backend does with format strings goes through stringinfo.c or psprintf(), and those already had adequate defenses. In the second place, even granting that an unprivileged attacker could control the occurrence of ENOMEM with some precision, it's a stretch to believe that he could induce it just where the target buffer contains some valuable information. So we concluded that the risk of non-hypothetical problems induced by the patch greatly outweighs the security risks. We will therefore revert, and instead undertake closer analysis to identify specific calls that may need hardening, rather than attempt a universal solution. We have kept the portion of the previous patch that improved snprintf.c's handling of errors when it calls the platform's sprintf(). That seems to be an unalloyed improvement. 
Security: CVE-2015-3166	2015-05-19 18:19:38 -04:00
Andres Freund	0740cbd759	Refactor ON CONFLICT index inference parse tree representation. Defer lookup of opfamily and input type of a user specified opclass until the optimizer selects among available unique indexes; and store the opclass in the parse analyzed tree instead. The primary reason for doing this is that for rule deparsing it's easier to use the opclass than the previous representation. While at it also rename a variable in the inference code to better fit its purpose. This is separate from the actual fixes for deparsing to make review easier.	2015-05-19 21:21:27 +02:00
Tom Lane	0b28ea79c0	Avoid collation dependence in indexes of system catalogs. No index in template0 should have collation-dependent ordering, especially not indexes on shared catalogs. For most textual columns we avoid this issue by using type "name" (which sorts per strcmp()). However there are a few indexed columns that we'd prefer to use "text" for, and for that, the default opclass text_ops is unsafe. Fortunately, text_pattern_ops is safe (it sorts per memcmp()), and it has no real functional disadvantage for our purposes. So change the indexes on pg_seclabel.provider and pg_shseclabel.provider to use text_pattern_ops. In passing, also mark pg_replication_origin.roname as using text_pattern_ops --- for some reason it was labeled varchar_pattern_ops which is just wrong, even though it accidentally worked. Add regression test queries to catch future errors of these kinds. We still can't do anything about the misdeclared pg_seclabel and pg_shseclabel indexes in back branches :-(	2015-05-19 11:47:42 -04:00
Tom Lane	afee04352b	Revert "Change pg_seclabel.provider and pg_shseclabel.provider to type "name"." This reverts commit `b82a7be603`. There is a better (less invasive) way to fix it, which I will commit next.	2015-05-19 10:40:04 -04:00
Tom Lane	b82a7be603	Change pg_seclabel.provider and pg_shseclabel.provider to type "name". These were "text", but that's a bad idea because it has collation-dependent ordering. No index in template0 should have collation-dependent ordering, especially not indexes on shared catalogs. There was general agreement that provider names don't need to be longer than other identifiers, so we can fix this at a small waste of table space by changing from text to name. There's no way to fix the problem in the back branches, but we can hope that security labels don't yet have widespread-enough usage to make it urgent to fix. There needs to be a regression sanity test to prevent us from making this same mistake again; but before putting that in, we'll need to get rid of similar brain fade in the recently-added pg_replication_origin catalog. Note: for lack of a suitable testing environment, I've not really exercised this change. I trust the buildfarm will show up any mistakes.	2015-05-18 20:07:53 -04:00
Tom Lane	4db485e75b	Put back a backwards-compatible version of sampling support functions. Commit `83e176ec18` removed the longstanding support functions for block sampling without any consideration of the impact this would have on third-party FDWs. The new API is not notably more functional for FDWs than the old, so forcing them to change doesn't seem like a good thing. We can provide the old API as a wrapper (more or less) around the new one for a minimal amount of extra code.	2015-05-18 18:34:37 -04:00
Noah Misch	16304a0134	Add error-throwing wrappers for the printf family of functions. All known standard library implementations of these functions can fail with ENOMEM. A caller neglecting to check for failure would experience missing output, information exposure, or a crash. Check return values within wrappers and code, currently just snprintf.c, that bypasses the wrappers. The wrappers do not return after an error, so their callers need not check. Back-patch to 9.0 (all supported versions). Popular free software standard library implementations do take pains to bypass malloc() in simple cases, but they risk ENOMEM for floating point numbers, positional arguments, large field widths, and large precisions. No specification demands such caution, so this commit regards every call to a printf family function as a potential threat. Injecting the wrappers implicitly is a compromise between patch scope and design goals. I would prefer to edit each call site to name a wrapper explicitly. libpq and the ECPG libraries would, ideally, convey errors to the caller rather than abort(). All that would be painfully invasive for a back-patched security fix, hence this compromise. Security: CVE-2015-3166	2015-05-18 10:02:31 -04:00
Noah Misch	cac18a76bb	Permit use of vsprintf() in PostgreSQL code. The next commit needs it. Back-patch to 9.0 (all supported versions).	2015-05-18 10:02:31 -04:00
Tom Lane	424661913c	Fix failure to copy IndexScan.indexorderbyops in copyfuncs.c. This oversight results in a crash at executor startup if the plan has been copied. outfuncs.c was missed as well. While we could probably have taught both those files to cope with the originally chosen representation of an Oid array, it would have been painful, not least because there'd be no easy way to verify the array length. An Oid List is far easier to work with. And AFAICS, there is no particular notational benefit to using an array rather than a list in the existing parts of the patch either. So just change it to a list. Error in commit `35fcb1b3d0`, which is new, so no need for back-patch.	2015-05-17 21:22:12 -04:00
Andres Freund	f3d3118532	Support GROUPING SETS, CUBE and ROLLUP. This SQL standard functionality allows aggregating data by different GROUP BY clauses at once. Each grouping set returns rows with columns grouped by in other sets set to NULL. This could previously be achieved by doing each grouping as a separate query, conjoined by UNION ALLs. Besides being considerably more concise, grouping sets will in many cases be faster, requiring only one scan over the underlying data. The current implementation of grouping sets only supports using sorting for input. Individual sets that share a sort order are computed in one pass. If there are sets that don't share a sort order, additional sort & aggregation steps are performed. These additional passes are sourced by the previous sort step; thus avoiding repeated scans of the source data. The code is structured in a way that adding support for purely using hash aggregation or a mix of hashing and sorting is possible. Sorting was chosen to be supported first, as it is the most generic method of implementation. Instead of, as in earlier versions of the patch, representing the chain of sort and aggregation steps as full blown planner and executor nodes, all but the first sort are performed inside the aggregation node itself. This avoids the need to do some unusual gymnastics to handle having to return aggregated and non-aggregated tuples from underlying nodes, as well as having to shut down underlying nodes early to limit memory usage. The optimizer still builds Sort/Agg nodes to describe each phase, but they're not part of the plan tree, but instead additional data for the aggregation node. They're a convenient and preexisting way to describe aggregation and sorting. The first (and possibly only) sort step is still performed as a separate execution step. 
That retains similarity with existing group by plans, makes rescans fairly simple, avoids very deep plans (leading to slow explains) and easily allows avoiding the sorting step if the underlying data is sorted by other means. A somewhat ugly side of this patch is having to deal with a grammar ambiguity between the new CUBE keyword and the cube extension/functions named cube (and rollup). To avoid breaking existing deployments of the cube extension it has not been renamed, neither has cube been made a reserved keyword. Instead precedence hacking is used to make GROUP BY cube(..) refer to the CUBE grouping sets feature, and not the function cube(). To actually group by a function cube(), unlikely as that might be, the function name has to be quoted. Needs a catversion bump because stored rules may change. Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com	2015-05-16 03:46:31 +02:00
Alvaro Herrera	b0b7be6133	Add BRIN infrastructure for "inclusion" opclasses This lets BRIN be used with R-Tree-like indexing strategies. Also provided are operator classes for range types, box and inet/cidr. The infrastructure provided here should be sufficient to create operator classes for similar datatypes; for instance, opclasses for PostGIS geometries should be doable, though we didn't try to implement one. (A box/point opclass was also submitted, but we ripped it out before commit because the handling of floating point comparisons in existing code is inconsistent and would generate corrupt indexes.) Author: Emre Hasegeli. Cosmetic changes by me Review: Andreas Karlsson	2015-05-15 18:05:22 -03:00
Alvaro Herrera	26df7066cc	Move strategy numbers to include/access/stratnum.h For upcoming BRIN opclasses, it's convenient to have strategy numbers defined in a single place. Since there's nothing appropriate, create it. The StrategyNumber typedef now lives there, as well as existing strategy numbers for B-trees (from skey.h) and R-tree-and-friends (from gist.h). skey.h is forced to include stratnum.h because of the StrategyNumber typedef, but gist.h is not; extensions that currently rely on gist.h for rtree strategy numbers might need to add a new A few .c files can stop including skey.h and/or gist.h, which is a nice side benefit. Per discussion: https://www.postgresql.org/message-id/20150514232132.GZ2523@alvh.no-ip.org Authored by Emre Hasegeli and Álvaro. (It's not clear to me why bootscanner.l has any #include lines at all.)	2015-05-15 17:03:16 -03:00
Simon Riggs	f6d208d6e5	TABLESAMPLE, SQL Standard and extensible. Add a TABLESAMPLE clause to SELECT statements that allows the user to specify random BERNOULLI sampling or block-level SYSTEM sampling. The implementation allows extensible sampling functions to be written, using a standard API. The basic version follows the SQL Standard exactly. Usable concrete use cases for the sampling API follow in later commits. Petr Jelinek. Reviewed by Michael Paquier and Simon Riggs	2015-05-15 14:37:10 -04:00
Heikki Linnakangas	ffd37740ee	Add archive_mode='always' option. In 'always' mode, the standby independently archives all files it receives from the primary. Original patch by Fujii Masao, docs and review by me.	2015-05-15 18:55:24 +03:00
Heikki Linnakangas	98edd617f3	Fix datatype confusion with the new lossy GiST distance functions. We can only support a lossy distance function when the distance function's datatype is comparable with the original ordering operator's datatype. The distance function always returns a float8, so we are limited to float8, and float4 (by a hard-coded cast of the float8 to float4). In light of this limitation, it seems like a good idea to have a separate 'recheck' flag for the ORDER BY expressions, so that if you have a non-lossy distance function, it still works with lossy quals. There are cases like that with the built-in or contrib opclasses, but it's plausible. There was a hidden assumption that the ORDER BY values returned by GiST match the original ordering operator's return type, but there are plenty of examples where that's not true, e.g. in btree_gist and pg_trgm. As long as the distance function is not lossy, we can tolerate that and just not return the distance to the executor (or rather, always return NULL). The executor doesn't need the distances if there are no lossy results. There was another little bug: the recheck variable was not initialized before calling the distance function. That revealed the bigger issue, as the executor tried to reorder tuples that didn't need reordering, and that failed because of the datatype mismatch.	2015-05-15 18:09:31 +03:00
Heikki Linnakangas	35fcb1b3d0	Allow GiST distance function to return merely a lower-bound. The distance function can now set *recheck = true, like index quals. The executor will then re-check the ORDER BY expressions, and use a queue to reorder the results on the fly. This makes it possible to do kNN-searches on polygons and circles, which don't store the exact value in the index, but just a bounding box. Alexander Korotkov and me	2015-05-15 14:26:51 +03:00
Fujii Masao	ecd222e770	Support VERBOSE option in REINDEX command. When this option is specified, a progress report is printed as each index is reindexed. Per discussion, we agreed on the following syntax for the extensibility of the options. REINDEX (flexible options) { INDEX \| ... } name Sawada Masahiko. Reviewed by Robert Haas, Fabrízio Mello, Alvaro Herrera, Kyotaro Horiguchi, Jim Nasby and me. Discussion: CAD21AoA0pK3YcOZAFzMae+2fcc3oGp5zoRggDyMNg5zoaWDhdQ@mail.gmail.com	2015-05-15 20:09:57 +09:00
Tom Lane	7730f48ede	Teach UtfToLocal/LocalToUtf to support algorithmic encoding conversions. Until now, these functions have only supported encoding conversions using lookup tables, which is fine as long as there's not too many code points to convert. However, GB18030 expects all 1.1 million Unicode code points to be convertible, which would require a ridiculously-sized lookup table. Fortunately, a large fraction of those conversions can be expressed through arithmetic, ie the conversions are one-to-one in certain defined ranges. To support that, provide a callback function that is used after consulting the lookup tables. (This patch doesn't actually change anything about the GB18030 conversion behavior, just provide infrastructure for fixing it.) Since this requires changing the APIs of UtfToLocal/LocalToUtf anyway, take the opportunity to rearrange their argument lists into what seems to me a saner order. And beautify the call sites by using lengthof() instead of error-prone sizeof() arithmetic. In passing, also mark all the lookup tables used by these calls "const". This moves an impressive amount of stuff into the text segment, at least on my machine, and is safer anyhow.	2015-05-14 22:27:12 -04:00
Simon Riggs	83e176ec18	Separate block sampling functions Refactoring ahead of tablesample patch Requested and reviewed by Michael Paquier Petr Jelinek	2015-05-15 04:02:54 +02:00
Peter Eisentraut	a486e35706	Add pg_settings.pending_restart column with input from David G. Johnston, Robert Haas, Michael Paquier	2015-05-14 20:08:51 -04:00
Tom Lane	1dc5ebc907	Support "expanded" objects, particularly arrays, for better performance. This patch introduces the ability for complex datatypes to have an in-memory representation that is different from their on-disk format. On-disk formats are typically optimized for minimal size, and in any case they can't contain pointers, so they are often not well-suited for computation. Now a datatype can invent an "expanded" in-memory format that is better suited for its operations, and then pass that around among the C functions that operate on the datatype. There are also provisions (rudimentary as yet) to allow an expanded object to be modified in-place under suitable conditions, so that operations like assignment to an element of an array need not involve copying the entire array. The initial application for this feature is arrays, but it is not hard to foresee using it for other container types like JSON, XML and hstore. I have hopes that it will be useful to PostGIS as well. In this initial implementation, a few heuristics have been hard-wired into plpgsql to improve performance for arrays that are stored in plpgsql variables. We would like to generalize those hacks so that other datatypes can obtain similar improvements, but figuring out some appropriate APIs is left as a task for future work. (The heuristics themselves are probably not optimal yet, either, as they sometimes force expansion of arrays that would be better left alone.) Preliminary performance testing shows impressive speed gains for plpgsql functions that do element-by-element access or update of large arrays. There are other cases that get a little slower, as a result of added array format conversions; but we can hope to improve anything that's annoyingly bad. In any case most applications should see a net win. Tom Lane, reviewed by Andres Freund	2015-05-14 12:08:49 -04:00
Andrew Dunstan	5c7df74204	Fix some errors from jsonb functions patch. The catalog version should have been bumped, and the alternative regression result file was not up to date with the name of jsonb_pretty.	2015-05-12 16:54:38 -04:00
Andrew Dunstan	c6947010ce	Additional functions and operators for jsonb. jsonb_pretty(jsonb) produces nicely indented json output. jsonb \|\| jsonb concatenates two jsonb values. jsonb - text removes a key and its associated value from the json. jsonb - int removes the designated array element. jsonb - text[] removes a key and associated value or array element at the designated path. jsonb_replace(jsonb,text[],jsonb) replaces the array element designated by the path or the value associated with the key designated by the path with the given value. Original work by Dmitry Dolgov, adapted and reworked for PostgreSQL core by Andrew Dunstan, reviewed and tidied up by Petr Jelinek.	2015-05-12 15:52:45 -04:00
Tom Lane	afb9249d06	Add support for doing late row locking in FDWs. Previously, FDWs could only do "early row locking", that is lock a row as soon as it's fetched, even though local restriction/join conditions might discard the row later. This patch adds callbacks that allow FDWs to do late locking in the same way that it's done for regular tables. To make use of this feature, an FDW must support the "ctid" column as a unique row identifier. Currently, since ctid has to be of type TID, the feature is of limited use, though in principle it could be used by postgres_fdw. We may eventually allow FDWs to specify another data type for ctid, which would make it possible for more FDWs to use this feature. This commit does not modify postgres_fdw to use late locking. We've tested some prototype code for that, but it's not in committable shape, and besides it's quite unclear whether it actually makes sense to do late locking against a remote server. The extra round trips required are likely to outweigh any benefit from improved concurrency. Etsuro Fujita, reviewed by Ashutosh Bapat, and hacked up a lot by me	2015-05-12 14:10:17 -04:00
Andrew Dunstan	72d422a522	Map basebackup tablespaces using a tablespace_map file. Windows can't reliably restore symbolic links from a tar format, so instead during backup start we create a tablespace_map file, which is used by the restoring postgres to create the correct links in pg_tblspc. The backup protocol also now has an option to request this file to be included in the backup stream, and this is used by pg_basebackup when operating in tar mode. This is done on all platforms, not just Windows. This means that pg_basebackup will not work in tar mode against 9.4 and older servers, as this protocol option isn't implemented there. Amit Kapila, reviewed by Dilip Kumar, with a little editing from me.	2015-05-12 09:29:10 -04:00
Alvaro Herrera	b488c580ae	Allow on-the-fly capture of DDL event details This feature lets user code inspect and take action on DDL events. Whenever a ddl_command_end event trigger is installed, DDL actions executed are saved to a list which can be inspected during execution of a function attached to ddl_command_end. The set-returning function pg_event_trigger_ddl_commands can be used to list actions so captured; it returns data about the type of command executed, as well as the affected object. This is sufficient for many uses of this feature. For the cases where it is not, we also provide a "command" column of a new pseudo-type pg_ddl_command, which is a pointer to a C structure that can be accessed by C code. The struct contains all the info necessary to completely inspect and even reconstruct the executed command. There is no actual deparse code here; that's expected to come later. What we have is enough infrastructure that the deparsing can be done in an external extension. The intention is that we will add some deparsing code in a later release, as an in-core extension. A new test module is included. It's probably insufficient as is, but it should be sufficient as a starting point for a more complete and future-proof approach. Authors: Álvaro Herrera, with some help from Andres Freund, Ian Barwick, Abhijit Menon-Sen. Reviews by Andres Freund, Robert Haas, Amit Kapila, Michael Paquier, Craig Ringer, David Steele. Additional input from Chris Browne, Dimitri Fontaine, Stephen Frost, Petr Jelínek, Tom Lane, Jim Nasby, Steven Singer, Pavel Stěhule. Based on original work by Dimitri Fontaine, though I didn't use his code. Discussion: https://www.postgresql.org/message-id/m2txrsdzxa.fsf@2ndQuadrant.fr https://www.postgresql.org/message-id/20131108153322.GU5809@eldon.alvh.no-ip.org https://www.postgresql.org/message-id/20150215044814.GL3391@alvh.no-ip.org	2015-05-11 19:14:31 -03:00
Tom Lane	1a8a4e5cde	Code review for foreign/custom join pushdown patch. Commit `e7cb7ee145` included some design decisions that seem pretty questionable to me, and there was quite a lot of stuff not to like about the documentation and comments. Clean up as follows: * Consider foreign joins only between foreign tables on the same server, rather than between any two foreign tables with the same underlying FDW handler function. In most if not all cases, the FDW would simply have had to apply the same-server restriction itself (far more expensively, both for lack of caching and because it would be repeated for each combination of input sub-joins), or else risk nasty bugs. Anyone who's really intent on doing something outside this restriction can always use the set_join_pathlist_hook. * Rename fdw_ps_tlist/custom_ps_tlist to fdw_scan_tlist/custom_scan_tlist to better reflect what they're for, and allow these custom scan tlists to be used even for base relations. * Change make_foreignscan() API to include passing the fdw_scan_tlist value, since the FDW is required to set that. Backwards compatibility doesn't seem like an adequate reason to expect FDWs to set it in some ad-hoc extra step, and anyway existing FDWs can just pass NIL. * Change the API of path-generating subroutines of add_paths_to_joinrel, and in particular that of GetForeignJoinPaths and set_join_pathlist_hook, so that various less-used parameters are passed in a struct rather than as separate parameter-list entries. The objective here is to reduce the probability that future additions to those parameter lists will result in source-level API breaks for users of these hooks. It's possible that this is even a small win for the core code, since most CPU architectures can't pass more than half a dozen parameters efficiently anyway. 
I kept root, joinrel, outerrel, innerrel, and jointype as separate parameters to reduce code churn in joinpath.c --- in particular, putting jointype into the struct would have been problematic because of the subroutines' habit of changing their local copies of that variable. * Avoid ad-hocery in ExecAssignScanProjectionInfo. It was probably all right for it to know about IndexOnlyScan, but if the list is to grow we should refactor the knowledge out to the callers. * Restore nodeForeignscan.c's previous use of the relcache to avoid extra GetFdwRoutine lookups for base-relation scans. * Lots of cleanup of documentation and missed comments. Re-order some code additions into more logical places.	2015-05-10 14:36:36 -04:00
Andrew Dunstan	cb9fa802b3	Add new OID alias type regnamespace Catalog version bumped Kyotaro HORIGUCHI	2015-05-09 13:36:52 -04:00
Andrew Dunstan	0c90f6769d	Add new OID alias type regrole The new type has the scope of the whole database cluster, so it doesn't behave the same as the existing OID alias types, which have database scope, concerning object dependency. To avoid confusion, constants of the new type are prohibited from appearing where dependencies are made involving it. Also, add a note to the docs about possible MVCC violation and optimization issues, which are common to all the reg* types. Kyotaro Horiguchi	2015-05-09 13:06:49 -04:00
Stephen Frost	4b342fb591	Bump catversion for pg_file_settings Pointed out by Andres (thanks!) Apologies for not including it in the initial patch.	2015-05-08 19:14:32 -04:00
Stephen Frost	a97e0c3354	Add pg_file_settings view and function The function and view added here provide a way to look at all settings in postgresql.conf, any #include'd files, and postgresql.auto.conf (which is what backs the ALTER SYSTEM command). The information returned includes the configuration file name, line number in that file, sequence number indicating when the parameter is loaded (useful to see if it is later masked by another definition of the same parameter), parameter name, and what it is set to at that point. This information is updated on reload of the server. This is unfiltered, privileged, information and therefore access is restricted to superusers through the GRANT system. Author: Sawada Masahiko, various improvements by me. Reviewers: David Steele	2015-05-08 19:09:26 -04:00
Heikki Linnakangas	de7688442f	At promotion, archive last segment from old timeline with .partial suffix. Previously, we would archive the possibly-incomplete WAL segment with its normal filename, but that causes trouble if the server owning that timeline is still running, and tries to archive the same segment later. It's not nice for the standby to trip up the master's archival like that. And it's pretty confusing, anyway, to have an incomplete segment in the archive that's indistinguishable from a normal, complete segment. To avoid such confusion, add a .partial suffix to the file. Or to be more precise, make a copy of the old segment under the .partial suffix, and archive that instead of the original file. pg_receivexlog also uses the .partial suffix for the same purpose, to tell apart incompletely streamed files from complete ones. There is no automatic mechanism to use the .partial files at recovery, so they will go unused, unless the administrator manually copies them to the pg_xlog directory (and removes the .partial suffix). Recovery won't normally need the WAL - when recovering to the new timeline, it will find the same WAL on the first segment on the new timeline instead - but it nevertheless feels better to archive the file with the .partial suffix, for debugging purposes if nothing else.	2015-05-08 21:59:01 +03:00
Heikki Linnakangas	179cdd0981	Add macros to check if a filename is a WAL segment or other such file. We had many instances of the strlen + strspn combination to check for that. This makes the code a bit easier to read.	2015-05-08 21:58:57 +03:00
Andres Freund	e8898e9169	Minor ON CONFLICT related comments and doc fixes. Geoff Winkless, Stephen Frost, Peter Geoghegan and me.	2015-05-08 19:24:14 +02:00
Robert Haas	53bb309d2d	Teach autovacuum about multixact member wraparound. The logic introduced in commit `b69bf30b9b` and repaired in commits `669c7d20e6` and `7be47c56af` helps to ensure that we don't overwrite old multixact member information while it is still needed, but a user who creates many large multixacts can still exhaust the member space (and thus start getting errors) while autovacuum stands idly by. To fix this, progressively ramp down the effective value (but not the actual contents) of autovacuum_multixact_freeze_max_age as member space utilization increases. This makes autovacuum more aggressive and also reduces the threshold for a manual VACUUM to perform a full-table scan. This patch leaves unsolved the problem of ensuring that emergency autovacuums are triggered even when autovacuum=off. We'll need to fix that via a separate patch. Thomas Munro and Robert Haas	2015-05-08 12:53:00 -04:00
Andres Freund	168d5805e4	Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE. The newly added ON CONFLICT clause allows specifying an alternative to raising a unique or exclusion constraint violation error when inserting. ON CONFLICT refers to constraints that can either be specified using an inference clause (by specifying the columns of a unique constraint) or by naming a unique or exclusion constraint. DO NOTHING avoids the constraint violation, without touching the pre-existing row. DO UPDATE SET ... [WHERE ...] updates the pre-existing tuple, and has access to both the tuple proposed for insertion and the existing tuple; the optional WHERE clause can be used to prevent an update from being executed. The UPDATE SET and WHERE clauses have access to the tuple proposed for insertion using the "magic" EXCLUDED alias, and to the pre-existing tuple using the table name or its alias. This feature is often referred to as upsert. This is implemented using a new infrastructure called "speculative insertion". It is an optimistic variant of regular insertion that first does a pre-check for existing tuples and then attempts an insert. If a violating tuple was inserted concurrently, the speculatively inserted tuple is deleted and a new attempt is made. If the pre-check finds a matching tuple the alternative DO NOTHING or DO UPDATE action is taken. If the insertion succeeds without detecting a conflict, the tuple is deemed inserted. To handle the possible ambiguity between the excluded alias and a table named excluded, and for convenience with long relation names, INSERT INTO now can alias its target table. Bumps catversion as stored rules change. Author: Peter Geoghegan, with significant contributions from Heikki Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes. Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs, Dean Rasheed, Stephen Frost and many others.	2015-05-08 05:43:10 +02:00
Andres Freund	2c8f4836db	Represent columns requiring insert and update privileges independently. Previously, relation range table entries used a single Bitmapset field representing which columns required either UPDATE or INSERT privileges, despite the fact that INSERT and UPDATE privileges are separately cataloged, and may be independently held. As statements so far required either insert or update privileges but never both, that was sufficient. The required permission could be inferred from the top level statement run. The upcoming INSERT ... ON CONFLICT UPDATE feature needs to independently check for both privileges in one statement though, so that is not sufficient anymore. Bumps catversion as stored rules change. Author: Peter Geoghegan Reviewed-By: Andres Freund	2015-05-08 00:20:46 +02:00
Alvaro Herrera	db5f98ab4f	Improve BRIN infra, minmax opclass and regression test The minmax opclass was using the wrong support functions when cross-datatypes queries were run. Instead of trying to fix the pg_amproc definitions (which apparently is not possible), use the already correct pg_amop entries instead. This requires jumping through more hoops (read: extra syscache lookups) to obtain the underlying functions to execute, but it is necessary for correctness. Author: Emre Hasegeli, tweaked by Álvaro Review: Andreas Karlsson Also change BrinOpcInfo to record each stored type's typecache entry instead of just the OID. Turns out that the full type cache is necessary in brin_deform_tuple: the original code used the indexed type's byval and typlen properties to extract the stored tuple, which is correct in Minmax; but in other implementations that want to store something different, that's wrong. The realization that this is a bug comes from Emre also, but I did not use his patch. I also adopted Emre's regression test code (with smallish changes), which is more complete.	2015-05-07 13:02:22 -03:00
Robert Haas	1998261034	Avoid using a C++ keyword as a structure member name. Per request from Peter Eisentraut.	2015-05-05 22:41:03 -04:00
Alvaro Herrera	3b6db1f445	Add geometry/range functions to support BRIN inclusion This commit adds the following functions: box(point) -> box bound_box(box, box) -> box inet_same_family(inet, inet) -> bool inet_merge(inet, inet) -> cidr range_merge(anyrange, anyrange) -> anyrange The first of these is also used to implement a new assignment cast from point to box. These functions are the first part of a base to implement an "inclusion" operator class for BRIN, for multidimensional data types. Author: Emre Hasegeli Reviewed by: Andreas Karlsson	2015-05-05 15:22:24 -03:00
Tom Lane	2503982be4	Improve procost estimates for some text search functions. The text search functions that involve parsing raw text into lexemes are remarkably CPU-intensive, so estimating them at the same cost as most other built-in functions seems like a mistake; moreover, doing so turns out to discourage the optimizer from using functional indexes on these functions. After some debate, we've agreed to raise procost from 1 to 100 for to_tsvector(), plainto_tsvector(), to_tsquery(), ts_headline(), ts_match_tt(), and ts_match_tq(), which are all the text search functions that parse raw text. Also increase procost for the 2-argument form of ts_rewrite() (tsquery_rewrite_query); while this function doesn't do text parsing, it does execute a user-supplied SQL query, so its previous procost of 1 is clearly a drastic underestimate. It seems reasonable to assign it the same cost we assign to PL functions by default, so 100 is the number here too. I did not bother bumping catversion for this change, since it does not break catalog compatibility with the server executable nor result in any regression test changes. Per complaint from Andrew Gierth and subsequent discussion.	2015-05-04 15:38:57 -04:00
Robert Haas	2ce439f337	Recursively fsync() the data directory after a crash. Otherwise, if there's another crash, some writes from after the first crash might make it to disk while writes from before the crash fail to make it to disk. This could lead to data corruption. Back-patch to all supported versions. Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised by me.	2015-05-04 14:13:53 -04:00
Robert Haas	e7cb7ee145	Allow FDWs and custom scan providers to replace joins with scans. Foreign data wrappers can use this capability for so-called "join pushdown"; that is, instead of executing two separate foreign scans and then joining the results locally, they can generate a path which performs the join on the remote server and then is scanned locally. This commit does not extend postgres_fdw to take advantage of this capability; it just provides the infrastructure. Custom scan providers can use this in a similar way. Previously, it was only possible for a custom scan provider to scan a single relation. Now, it can scan an entire join tree, provided of course that it knows how to produce the same results that the join would have produced if executed normally. KaiGai Kohei, reviewed by Shigeru Hanada, Ashutosh Bapat, and me.	2015-05-01 08:50:35 -04:00
Andres Freund	2b22795b32	Copy editing of the replication origins patch. Michael Paquier and myself.	2015-05-01 12:22:13 +02:00
Andres Freund	1db12da85b	Fix unaligned memory access in xlog parsing due to replication origin patch. ParseCommitRecord() accessed xl_xact_origin directly. But the chunks in the commit record's data only have 4 byte alignment, whereas xl_xact_origin's members require 8 byte alignment on some platforms. Update comments to make note of that and copy the record to stack local storage before reading. With help from Stefan Kaltenbrunner in pinning down the buildfarm and verifying the fix.	2015-05-01 11:36:14 +02:00
Robert Haas	924bcf4f16	Create an infrastructure for parallel computation in PostgreSQL. This does four basic things. First, it provides convenience routines to coordinate the startup and shutdown of parallel workers. Second, it synchronizes various pieces of state (e.g. GUCs, combo CID mappings, transaction snapshot) from the parallel group leader to the worker processes. Third, it prohibits various operations that would result in unsafe changes to that state while parallelism is active. Finally, it propagates events that would result in an ErrorResponse, NoticeResponse, or NotifyResponse message being sent to the client from the parallel workers back to the master, from which they can then be sent on to the client. Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke. Suggestions and review from Andres Freund, Heikki Linnakangas, Noah Misch, Simon Riggs, Euler Taveira, and Jim Nasby.	2015-04-30 15:02:14 -04:00
Andres Freund	e0f26fc765	Correct replication origin's use of UINT16_MAX to PG_UINT16_MAX. We can't rely on UINT16_MAX being present, which is why we introduced PG_UINT16_MAX... Buildfarm animal bowerbird via Andrew Gierth.	2015-04-30 00:19:36 +02:00
Andres Freund	5aa2350426	Introduce replication progress tracking infrastructure. When implementing a replication solution on top of logical decoding, two related problems exist: * How to safely keep track of replication progress * How to change replication behavior, based on the origin of a row; e.g. to avoid loops in bi-directional replication setups The solution to these problems, as implemented here, consists of three parts: 1) 'replication origins', which identify nodes in a replication setup. 2) 'replication progress tracking', which remembers, for each replication origin, how far replay has progressed in an efficient and crash safe manner. 3) The ability to filter out changes performed at the behest of a replication origin during logical decoding; this allows complex replication topologies. E.g. by filtering all replayed changes out. Most of this could also be implemented in "userspace", e.g. by inserting additional rows containing origin information, but that ends up being much less efficient and more complicated. We don't want to require various replication solutions to reimplement logic for this independently. The infrastructure is intended to be generic enough to be reusable. This infrastructure also replaces the 'nodeid' infrastructure of commit timestamps. It is intended to provide all the former capabilities, except that there's only 2^16 different origins; but now they integrate with logical decoding. Additionally more functionality is accessible via SQL. Since the commit timestamp infrastructure has also been introduced in 9.5 (commit `73c986add`) changing the API is not a problem. For now the number of origins for which the replication progress can be tracked simultaneously is determined by the max_replication_slots GUC. That GUC is not a perfect match to configure this, but there doesn't seem to be sufficient reason to introduce a separate new one. Bumps both catversion and wal page magic. 
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer Discussion: 20150216002155.GI15326@awork2.anarazel.de, 20140923182422.GA15776@alap3.anarazel.de, 20131114172632.GE7522@alap2.anarazel.de	2015-04-29 19:30:53 +02:00
Stephen Frost	dcbf5948e1	Improve qual pushdown for RLS and SB views The original security barrier view implementation, on which RLS is built, prevented all non-leakproof functions from being pushed down to below the view, even when the function was not receiving any data from the view. This optimization improves on that situation: instead of checking strictly for non-leakproof functions, it checks for Vars being passed to non-leakproof functions and allows functions which do not accept arguments or whose arguments are not from the current query level (eg: constants can be particularly useful) to be pushed down. As discussed, this does mean that a function which is pushed down might gain some idea that there are rows meeting certain criteria based on the number of times the function is called, but this isn't a particularly new issue and the documentation in rules.sgml already addressed similar covert-channel risks. That documentation is updated to reflect that non-leakproof functions may be pushed down now, if they meet the above-described criteria. Author: Dean Rasheed, with a bit of rework to make things clearer, along with comment and documentation updates from me.	2015-04-27 12:29:42 -04:00
Andres Freund	6aab1f45ac	Fix various typos and grammar errors in comments. Author: Dmitriy Olshevskiy Discussion: 553D00A6.4090205@bk.ru	2015-04-26 18:42:31 +02:00
Peter Eisentraut	cac7658205	Add transforms feature This provides a mechanism for specifying conversions between SQL data types and procedural languages. As examples, there are transforms for hstore and ltree for PL/Perl and PL/Python. reviews by Pavel Stěhule and Andres Freund	2015-04-26 10:33:14 -04:00
Stephen Frost	e89bd02f58	Perform RLS WITH CHECK before constraints, etc The RLS capability is built on top of the WITH CHECK OPTION system which was added for auto-updatable views; however, unlike WCOs on views (which are mandated by the SQL spec to not fire until after all other constraints and checks are done), it makes much more sense for RLS checks to happen earlier than constraint and uniqueness checks. This patch reworks the structure which holds the WCOs a bit to be explicitly either VIEW or RLS checks and the RLS-related checks are done prior to the constraint and uniqueness checks. This also allows better error reporting as we are now reporting when a violation is due to a WITH CHECK OPTION and when it's due to an RLS policy violation, which was independently noted by Craig Ringer as being confusing. The documentation is also updated to include a paragraph about when RLS WITH CHECK handling is performed, as there have been a number of questions regarding that and the documentation was previously silent on the matter. Author: Dean Rasheed, with some kibitzing and comment changes by me.	2015-04-24 20:34:26 -04:00
Heikki Linnakangas	62420ae7d6	Move functions related to index maintenance to separate source file. There is enough code here to deserve a file of their own, not be buried in the middle of execUtils.c.	2015-04-24 09:33:23 +03:00
Stephen Frost	0bf22e0c8b	RLS fixes, new hooks, and new test module In prepend_row_security_policies(), defaultDeny was always true, so if there were any hook policies, the RLS policies on the table would just get discarded. Fixed to start off with defaultDeny as false and then properly set later if we detect that only the default deny policy exists for the internal policies. The infinite recursion detection in fireRIRrules() didn't properly manage the activeRIRs list in the case of WCOs, so it would incorrectly report infinite recursion if the same relation with RLS appeared more than once in the rtable, for example "UPDATE t ... FROM t ...". Further, the RLS expansion code in fireRIRrules() was handling RLS in the main loop through the rtable, which led to RTEs being visited twice if they contained sublink subqueries, which prepend_row_security_policies() attempted to handle by exiting early if the RTE already had securityQuals. That doesn't work, however, since if the query involved a security barrier view on top of a table with RLS, the RTE would already have securityQuals (from the view) by the time fireRIRrules() was invoked, and so the table's RLS policies would be ignored. This is fixed in fireRIRrules() by handling RLS in a separate loop at the end, after dealing with any other sublink subqueries, thus ensuring that each RTE is only visited once for RLS expansion. The inheritance planner code didn't correctly handle non-target relations with RLS, which would get turned into subqueries during planning. Thus an update of the form "UPDATE t1 ... FROM t2 ..." where t1 has inheritance and t2 has RLS quals would fail. Fix by making sure to copy in and update the securityQuals when they exist for non-target relations. process_policies() was adding WCOs to non-target relations, which is unnecessary, and could lead to a lot of wasted time in the rewriter and the planner. Fix by only adding WCO policies when working on the result relation. 
Also in process_policies, we should be copying the USING policies to the WITH CHECK policies on a per-policy basis, fix by moving the copying up into the per-policy loop. Lastly, as noted by Dean, we were simply adding policies returned by the hook provided to the list of quals being AND'd, meaning that they would actually restrict records returned and there was no option to have internal policies and hook-based policies work together permissively (as all internal policies currently work). Instead, explicitly add support for both permissive and restrictive policies by having a hook for each and combining the results appropriately. To ensure this is all done correctly, add a new test module (test_rls_hooks) to test the various combinations of internal, permissive, and restrictive hook policies. Largely from Dean Rasheed (thanks!): CAEZATCVmFUfUOwwhnBTcgi6AquyjQ0-1fyKd0T3xBWJvn+xsFA@mail.gmail.com Author: Dean Rasheed, though I added the new hooks and test module.	2015-04-22 12:01:06 -04:00
Andres Freund	cef939c347	Rename pg_replication_slot's new active_in to active_pid. In `d811c037ce` active_in was added but discussion since showed that active_pid is preferred as a name. Discussion: CAMsr+YFKgZca5_7_ouaMWxA5PneJC9LNViPzpDHusaPhU9pA7g@mail.gmail.com	2015-04-22 09:43:40 +02:00
Andres Freund	d811c037ce	Add 'active_in' column to pg_replication_slots. Right now it is visible whether a replication slot is active in any session, but not in which. Adding the active_in column, containing the pid of the backend having acquired the slot, makes it much easier to associate pg_replication_slots entries with the corresponding pg_stat_replication/pg_stat_activity row. This should have been done from the start, but I (Andres) dropped the ball there somehow. Author: Craig Ringer, revised by me Discussion: CAMsr+YFKgZca5_7_ouaMWxA5PneJC9LNViPzpDHusaPhU9pA7g@mail.gmail.com	2015-04-21 11:51:06 +02:00
Bruce Momjian	f92fc4c95d	pg_upgrade: binary_upgrade_create_empty_extension() is strict Was broken by commit `30982be4e5`. Patch by Jeff Janes	2015-04-17 20:08:42 -04:00
Peter Eisentraut	30982be4e5	Integrate pg_upgrade_support module into backend Previously, these functions were created in a schema "binary_upgrade", which was deleted after pg_upgrade was finished. Because we don't want to keep that schema around permanently, move them to pg_catalog but rename them with a binary_upgrade_... prefix. The provided functions are only small wrappers around global variables that were added specifically for pg_upgrade use, so keeping the module separate does not create any modularity. The functions still check that they are only called in binary upgrade mode, so it is not possible to call these during normal operation. Reviewed-by: Michael Paquier <michael.paquier@gmail.com>	2015-04-14 19:26:37 -04:00
Heikki Linnakangas	b73e7a0716	Oops, fix misspelled #endif I hope this fixes the Windows buildfarm failures.	2015-04-14 22:00:52 +03:00
Heikki Linnakangas	3dc2d62d04	Use Intel SSE 4.2 CRC instructions where available. Modern x86 and x86-64 processors with SSE 4.2 support have special instructions, crc32b and crc32q, for calculating CRC-32C. They greatly speed up CRC calculation. Whether the instructions can be used or not depends on the compiler and the target architecture. If generation of SSE 4.2 instructions is allowed for the target (-msse4.2 flag on gcc and clang), use them. If they are not allowed by default, but the compiler supports the -msse4.2 flag to enable them, compile just the CRC-32C function with -msse4.2 flag, and check at runtime whether the processor we're running on supports it. If it doesn't, fall back to the slicing-by-8 algorithm. (With the common defaults on current operating systems, the runtime-check variant is what you get in practice.) Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund.	2015-04-14 17:05:03 +03:00
Heikki Linnakangas	4f700bcd20	Reorganize our CRC source files again. Now that we use CRC-32C in WAL and the control file, the "traditional" and "legacy" CRC-32 variants are not used in any frontend programs anymore. Move the code for those back from src/common to src/backend/utils/hash. Also move the slicing-by-8 implementation (back) to src/port. This is in preparation for next patch that will add another implementation that uses Intel SSE 4.2 instructions to calculate CRC-32C, where available.	2015-04-14 17:03:42 +03:00
Heikki Linnakangas	b2a5545bd6	Don't archive bogus recycled or preallocated files after timeline switch. After a timeline switch, we would leave behind recycled WAL segments that are in the future, but on the old timeline. After promotion, and after they become old enough to be recycled again, we would notice that they don't have a .ready or .done file, create a .ready file for them, and archive them. That's bogus, because the files contain garbage, recycled from an older timeline (or prealloced as zeros). We shouldn't archive such files. This could happen when we're following a timeline switch during replay, or when we switch to new timeline at end-of-recovery. To fix, whenever we switch to a new timeline, scan the data directory for WAL segments on the old timeline, but with a higher segment number, and remove them. Those don't belong to our timeline history, and are most likely bogus recycled or preallocated files. They could also be valid files that we streamed from the primary ahead of time, but in any case, they're not needed to recover to the new timeline.	2015-04-13 16:53:49 +03:00
Magnus Hagander	9029f4b374	Add system view pg_stat_ssl This view shows information about all connections, such as if the connection is using SSL, which cipher is used, and which client certificate (if any) is used. Reviews by Alex Shulgin, Heikki Linnakangas, Andres Freund & Michael Paquier	2015-04-12 19:07:46 +02:00
Alvaro Herrera	27846f02c1	Optimize locking a tuple already locked by another subxact Locking and updating the same tuple repeatedly led to some strange multixacts being created which had several subtransactions of the same parent transaction holding locks of the same strength. However, once a subxact of the current transaction holds a lock of a given strength, it's not necessary to acquire the same lock again. This made some coding patterns much slower than required. The fix is twofold. First we change HeapTupleSatisfiesUpdate to return HeapTupleBeingUpdated for the case where the current transaction is already a single-xid locker for the given tuple; it used to return HeapTupleMayBeUpdated for that case. The new logic is simpler, and the change to pgrowlocks is a testament to that: previously we needed to check for the single-xid locker separately in a very ugly way. That test is simpler now. As fallout from the HTSU change, some of its callers need to be amended so that tuple-locked-by-own-transaction is taken into account in the BeingUpdated case rather than the MayBeUpdated case. For many of them there is no difference; but heap_delete() and heap_update now check explicitly and do not grab tuple lock in that case. The HTSU change also means that routine MultiXactHasRunningRemoteMembers introduced in commit `11ac4c73cb` is no longer necessary and can be removed; the case that used to require it is now handled naturally as result of the changes to heap_delete and heap_update. The second part of the fix to the performance issue is to adjust heap_lock_tuple to avoid the slowness: 1. Previously we checked for the case that our own transaction already held a strong enough lock and returned MayBeUpdated, but only in the multixact case. Now we do it for the plain Xid case as well, which saves having to LockTuple. 2. 
If the current transaction is the only locker of the tuple (but with a lock not as strong as what we need; otherwise it would have been caught in the check mentioned above), we can skip sleeping on the multixact, and instead go straight to create an updated multixact with the additional lock strength. 3. Most importantly, make sure that both the single-xid-locker case and the multixact-locker case optimization are applied always. We do this by checking both in a single place, rather than them appearing in two separate portions of the routine -- something that is made possible by the HeapTupleSatisfiesUpdate API change. Previously we would only check for the single-xid case when HTSU returned MayBeUpdated, and only checked for the multixact case when HTSU returned BeingUpdated. This was at odds with what HTSU actually returned in one case: if our own transaction was locker in a multixact, it returned MayBeUpdated, so the optimization never applied. This is what led to the large multixacts in the first place. Per bug report #8470 by Oskari Saarenmaa.	2015-04-10 13:47:15 -03:00
Alvaro Herrera	e9a077cad3	pg_event_trigger_dropped_objects: add is_temp column It now also reports temporary objects dropped that are local to the backend. Previously we weren't reporting any temp objects because it was deemed unnecessary; but as it turns out, it is necessary if we want to keep close track of DDL command execution inside one session. Temp objects are reported as living in schema pg_temp, which works because such a schema-qualification always refers to the temp objects of the current session.	2015-04-06 11:40:55 -03:00
Alvaro Herrera	4ff695b17d	Add log_min_autovacuum_duration per-table option This is useful to control autovacuum log volume, for situations where monitoring only a set of tables is necessary. Author: Michael Paquier Reviewed by: A team led by Naoya Anzai (also including Akira Kurosawa, Taiki Kondo, Huong Dangminh), Fujii Masao.	2015-04-03 11:55:50 -03:00
Fujii Masao	8c8a886268	Add palloc_extended for frontend and backend. This commit also adds pg_malloc_extended for frontend. These interfaces can be used to control at a lower level memory allocation using an interface similar to MemoryContextAllocExtended. For example, the callers can specify MCXT_ALLOC_NO_OOM if they want to suppress the "out of memory" error while allocating the memory and handle a NULL return value. Michael Paquier, reviewed by me.	2015-04-03 17:36:12 +09:00
Robert Haas	abd94bcac4	Use abbreviated keys for faster sorting of numeric datums. Andrew Gierth, reviewed by Peter Geoghegan, with further tweaks by me.	2015-04-02 14:04:26 -04:00
Andres Freund	62e2a8dc2c	Define integer limits independently from the system definitions. In `83ff1618` we defined integer limits iff they're not provided by the system. That turns out not to be the greatest idea because there's different ways some datatypes can be represented. E.g. on OSX PG's 64bit datatype will be a 'long int', but OSX unconditionally uses 'long long'. That disparity then can lead to warnings, e.g. around printf formats. One way to fix that would be to back int64 using stdint.h's int64_t. While a good idea it's not that easy to implement. We would e.g. need to include stdint.h in our external headers, which we don't today. Also computing the correct int64 printf formats in that case is nontrivial. Instead simply prefix the integer limits with PG_ and define them unconditionally. I've adjusted all the references to them in code, but not the ones in comments; the latter seems unnecessary to me. Discussion: 20150331141423.GK4878@alap3.anarazel.de	2015-04-02 17:43:35 +02:00
Robert Haas	4cd639baf4	Revert "psql: fix \connect with URIs and conninfo strings" This reverts commit `fcef161729`, about which both the buildfarm and my local machine are very unhappy.	2015-04-02 10:10:22 -04:00
Alvaro Herrera	fcef161729	psql: fix \connect with URIs and conninfo strings psql was already accepting conninfo strings as the first parameter in \connect, but the way it worked wasn't sane; some of the other parameters would get the previous connection's values, causing it to connect to a completely unexpected server or, more likely, not finding any server at all because of completely wrong combinations of parameters. Fix by explicitly checking for a conninfo-looking parameter in the dbname position; if one is found, use its complete specification rather than mix with the other arguments. Also, change tab-completion to not try to complete conninfo/URI-looking "dbnames" and document that conninfos are accepted as first argument. There was a weak consensus to backpatch this, because while the behavior of using the dbname as a conninfo is nowhere documented for \connect, it is reasonable to expect that it works because it does work in many other contexts. Therefore this is backpatched all the way back to 9.0. To implement this, routines previously private to libpq have been duplicated so that psql can decide what looks like a conninfo/URI string. In back branches, just duplicate the same code all the way back to 9.2, where URIs were introduced; 9.0 and 9.1 have a simpler version. In master, the routines are moved to src/common and renamed. Author: David Fetter, Andrew Dunstan. Some editorialization by me (probably earning a Gierth's "Sloppy" badge in the process.) Reviewers: Andrew Gierth, Erik Rijkers, Pavel Stěhule, Stephen Frost, Robert Haas, Andrew Dunstan.	2015-04-01 20:00:07 -03:00
Heikki Linnakangas	f770870d9e	Move inet/cidr GiST opclass functions to correct place in header file. They were accidentally placed under the GIN heading. Andreas Karlsson	2015-04-01 19:20:45 +03:00
Andrew Dunstan	fa1e5afa8a	Run pg_upgrade and pg_resetxlog with restricted token on Windows As with initdb these programs need to run with a restricted token, and if they don't pg_upgrade will fail when run as a user with Administrator privileges. Backpatch to all live branches. On the development branch the code is reorganized so that the restricted token code is now in a single location. On the stable branches a less invasive change is made by simply copying the relevant code to pg_upgrade.c and pg_resetxlog.c. Patches and bug report from Muhammad Asif Naeem, reviewed by Michael Paquier, slightly edited by me.	2015-03-30 17:07:52 -04:00
Alvaro Herrera	97690ea6e8	Change array_offset to return subscripts, not offsets ... and rename it and its sibling array_offsets to array_position and array_positions, to account for the changed behavior. Having the functions return subscripts better matches existing practice, and is better suited to using the result value as a subscript into the array directly. For one-based arrays, the new definition is identical to what was originally committed. (We use the term "subscript" in the documentation, which is what we use whenever we talk about arrays; but the functions themselves are named using the word "position" to match the standard-defined POSITION() functions.) Author: Pavel Stěhule Behavioral problem noted by Dean Rasheed.	2015-03-30 16:13:21 -03:00
Heikki Linnakangas	0633a60f4d	Add index-only scan support to range type GiST opclass. Andreas Karlsson	2015-03-30 13:22:38 +03:00
Heikki Linnakangas	3a20b0e7b6	Add index-only scan support to inet GiST opclass. Andreas Karlsson	2015-03-28 15:11:53 +02:00
Heikki Linnakangas	55b59eda13	Fix GiST index-only scans for opclasses with different storage type. We cannot use the index's tuple descriptor directly to describe the index tuples returned in an index-only scan. That's because the index might use a different datatype for the values stored on disk than the type originally indexed. As long as they were both pass-by-ref, it worked, but will not work for pass-by-value types of different sizes. I noticed this as a crash when I started hacking a patch to add fetch methods to btree_gist.	2015-03-26 23:07:52 +02:00
Tom Lane	785941cdc3	Tweak __attribute__-wrapping macros for better pgindent results. This improves on commit `bbfd7edae5` by making two simple changes: * pg_attribute_noreturn now takes parentheses, ie pg_attribute_noreturn(). Likewise pg_attribute_unused(), pg_attribute_packed(). This reduces pgindent's tendency to misformat declarations involving them. * attributes are now always attached to function declarations, not definitions. Previously some places were taking creative shortcuts, which were not merely candidates for bad misformatting by pgindent but often were outright wrong anyway. (It does little good to put a noreturn annotation where callers can't see it.) In any case, if we would like to believe that these macros can be used with non-gcc compilers, we should avoid gratuitous variance in usage patterns. I also went through and manually improved the formatting of a lot of declarations, and got rid of excessively repetitive (and now obsolete anyway) comments informing the reader what pg_attribute_printf is for.	2015-03-26 14:03:25 -04:00
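The two conventions described in this commit, attribute macros that take parentheses and attributes attached to declarations rather than definitions, can be sketched as follows. The `sketch_*` macro and function names are hypothetical and are defined locally here; they are not the definitions in PostgreSQL's c.h, and the empty fallbacks for non-gcc compilers are a simplification made for this example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrappers; note that the no-argument one still takes "()". */
#if defined(__GNUC__)
#define sketch_attribute_printf(f, a) __attribute__((format(printf, f, a)))
#define sketch_attribute_noreturn() __attribute__((noreturn))
#else
#define sketch_attribute_printf(f, a)
#define sketch_attribute_noreturn()
#endif

/* The declarations carry the attributes, so every caller can see them. */
int			sketch_format(char *buf, size_t len, const char *fmt,...) sketch_attribute_printf(3, 4);
void		sketch_die(const char *msg) sketch_attribute_noreturn();

/* The definitions stay plain. */
int
sketch_format(char *buf, size_t len, const char *fmt,...)
{
	va_list		args;
	int			n;

	va_start(args, fmt);
	n = vsnprintf(buf, len, fmt, args);
	va_end(args);
	return n;
}

void
sketch_die(const char *msg)
{
	fprintf(stderr, "fatal: %s\n", msg);
	exit(EXIT_FAILURE);
}
```

With the attribute on the declaration, a compiler that understands it can check format strings at each call site of `sketch_format` and knows that control does not return from `sketch_die`.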
Heikki Linnakangas	d04c8ed904	Add support for index-only scans in GiST. This adds a new GiST opclass method, 'fetch', which is used to reconstruct the original Datum from the value stored in the index. Also, the 'canreturn' index AM interface function gains a new 'attno' argument. That makes it possible to use index-only scans on a multi-column index where some of the opclasses support index-only scans but some do not. This patch adds support in the box and point opclasses. Other opclasses can be added later as follow-on patches (btree_gist would be particularly interesting). Anastasia Lubennikova, with additional fixes and modifications by me.	2015-03-26 19:12:00 +02:00
Heikki Linnakangas	8fa393a6d7	Minor cleanup of GiST code, for readability. Remove the gistcentryinit function, inlining the relevant part of it into the only caller.	2015-03-26 19:11:54 +02:00
Tatsuo Ishii	656ea810e5	Make SyncRepWakeQueue a static function It is only used in src/backend/replication/syncrep.c. Back-patch to all supported branches except 9.1, which already declares the function as static.	2015-03-26 10:34:08 +09:00
Andres Freund	83ff1618bc	Centralize definition of integer limits. Several submitted and even committed patches have run into the problem that C89, our baseline, does not provide minimum/maximum values for various integer datatypes. C99's stdint.h does, but we can't rely on it. Several parts of the code defined limits locally, so instead centralize the definitions to c.h. This patch also changes the more obvious usages of literal limit values; there's more places that could be changed, but it's less clear whether it's beneficial to change those. Author: Andrew Gierth Discussion: 87619tc5wc.fsf@news-spur.riddles.org.uk	2015-03-25 22:39:42 +01:00
Alvaro Herrera	bdc3d7fa23	Return ObjectAddress in many ALTER TABLE sub-routines Since commit `a2e35b53c3`, most CREATE and ALTER commands return the ObjectAddress of the affected object. This is useful for event triggers to try to figure out exactly what happened. This patch extends this idea a bit further to cover ALTER TABLE as well: an auxiliary ObjectAddress is returned for each of several subcommands of ALTER TABLE. This makes it possible to decode with precision what happened during execution of any ALTER TABLE command; for instance, which constraint was added by ALTER TABLE ADD CONSTRAINT, or which parent got dropped from the parents list by ALTER TABLE NO INHERIT. As with the previous patch, there is no immediate user-visible change here. This is all really just continuing what `c504513f83` started. Reviewed by Stephen Frost.	2015-03-25 17:17:56 -03:00
Kevin Grittner	2ed5b87f96	Reduce pinning and buffer content locking for btree scans. Even though the main benefit of the Lehman and Yao algorithm for btrees is that no locks need be held between page reads in an index search, we were holding a buffer pin on each leaf page after it was read until we were ready to read the next one. The reason was so that we could treat this as a weak lock to create an "interlock" with vacuum's deletion of heap line pointers, even though our README file pointed out that this was not necessary for a scan using an MVCC snapshot. The main goal of this patch is to reduce the blocking of vacuum processes by in-progress btree index scans (including a cursor which is idle), but the code rearrangement also allows for one less buffer content lock to be taken when a forward scan steps from one page to the next, which results in a small but consistent performance improvement in many workloads. This patch leaves behavior unchanged for some cases, which can be addressed separately so that each case can be evaluated on its own merits. These unchanged cases are when a scan uses a non-MVCC snapshot, an index-only scan, and a scan of a btree index for which modifications are not WAL-logged. If later patches allow all of these cases to drop the buffer pin after reading a leaf page, then the btree vacuum process can be simplified; it will no longer need the "super-exclusive" lock to delete tuples from a page. Reviewed by Heikki Linnakangas and Kyotaro Horiguchi	2015-03-25 14:24:43 -05:00
Alvaro Herrera	8217fb1441	Add OID output argument to DefineTSConfiguration ... which is set to the OID of a copied text search config, whenever the COPY clause is used. This is in the spirit of commit `a2e35b53c3`.	2015-03-25 15:57:08 -03:00
Tom Lane	cb1ca4d800	Allow foreign tables to participate in inheritance. Foreign tables can now be inheritance children, or parents. Much of the system was already ready for this, but we had to fix a few things of course, mostly in the area of planner and executor handling of row locks. As side effects of this, allow foreign tables to have NOT VALID CHECK constraints (and hence to accept ALTER ... VALIDATE CONSTRAINT), and to accept ALTER SET STORAGE and ALTER SET WITH/WITHOUT OIDS. Continuing to disallow these things would've required bizarre and inconsistent special cases in inheritance behavior. Since foreign tables don't enforce CHECK constraints anyway, a NOT VALID one is a complete no-op, but that doesn't mean we shouldn't allow it. And it's possible that some FDWs might have use for SET STORAGE or SET WITH OIDS, though doubtless they will be no-ops for most. An additional change in support of this is that when a ModifyTable node has multiple target tables, they will all now be explicitly identified in EXPLAIN output, for example: Update on pt1 (cost=0.00..321.05 rows=3541 width=46) Update on pt1 Foreign Update on ft1 Foreign Update on ft2 Update on child3 -> Seq Scan on pt1 (cost=0.00..0.00 rows=1 width=46) -> Foreign Scan on ft1 (cost=100.00..148.03 rows=1170 width=46) -> Foreign Scan on ft2 (cost=100.00..148.03 rows=1170 width=46) -> Seq Scan on child3 (cost=0.00..25.00 rows=1200 width=46) This was done mainly to provide an unambiguous place to attach "Remote SQL" fields, but it is useful for inherited updates even when no foreign tables are involved. Shigeru Hanada and Etsuro Fujita, reviewed by Ashutosh Bapat and Kyotaro Horiguchi, some additional hacking by me	2015-03-22 13:53:21 -04:00
Bruce Momjian	1c7087af42	Add TOAST table to pg_shseclabel for long label use Report by Andres Freund	2015-03-21 22:14:49 -04:00
Bruce Momjian	34afbba84e	Use mmap MAP_NOSYNC option to limit shared memory writes mmap() is rarely used for shared memory, but when it is, this option is useful, particularly on the BSDs. Patch by Sean Chittenden	2015-03-21 22:06:19 -04:00
Andres Freund	959277a4f5	Use 128-bit math to accelerate some aggregation functions. On platforms where we support 128bit integers, use them to implement faster transition functions for sum(int8), avg(int8), var_(int2/int4),stdev_(int2/int4). Where not supported continue to use numeric as a transition type. In some synthetic benchmarks this has been shown to provide significant speedups. Bumps catversion. Discussion: 544BB5F1.50709@proxel.se Author: Andreas Karlsson Reviewed-By: Peter Geoghegan, Petr Jelinek, Andres Freund, Oskari Saarenmaa, David Rowley	2015-03-20 10:29:32 +01:00
Andres Freund	8122e1437e	Add optional support for 128bit integers. We will, for the foreseeable future, not expose 128 bit datatypes to SQL. But being able to use 128bit math will allow us, in a later patch, to use 128bit accumulators for some aggregates, leading to noticeable speedups over using numeric. So far we only detect a gcc/clang extension that supports 128bit math, but no 128bit literals, and no *printf support. We might want to expand this in the future to further compilers, if there are any that provide similar support. Discussion: 544BB5F1.50709@proxel.se Author: Andreas Karlsson, with significant editorializing by me Reviewed-By: Peter Geoghegan, Oskari Saarenmaa	2015-03-20 10:26:17 +01:00
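The idea of a 128-bit accumulator can be sketched with the gcc/clang `__int128` extension mentioned in this commit. The names `sketch_int128`, `sketch_sum_int64`, and `sketch_fits_int64` are hypothetical and chosen for this example; this is not the transition-function code from the PostgreSQL tree, and it only compiles where the compiler offers `__int128`.

```c
#include <stddef.h>
#include <stdint.h>

/* gcc/clang extension; not standard C, so this is a platform assumption. */
typedef __int128 sketch_int128;

/*
 * Sum int64 values into a 128-bit accumulator. The wider type leaves
 * ample headroom, so the running sum does not overflow the way an
 * int64 accumulator could.
 */
static sketch_int128
sketch_sum_int64(const int64_t *values, size_t n)
{
	sketch_int128 acc = 0;

	for (size_t i = 0; i < n; i++)
		acc += values[i];
	return acc;
}

/* Report whether a 128-bit value still fits in an int64. */
static int
sketch_fits_int64(sketch_int128 v)
{
	return v >= INT64_MIN && v <= INT64_MAX;
}
```

Because there is no `*printf` support for the type, the sketch inspects results by comparison rather than by printing them.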
Robert Haas	12968cf408	Add flags argument to dsm_create. Right now, there's only one flag, DSM_CREATE_NULL_IF_MAXSEGMENTS, which suppresses the error that would normally be thrown when the maximum number of segments already exists, instead returning NULL. It might be useful to add more flags in the future, such as one to ignore allocation errors, but I haven't done that here.	2015-03-19 13:03:03 -04:00
Alvaro Herrera	13dbc7a824	array_offset() and array_offsets() These functions return the offset position or positions of a value in an array. Author: Pavel Stěhule Reviewed by: Jim Nasby	2015-03-18 16:01:34 -03:00
Alvaro Herrera	0d83138974	Rationalize vacuuming options and parameters We were involving the parser too much in setting up initial vacuuming parameters. This patch moves that responsibility elsewhere to simplify code, and also to make future additions easier. To do this, create a new struct VacuumParams which is filled just prior to vacuum execution, instead of at parse time; for user-invoked vacuuming this is set up in a new function ExecVacuum, while autovacuum sets it up by itself. While at it, add a new member VACOPT_SKIPTOAST to enum VacuumOption, only set by autovacuum, which is used to disable vacuuming of the toast table instead of the old do_toast parameter; this relieves the argument list of vacuum() and some callees a bit. This partially makes up for having added more arguments in an effort to avoid having autovacuum construct a VacuumStmt parse node. Author: Michael Paquier. Some tweaks by Álvaro Reviewed by: Robert Haas, Stephen Frost, Álvaro Herrera	2015-03-18 11:52:33 -03:00
Alvaro Herrera	a61fd5334e	Support opfamily members in get_object_address In the spirit of `890192e99a` and `4464303405`: have get_object_address understand individual pg_amop and pg_amproc objects. There is no way to refer to such objects directly in the grammar -- rather, they are almost always considered an integral part of the opfamily that contains them. (The only case that deals with them individually is ALTER OPERATOR FAMILY ADD/DROP, which carries the opfamily address separately and thus does not need it to be part of each added/dropped element's address.) In event triggers it becomes possible to become involved with individual amop/amproc elements, and this commit enables pg_get_object_address to do so as well. To make the overall coding simpler, this commit also slightly changes the get_object_address representation for opclasses and opfamilies: instead of having the AM name in the objargs array, I moved it as the first element of the objnames array. This enables the new code to use objargs for the type names used by pg_amop and pg_amproc. Reviewed by: Stephen Frost	2015-03-16 12:06:34 -03:00
Tom Lane	7b8b8a4331	Improve representation of PlanRowMark. This patch fixes two inadequacies of the PlanRowMark representation. First, that the original LockingClauseStrength isn't stored (and cannot be inferred for foreign tables, which always get ROW_MARK_COPY). Since some PlanRowMarks are created out of whole cloth and don't actually have an ancestral RowMarkClause, this requires adding a dummy LCS_NONE value to enum LockingClauseStrength, which is fairly annoying but the alternatives seem worse. This fix allows getting rid of the use of get_parse_rowmark() in FDWs (as per the discussion around commits `462bd95705` and `8ec8760fc8`), and it simplifies some things elsewhere. Second, that the representation assumed that all child tables in an inheritance hierarchy would use the same RowMarkType. That's true today but will soon not be true. We add an "allMarkTypes" field that identifies the union of mark types used in all a parent table's children, and use that where appropriate (currently, only in preprocess_targetlist()). In passing fix a couple of minor infelicities left over from the SKIP LOCKED patch, notably that _outPlanRowMark still thought waitPolicy is a bool. Catversion bump is required because the numeric values of enum LockingClauseStrength can appear in on-disk rules. Extracted from a much larger patch to support foreign table inheritance; it seemed worth breaking this out, since it's a separable concern. Shigeru Hanada and Etsuro Fujita, somewhat modified by me	2015-03-15 18:41:47 -04:00
Tom Lane	9fac5fd741	Move LockClauseStrength, LockWaitPolicy into new file nodes/lockoptions.h. Commit `df630b0dd5` moved enum LockWaitPolicy into its very own header file utils/lockwaitpolicy.h, which does not seem like a great idea from here. First, it's still a node-related declaration, and second, a file named like that can never sensibly be used for anything else. I do not think we want to encourage a one-typedef-per-header-file approach. The upcoming foreign table inheritance patch was doubling down on this bad idea by moving enum LockClauseStrength into its own can-never-be-used-for-anything-else file. Instead, let's put them both in a file named nodes/lockoptions.h. (They do seem to need a separate header file because we need them in both parsenodes.h and plannodes.h, and we don't want either of those including the other. Past practice might suggest adding them to nodes/nodes.h, but they don't seem sufficiently globally useful to justify that.) Committed separately since there's no functional change here, just some header-file refactoring.	2015-03-15 15:19:04 -04:00
Andres Freund	4f1b890b13	Merge the various forms of transaction commit & abort records. Since `465883b0a` two versions of commit records have existed. A compact version that was used when no cache invalidations, smgr unlinks and similar were needed, and a full version that could deal with all that. Additionally the full version was embedded into twophase commit records. That resulted in a measurable reduction in the size of the logged WAL in some workloads. But more recently additions like logical decoding, which e.g. needs information about the database something was executed on, made it applicable in fewer situations. The static split generally made it hard to expand the commit record, because concerns over the size made it hard to add anything to the compact version. Additionally it's not particularly pretty to have twophase.c insert RM_XACT records. Rejigger things so that the commit and abort records only have one form each, including the twophase equivalents. The presence of the various optional (in the sense of not being in every record) pieces is indicated by a bits in the 'xinfo' flag. That flag previously was not included in compact commit records. To prevent an increase in size due to its presence, it's only included if necessary; signalled by a bit in the xl_info bits available for xact.c, similar to heapam.c's XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE. Twophase commit/aborts are now the same as their normal counterparts. The original transaction's xid is included in an optional data field. This means that commit records generally are smaller, except in the case of a transaction with subtransactions, but no other special cases; the increase there is four bytes, which seems acceptable given that the more common case of not having subtransactions shrank. The savings are especially measurable for twophase commits, which previously always used the full version; but will in practice only infrequently have required that. 
The motivation for this work is not the space savings and deduplication, though; it's that it makes it easier to extend commit records with additional information. That's just a few lines of code now, without impacting the common case where that information is not needed. Discussion: 20150220152150.GD4149@awork2.anarazel.de, 235610.92468.qm%40web29004.mail.ird.yahoo.com Reviewed-By: Heikki Linnakangas, Simon Riggs	2015-03-15 17:37:07 +01:00
Tom Lane	91f4a5a976	Build src/port/dirmod.c only on Windows. Since commit `ba7c5975ad`, port/dirmod.c has contained only Windows-specific functions. Most platforms don't seem to mind uselessly building an empty file, but OS X for one issues warnings. Hence, treat dirmod.c as a Windows-specific file selected by configure rather than one that's always built. We can revert this change if dirmod.c ever gains any non-Windows functionality again. Back-patch to 9.4 where the mentioned commit appeared.	2015-03-14 14:08:45 -04:00
Tom Lane	f4abd0241d	Support flattening of empty-FROM subqueries and one-row VALUES tables. We can't handle this in the general case due to limitations of the planner's data representations; but we can allow it in many useful cases, by being careful to flatten only when we are pulling a single-row subquery up into a FROM (or, equivalently, inner JOIN) node that will still have at least one remaining relation child. Per discussion of an example from Kyotaro Horiguchi.	2015-03-11 23:18:03 -04:00
Tom Lane	b55722692b	Improve planner's cost estimation in the presence of semijoins. If we have a semijoin, say SELECT * FROM x WHERE x1 IN (SELECT y1 FROM y) and we're estimating the cost of a parameterized indexscan on x, the number of repetitions of the indexscan should not be taken as the size of y; it'll really only be the number of distinct values of y1, because the only valid plan with y on the outside of a nestloop would require y to be unique-ified before joining it to x. Most of the time this doesn't make that much difference, but sometimes it can lead to drastically underestimating the cost of the indexscan and hence choosing a bad plan, as pointed out by David Kubečka. Fixing this is a bit difficult because parameterized indexscans are costed out quite early in the planning process, before we have the information that would be needed to call estimate_num_groups() and thereby estimate the number of distinct values of the join column(s). However we can move the code that extracts a semijoin RHS's unique-ification columns, so that it's done in initsplan.c rather than on-the-fly in create_unique_path(). That shouldn't make any difference speed-wise and it's really a bit cleaner too. The other bit of information we need is the size of the semijoin RHS, which is easy if it's a single relation (we make those estimates before considering indexscan costs) but problematic if it's a join relation. The solution adopted here is just to use the product of the sizes of the join component rels. That will generally be an overestimate, but since estimate_num_groups() only uses this input as a clamp, an overestimate shouldn't hurt us too badly. In any case we don't allow this new logic to produce a value larger than we would have chosen before, so that at worst an overestimate leaves us no wiser than we were before.	2015-03-11 21:21:00 -04:00
Alvaro Herrera	4464303405	Support default ACLs in get_object_address In the spirit of `890192e99a`, this time add support for the things living in the pg_default_acl catalog. These are not really "objects", but they show up as such in event triggers. There is no "DROP DEFAULT PRIVILEGES" or similar command, so it doesn't look like the new representation given would be useful anywhere else, so I didn't try to use it outside objectaddress.c. (That might be a bug in itself, but that would be material for another commit.) Reviewed by Stephen Frost.	2015-03-11 19:23:47 -03:00
Alvaro Herrera	890192e99a	Support user mappings in get_object_address Since commit `72dd233d3e` we were trying to obtain object addressing information in sql_drop event triggers, but that caused failures when the drops involved user mappings. This addition enables that to work again. Naturally, pg_get_object_address can work with these objects now, too. I toyed with the idea of removing DropUserMappingStmt as a node and using DropStmt instead in the DropUserMappingStmt grammar production, but that didn't go very well: for one thing the messages thrown by the specific code are specialized (you get "server not found" if you specify the wrong server, instead of a generic "user mapping for ... not found" which you'd get if we were to merge this with RemoveObjects --- unless we added even more special cases). For another thing, it would require passing RoleSpec nodes through the objname/objargs representation used by RemoveObjects, which works in isolation, but gets messy when pg_get_object_address is involved. So I dropped this part for now. Reviewed by Stephen Frost.	2015-03-11 17:04:27 -03:00
Tom Lane	c6b3c939b7	Make operator precedence follow the SQL standard more closely. While the SQL standard is pretty vague on the overall topic of operator precedence (because it never presents a unified BNF for all expressions), it does seem reasonable to conclude from the spec for <boolean value expression> that OR has the lowest precedence, then AND, then NOT, then IS tests, then the six standard comparison operators, then everything else (since any non-boolean operator in a WHERE clause would need to be an argument of one of these). We were only sort of on board with that: most notably, while "<" ">" and "=" had properly low precedence, "<=" ">=" and "<>" were treated as generic operators and so had significantly higher precedence. And "IS" tests were even higher precedence than those, which is very clearly wrong per spec. Another problem was that "foo NOT SOMETHING bar" constructs, such as "x NOT LIKE y", were treated inconsistently because of a bison implementation artifact: they had the documented precedence with respect to operators to their right, but behaved like NOT (i.e., very low priority) with respect to operators to their left. Fixing the precedence issues is just a small matter of rearranging the precedence declarations in gram.y, except for the NOT problem, which requires adding an additional lookahead case in base_yylex() so that we can attach a different token precedence to NOT LIKE and allied two-word operators. The bulk of this patch is not the bug fix per se, but adding logic to parse_expr.c to allow giving warnings if an expression has changed meaning because of these precedence changes. These warnings are off by default and are enabled by the new GUC operator_precedence_warning. It's believed that very few applications will be affected by these changes, but it was agreed that a warning mechanism is essential to help debug any that are.	2015-03-11 13:22:52 -04:00
Robert Haas	e529cd4ffa	Suggest to the user the column they may have meant to reference. Error messages informing the user that no such column exists can sometimes provoke a perplexed response. This often happens due to a subtle typo in the column name or, perhaps less likely, in the alias name. To speed discovery of what the real issue is in such cases, we'll now search the range table for approximate matches. If there are one or two such matches that are good enough to think that they might be what the user intended to type, and better than all other approximate matches, we'll issue a hint suggesting that the user might have intended to reference those columns. Peter Geoghegan and Robert Haas	2015-03-11 10:44:04 -04:00
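The notion of an "approximate match" between a mistyped column name and the available columns can be illustrated with edit distance. The commit message does not say which metric or threshold is used, so the Levenshtein distance and the `max_distance` parameter below are assumptions of this sketch, and the `sketch_*` names are hypothetical. For brevity it returns only a single best candidate, whereas the commit describes hinting at one or two.

```c
#include <stdlib.h>
#include <string.h>

static size_t
sketch_min3(size_t a, size_t b, size_t c)
{
	size_t		m = (a < b) ? a : b;

	return (m < c) ? m : c;
}

/*
 * Classic Levenshtein distance using two rolling rows.
 * Returns (size_t) -1 if memory cannot be allocated.
 */
static size_t
sketch_edit_distance(const char *s, const char *t)
{
	size_t		slen = strlen(s);
	size_t		tlen = strlen(t);
	size_t	   *prev = malloc((tlen + 1) * sizeof(size_t));
	size_t	   *curr = malloc((tlen + 1) * sizeof(size_t));
	size_t		result;

	if (prev == NULL || curr == NULL)
	{
		free(prev);
		free(curr);
		return (size_t) -1;
	}

	for (size_t j = 0; j <= tlen; j++)
		prev[j] = j;			/* distance from the empty prefix of s */

	for (size_t i = 1; i <= slen; i++)
	{
		size_t	   *tmp;

		curr[0] = i;
		for (size_t j = 1; j <= tlen; j++)
		{
			size_t		cost = (s[i - 1] == t[j - 1]) ? 0 : 1;

			curr[j] = sketch_min3(prev[j] + 1,	/* deletion */
								  curr[j - 1] + 1,	/* insertion */
								  prev[j - 1] + cost);	/* substitution */
		}
		tmp = prev;
		prev = curr;
		curr = tmp;
	}

	result = prev[tlen];
	free(prev);
	free(curr);
	return result;
}

/*
 * Return the index of the first candidate with the smallest distance
 * to "name", or -1 if no candidate is within max_distance.
 */
static int
sketch_best_match(const char *name, const char *const *candidates,
				  int ncandidates, size_t max_distance)
{
	int			best = -1;
	size_t		best_distance = 0;

	for (int i = 0; i < ncandidates; i++)
	{
		size_t		d = sketch_edit_distance(name, candidates[i]);

		if (d > max_distance)
			continue;			/* also skips the (size_t) -1 failure value */
		if (best < 0 || d < best_distance)
		{
			best = i;
			best_distance = d;
		}
	}
	return best;
}
```

A caller would pass the unknown name and the visible column names, and emit a hint only when the returned index is not -1.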
Andres Freund	bbfd7edae5	Add macros wrapping all usage of gcc's __attribute__. Until now __attribute__() was defined to be empty for all compilers but gcc. That's problematic because it prevents using it in other compilers; which is necessary e.g. for atomics portability. It's also just generally dubious to do so in a header as widely included as c.h. Instead add pg_attribute_format_arg, pg_attribute_printf, pg_attribute_noreturn macros which are implemented in the compilers that understand them. Also add pg_attribute_noreturn and pg_attribute_packed, but don't provide fallbacks, since they can affect functionality. This means that external code that, possibly unwittingly, relied on __attribute__ defined to be empty on !gcc compilers may now run into warnings or errors on those compilers. But there shouldn't be many occurrences of that and it's hard to work around... Discussion: 54B58BA3.8040302@ohmu.fi Author: Oskari Saarenmaa, with some minor changes by me.	2015-03-11 14:30:01 +01:00
Fujii Masao	57aa5b2bb1	Add GUC to enable compression of full page images stored in WAL. When the newly-added GUC parameter, wal_compression, is on, the PostgreSQL server compresses a full page image written to WAL when full_page_writes is on or during a base backup. A compressed page image will be decompressed during WAL replay. Turning this parameter on can reduce the WAL volume without increasing the risk of unrecoverable data corruption, but at the cost of some extra CPU spent on the compression during WAL logging and on the decompression during WAL replay. This commit changes the WAL format (so bumping WAL version number) so that the one-byte flag indicating whether a full page image is compressed or not is included in its header information. This means that the commit increases the WAL volume by one byte per full page image even if WAL compression is not used at all. We could save that one byte by borrowing one bit from an existing field like hole_offset in the header and using it as the flag, for example. But that would reduce the code readability and the extensibility of the feature. Per discussion, it's not worth paying those prices to save only one byte, so we decided to add the one-byte flag to the header. This commit doesn't introduce any new compression algorithm like lz4. Currently a full page image is compressed using the existing PGLZ algorithm. Per discussion, we decided to use it at least in the first version of the feature because there were no performance reports showing that its compression ratio is unacceptably lower than that of other algorithms. Of course, in the future, it's worth considering support for other compression algorithms for better compression. Rahila Syed and Michael Paquier, reviewed in various versions by myself, Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.	2015-03-11 15:52:24 +09:00
Tom Lane	2fbb286647	Clean up the mess from => patch. Commit `865f14a2d3` was quite a few bricks shy of a load: psql, ecpg, and plpgsql were all left out-of-step with the core lexer. Of these only the last was likely to be a fatal problem; but still, a minimal amount of grepping, or even just reading the comments adjacent to the places that were changed, would have found the other places that needed to be changed.	2015-03-10 11:48:38 -04:00
Alvaro Herrera	e491bd2ee3	Move BRIN page type to page's last two bytes ... which is the usual convention among AMs, so that pg_filedump and similar utilities can tell apart pages of different AMs. It was also the intent of the original code, but I failed to realize that alignment considerations would move the whole thing to the previous-to-last word in the page. The new definition of the associated macro makes surrounding code a bit leaner, too. Per note from Heikki at http://www.postgresql.org/message-id/546A16EF.9070005@vmware.com	2015-03-10 12:27:15 -03:00
Alvaro Herrera	4f3924d9cd	Keep CommitTs module in sync in standby and master We allow this module to be turned off on restarts, so a restart time check is enough to activate or deactivate the module; however, if there is a standby replaying WAL emitted from a master which is restarted, but the standby isn't, the state in the standby becomes inconsistent and can easily be crashed. Fix by activating and deactivating the module during WAL replay on parameter change as well as on system start. Problem reported by Fujii Masao in http://www.postgresql.org/message-id/CAHGQGwFhJ3CnHo1CELEfay18yg_RA-XZT-7D8NuWUoYSZ90r4Q@mail.gmail.com Author: Petr Jelínek	2015-03-09 17:44:00 -03:00
Alvaro Herrera	31eae6028e	Allow CURRENT/SESSION_USER to be used in certain commands Commands such as ALTER USER, ALTER GROUP, ALTER ROLE, GRANT, and the various ALTER OBJECT / OWNER TO, as well as ad-hoc clauses related to roles such as the AUTHORIZATION clause of CREATE SCHEMA, the FOR clause of CREATE USER MAPPING, and the FOR ROLE clause of ALTER DEFAULT PRIVILEGES can now take the keywords CURRENT_USER and SESSION_USER as user specifiers in place of an explicit user name. This commit also fixes some quite ugly handling of special standards- mandated syntax in CREATE USER MAPPING, which in particular would fail to work in presence of a role named "current_user". The special role specifiers PUBLIC and NONE also have more consistent handling now. Also take the opportunity to add location tracking to user specifiers. Authors: Kyotaro Horiguchi. Heavily reworked by Álvaro Herrera. Reviewed by: Rushabh Lathia, Adam Brightwell, Marti Raudsepp.	2015-03-09 15:41:54 -03:00
Heikki Linnakangas	f1fd515b39	Move WAL-related definitions from dbcommands.h to separate header file. This makes it easier to write frontend programs that need to understand the WAL record format of CREATE/DROP DATABASE. dbcommands.h cannot easily be #included in a frontend program, because it pulls in other header files that need backend stuff, but the new dbcommands_xlog.h header file has fewer dependencies.	2015-03-09 15:50:49 +02:00
Fujii Masao	828599acec	Fix typo in comment.	2015-03-09 14:39:46 +09:00
Tom Lane	01cca2c1b1	Remove struct PQArgBlock from server-side header libpq/libpq.h. This struct is purely a client-side artifact. Perhaps there was once reason for the server to know it, but any such reason is lost in the mists of time. We certainly don't need two independent declarations of it.	2015-03-08 13:42:59 -04:00
Tom Lane	90c35a9ed0	Code cleanup for REINDEX DATABASE/SCHEMA/SYSTEM. Fix some minor infelicities. Some of these things were introduced in commit `fe263d115a`, and some are older.	2015-03-08 12:18:43 -04:00
Peter Eisentraut	bb8582abf3	Remove rolcatupdate This role attribute is an ancient PostgreSQL feature, but could only be set by directly updating the system catalogs, and it doesn't have any clearly defined use. Author: Adam Brightwell <adam.brightwell@crunchydatasolutions.com>	2015-03-06 23:42:38 -05:00
Tom Lane	3200b15b20	Remove comment claiming that PARAM_EXTERN Params always have typmod -1. This hasn't been true in quite some time, cf plpgsql's make_datum_param().	2015-03-05 13:16:27 -05:00
Alvaro Herrera	a2e35b53c3	Change many routines to return ObjectAddress rather than OID The changed routines are mostly those that can be directly called by ProcessUtilitySlow; the intention is to make the affected object information more precise, in support for future event trigger changes. Originally it was envisioned that the OID of the affected object would be enough, and in most cases that is correct, but upon actually implementing the event trigger changes it turned out that ObjectAddress is more widely useful. Additionally, some command execution routines grew an output argument that's an object address which provides further info about the executed command. To wit: * for ALTER DOMAIN / ADD CONSTRAINT, it corresponds to the address of the new constraint * for ALTER OBJECT / SET SCHEMA, it corresponds to the address of the schema that originally contained the object. * for ALTER EXTENSION {ADD, DROP} OBJECT, it corresponds to the address of the object added to or dropped from the extension. There's no user-visible change in this commit, and no functional change either. Discussion: 20150218213255.GC6717@tamriel.snowman.net Reviewed-By: Stephen Frost, Andres Freund	2015-03-03 14:10:50 -03:00
Tom Lane	b67f1ce181	Reduce json <=> jsonb casts from explicit-only to assignment level. There's no reason to make users write an explicit cast to store a json value in a jsonb column or vice versa. We could probably even make these implicit, but that might open us up to problems with ambiguous function calls, so for now just do this.	2015-03-03 11:26:04 -05:00
Tom Lane	8abb3cda0d	Use the typcache to cache constraints for domain types. Previously, we cached domain constraints for the life of a query, or really for the life of the FmgrInfo struct that was used to invoke domain_in() or domain_check(). But plpgsql (and probably other places) are set up to cache such FmgrInfos for the whole lifespan of a session, which meant they could be enforcing really stale sets of constraints. On the other hand, searching pg_constraint once per query gets kind of expensive too: testing says that as much as half the runtime of a trivial query such as "SELECT 0::domaintype" went into that. To fix this, delegate the responsibility for tracking a domain's constraints to the typcache, which has the infrastructure needed to detect syscache invalidation events that signal possible changes. This not only removes unnecessary repeat reads of pg_constraint, but ensures that we never apply stale constraint data: whatever we use is the current data according to syscache rules. Unfortunately, the current configuration of the system catalogs means we have to flush cached domain-constraint data whenever either pg_type or pg_constraint changes, which happens rather a lot (eg, creation or deletion of a temp table will do it). It might be worth rearranging things to split pg_constraint into two catalogs, of which the domain constraint one would probably be very low-traffic. That's a job for another patch though, and in any case this patch should improve matters materially even with that handicap. This patch makes use of the recently-added memory context reset callback feature to manage the lifespan of domain constraint caches, so that we don't risk deleting a cache that might be in the midst of evaluation. Although this is a bug fix as well as a performance improvement, no back-patch. 
There haven't been many if any field complaints about stale domain constraint checks, so it doesn't seem worth taking the risk of modifying data structures as basic as MemoryContexts in back branches.	2015-03-01 14:06:55 -05:00
Noah Misch	b8a18ad485	Add transform functions for AT TIME ZONE. This makes "ALTER TABLE tabname ALTER tscol TYPE ... USING tscol AT TIME ZONE 'UTC'" skip rewriting the table when altering from "timestamp" to "timestamptz" or vice versa. While it would be nicer still to optimize this in the absence of the USING clause given timezone==UTC, transform functions must consult IMMUTABLE facts only.	2015-03-01 13:22:34 -05:00
Tom Lane	097fe194aa	Move memory context callback declarations into palloc.h. Initial experience with this feature suggests that instances of MemoryContextCallback are likely to propagate into some widely-used headers over time. As things stood, that would result in pulling memutils.h or at least memnodes.h into common headers, which does not seem desirable. Instead, let's decide that this feature is part of the "ordinary palloc user" API rather than the "specialized context management" API, and as such should be declared in palloc.h not memutils.h.	2015-03-01 12:31:32 -05:00
Tom Lane	eaa5808e8e	Redefine MemoryContextReset() as deleting, not resetting, child contexts. That is, MemoryContextReset() now means what was formerly meant by MemoryContextResetAndDeleteChildren(), and the latter is now just a macro alias for the former. If you really want the functionality that was formerly provided by MemoryContextReset(), what you have to do is MemoryContextResetChildren() plus MemoryContextResetOnly() (which is a new API to reset only the named context and not touch its children). The reason for this change is that near fifteen years of experience has proven that there is noplace where old-style MemoryContextReset() is actually what you want. Making that the default behavior has led to lots of context-leakage bugs, while we've not found anyplace where it's actually necessary to keep the child contexts; at least the standard regression tests do not reveal anyplace where this change breaks anything. And there are upcoming patches that will introduce additional reasons why child contexts need to be removed. We could change existing calls of MemoryContextResetAndDeleteChildren to be just MemoryContextReset, but for the moment I'll leave them alone; they're not costing anything.	2015-02-27 18:10:04 -05:00
Tom Lane	f65e827058	Invent a memory context reset/delete callback mechanism. This allows cleanup actions to be registered to be called just before a particular memory context's contents are flushed (either by deletion or MemoryContextReset). The patch in itself has no use-cases for this, but several likely reasons for wanting this exist. In passing, per discussion, rearrange some boolean fields in struct MemoryContextData so as to avoid wasted padding space. For safety, this requires making allowInCritSection's existence unconditional; but I think that's a better approach than what was there anyway.	2015-02-27 17:16:43 -05:00
Tom Lane	d809fd0008	Improve parser's one-extra-token lookahead mechanism. There are a couple of places in our grammar that fail to be strict LALR(1), by requiring more than a single token of lookahead to decide what to do. Up to now we've dealt with that by using a filter between the lexer and parser that merges adjacent tokens into one in the places where two tokens of lookahead are necessary. But that creates a number of user-visible anomalies, for instance that you can't name a CTE "ordinality" because "WITH ordinality AS ..." triggers folding of WITH and ORDINALITY into one token. I realized that there's a better way. In this patch, we still do the lookahead basically as before, but we never merge the second token into the first; we replace just the first token by a special lookahead symbol when one of the lookahead pairs is seen. This requires a couple extra productions in the grammar, but it involves fewer special tokens, so that the grammar tables come out a bit smaller than before. The filter logic is no slower than before, perhaps a bit faster. I also fixed the filter logic so that when backing up after a lookahead, the current token's terminator is correctly restored; this eliminates some weird behavior in error message issuance, as is shown by the one change in existing regression test outputs. I believe that this patch entirely eliminates odd behaviors caused by lookahead for WITH. It doesn't really improve the situation for NULLS followed by FIRST/LAST unfortunately: those sequences still act like a reserved word, even though there are cases where they should be seen as two ordinary identifiers, eg "SELECT nulls first FROM ...". I experimented with additional grammar hacks but couldn't find any simple solution for that. Still, this is better than before, and it seems much more likely that we could somehow solve the NULLS case on the basis of this filter behavior than the previous one.	2015-02-24 17:53:45 -05:00
Peter Eisentraut	23a78352c0	Error when creating names too long for tar format The tar format (at least the version we are using) does not support file names or symlink targets longer than 99 bytes. Until now, the tar creation code would silently truncate any names that are too long. (Its original application was pg_dump, where this never happens.) This creates problems when running base backups over the replication protocol. The most important problem is when a tablespace path is longer than 99 bytes, which will result in a truncated tablespace path being backed up. Less importantly, the basebackup protocol also promises to back up any other files it happens to find in the data directory, which would also lead to file name truncation if someone put a file with a long name in there. Now both of these cases result in an error during the backup. Add tests that fail when a too-long file name or symlink is attempted to be backed up. Reviewed-by: Robert Haas <robertmhaas@gmail.com>	2015-02-24 13:41:07 -05:00
Tom Lane	56be925e4b	Further tweaking of raw grammar output to distinguish different inputs. Use a different A_Expr_Kind for LIKE/ILIKE/SIMILAR TO constructs, so that they can be distinguished from direct invocation of the underlying operators. Also, postpone selection of the operator name when transforming "x IN (select)" to "x = ANY (select)", so that those syntaxes can be told apart at parse analysis time. I had originally thought I'd also have to do something special for the syntaxes IS NOT DISTINCT FROM, IS NOT DOCUMENT, and x NOT IN (SELECT...), which the grammar translates as though they were NOT (construct). On reflection though, we can distinguish those cases reliably by noting whether the parse location shown for the NOT is the same as for its child node. This only requires tweaking the parse locations for NOT IN, which I've done here. These changes should have no effect outside the parser; they're just in support of being able to give accurate warnings for planned operator precedence changes.	2015-02-23 12:46:50 -05:00
Alvaro Herrera	296f3a6053	Support more commands in event triggers COMMENT, SECURITY LABEL, and GRANT/REVOKE now also fire ddl_command_start and ddl_command_end event triggers, when they operate on database-local objects. Reviewed-By: Michael Paquier, Andres Freund, Stephen Frost	2015-02-23 14:22:42 -03:00
Heikki Linnakangas	88e9823026	Replace checkpoint_segments with min_wal_size and max_wal_size. Instead of having a single knob (checkpoint_segments) that both triggers checkpoints and determines how many segments to recycle, these are now separate concerns. There is still an internal variable called CheckpointSegments, which triggers checkpoints. But it no longer determines how many segments to recycle at a checkpoint. That is now auto-tuned by keeping a moving average of the distance between checkpoints (in bytes), and trying to keep that many segments in reserve. The advantage of this is that you can set max_wal_size very high, but the system won't actually consume that much space if there isn't any need for it. The min_wal_size sets a floor for that; you can effectively disable the auto-tuning behavior by setting min_wal_size equal to max_wal_size. The max_wal_size setting is now the actual target size of WAL at which a new checkpoint is triggered, instead of the distance between checkpoints. Previously, you could calculate the actual WAL usage with the formula "(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this patch, you set the desired WAL usage with max_wal_size, and the system calculates the appropriate CheckpointSegments with the reverse of that formula. That's a lot more intuitive for administrators to set. Reviewed by Amit Kapila and Venkata Balaji N.	2015-02-23 18:53:02 +02:00
Heikki Linnakangas	0fec000365	Renumber GUC_* constants. This moves all the regular flags back together (for aesthetic reasons), and makes room for more GUC_UNIT_* types.	2015-02-23 18:33:16 +02:00
Heikki Linnakangas	1b63026473	Refactor unit conversions code in guc.c. Replace the if-switch-case constructs with two conversion tables, containing all the supported conversions between human-readable unit strings and the base units used in GUC variables. This makes the code easier to read, and makes adding new units simpler.	2015-02-23 18:06:16 +02:00
Fujii Masao	5d2b45e3f7	Add GUC to control the time to wait before retrieving WAL after failed attempt. Previously when the standby server failed to retrieve WAL files from any source (i.e., streaming replication, local pg_xlog directory or WAL archive), it always waited for five seconds (hard-coded) before the next attempt. This is problematic, for example, in warm standby, because restore_command can fail every five seconds and flood the log files with its error messages even while a new WAL file is expected to be unavailable for a long time. This commit adds a new parameter, wal_retrieve_retry_interval, to control that wait time. Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.	2015-02-23 20:55:17 +09:00
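
Commit 88e9823026 above quotes the old WAL usage formula, "(2 + checkpoint_completion_target) * checkpoint_segments + 1", and says the system now applies it in reverse. The following is a minimal sketch of that arithmetic only, assuming sizes are counted in WAL segments. The function names are hypothetical, and the inversion is the plain algebraic one, clamped to at least one segment; it is not taken from the PostgreSQL source and may differ from what the server actually computes.

```python
import math


def wal_usage_segments(checkpoint_segments: int, completion_target: float) -> float:
    """Old-style estimate of WAL usage (in segments), per the formula quoted in the commit."""
    return (2 + completion_target) * checkpoint_segments + 1


def segments_for_target(max_wal_segments: float, completion_target: float) -> int:
    """Algebraic inverse of the formula above (an illustration, not server code).

    Rounds down and never returns less than one segment.
    """
    segments = (max_wal_segments - 1) / (2 + completion_target)
    return max(1, math.floor(segments))


if __name__ == "__main__":
    usage = wal_usage_segments(3, 0.5)
    print(usage)
    print(segments_for_target(usage, 0.5))
```

Under these assumptions, feeding the output of the first function into the second recovers the original checkpoint_segments value whenever the division is exact.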

7322 Commits