postgresql

Commit Graph

Author	SHA1	Message	Date
Robert Haas	9c75e1a36b	Forbid parallel Hash Right Join or Hash Full Join. That won't work. You'll get bogus null-extended rows. Mithun Cy	2016-04-20 17:48:55 -04:00
Tom Lane	bde361fef5	Fix memory leak and other bugs in ginPlaceToPage() & subroutines. Commit `36a35c550a` turned the interface between ginPlaceToPage and its subroutines in gindatapage.c and ginentrypage.c into a royal mess: page-update critical sections were started in one place and finished in another place not even in the same file, and the very same subroutine might return having started a critical section or not. Subsequent patches band-aided over some of the problems with this design by making things even messier. One user-visible resulting problem is memory leaks caused by the need for the subroutines to allocate storage that would survive until ginPlaceToPage calls XLogInsert (as reported by Julien Rouhaud). This would not typically be noticeable during retail index updates. It could be visible in a GIN index build, in the form of memory consumption swelling to several times the commanded maintenance_work_mem. Another rather nasty problem is that in the internal-page-splitting code path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before entering the critical section that it's supposed to be cleared in; a failure in between would leave the index in a corrupt state. There were also assorted coding-rule violations with little immediate consequence but possible long-term hazards, such as beginning an XLogInsert sequence before entering a critical section, or calling elog(DEBUG) inside a critical section. To fix, redefine the API between ginPlaceToPage() and its subroutines by splitting the subroutines into two parts. The "beginPlaceToPage" subroutine does what can be done outside a critical section, including full computation of the result pages into temporary storage when we're going to split the target page. The "execPlaceToPage" subroutine is called within a critical section established by ginPlaceToPage(), and it handles the actual page update in the non-split code path. The critical section, as well as the XLOG insertion call sequence, are both now always started and finished in ginPlaceToPage(). Also, make ginPlaceToPage() create and work in a short-lived memory context to eliminate the leakage problem. (Since a short-lived memory context had been getting created in the most common code path in the subroutines, this shouldn't cause any noticeable performance penalty; we're just moving the overhead up one call level.) In passing, fix a bunch of comments that had gone unmaintained throughout all this klugery. Report: <571276DD.5050303@dalibo.com>	2016-04-20 14:25:15 -04:00
Kevin Grittner	a343e223a5	Revert no-op changes to BufferGetPage() The reverted changes were intended to force a choice of whether any newly-added BufferGetPage() calls needed to be accompanied by a test of the snapshot age, to support the "snapshot too old" feature. Such an accompanying test is needed in about 7% of the cases, where the page is being used as part of a scan rather than positioning for other purposes (such as DML or vacuuming). The additional effort required for back-patching, and the doubt whether the intended benefit would really be there, have indicated it is best just to rely on developers to do the right thing based on comments and existing usage, as we do with many other conventions. This change should have little or no effect on generated executable code. Motivated by the back-patching pain of Tom Lane and Robert Haas	2016-04-20 08:31:19 -05:00
Tom Lane	a0382e2d7e	Make partition-lock-release coding more transparent in BufferAlloc(). Coverity complained that oldPartitionLock was possibly dereferenced after having been set to NULL. That actually can't happen, because we'd only use it if (oldFlags & BM_TAG_VALID) is true. But nonetheless Coverity is justified in complaining, because at line 1275 we actually overwrite oldFlags, and then still expect its BM_TAG_VALID bit to be a safe guide to whether to release the oldPartitionLock. Thus, the code would be incorrect if someone else had changed the buffer's BM_TAG_VALID flag meanwhile. That should not happen, since we hold pin on the buffer throughout this sequence, but it's starting to look like a rather shaky chain of logic. And there's no need for such assumptions, because we can simply replace the (oldFlags & BM_TAG_VALID) tests with (oldPartitionLock != NULL), which has identical results and makes it plain to all comers that we don't dereference a null pointer. A small side benefit is that the range of liveness of oldFlags is greatly reduced, possibly allowing the compiler to save a register. This is just cleanup, not an actual bug fix, so there seems no need for a back-patch.	2016-04-18 18:05:56 -04:00
Tom Lane	4039c736eb	Adjust spin.c's spinlock emulation so that 0 is not a valid spinlock value. We've had repeated troubles over the years with failures to initialize spinlocks correctly; see `6b93fcd14` for a recent example. Most of the time, on most platforms, such oversights can escape notice because all-zeroes is the expected initial content of an slock_t variable. The only platform we have where the initialized state of an slock_t isn't zeroes is HPPA, and that's practically gone in the wild. To make it easier to catch such errors without needing one of those, adjust the --disable-spinlocks code so that zero is not a valid value for an slock_t for it. In passing, remove a bunch of unnecessary #include's from spin.c; commit `daa7527afc` removed all the intermodule coupling that made them necessary.	2016-04-16 19:53:38 -04:00
Tom Lane	c34df8a003	Disallow creation of indexes on system columns (except for OID). Although OID acts pretty much like user data, the other system columns do not, so an index on one would likely misbehave. And it's pretty hard to see a use-case for one, anyway. Let's just forbid the case rather than worry about whether it should be supported. David Rowley	2016-04-16 12:11:41 -04:00
Stephen Frost	99f2f3c19a	In recordExtensionInitPriv(), keep the scan til we're done with it For reasons of sheer brain fade, we (I) was calling systable_endscan() immediately after systable_getnext() and expecting the tuple returned by systable_getnext() to still be valid. That's clearly wrong. Move the systable_endscan() down below the tuple usage. Discovered initially by Pavel Stehule and then also by Alvaro. Add a regression test based on Alvaro's testing.	2016-04-15 21:57:15 -04:00
Tom Lane	8f1911d5e6	Fix possible crash in ALTER TABLE ... REPLICA IDENTITY USING INDEX. Careless coding added by commit `07cacba983` could result in a crash or a bizarre error message if someone tried to select an index on the OID column as the replica identity index for a table. Back-patch to 9.4 where the feature was introduced. Discussion: CAKJS1f8TQYgTRDyF1_u9PVCKWRWz+DkieH=U7954HeHVPJKaKg@mail.gmail.com David Rowley	2016-04-15 12:11:40 -04:00
Robert Haas	5702277ca9	Tweak EXPLAIN for parallel query to show workers launched. The previous display was sort of confusing, because it didn't distinguish between the number of workers that we planned to launch and the number that actually got launched. This has already confused several people, so display both numbers and label them clearly. Julien Rouhaud, reviewed by me.	2016-04-15 11:52:18 -04:00
Tom Lane	6b85d4ba9b	Fix portability problem induced by commit `a6f6b7819`. pg_xlogdump includes bufmgr.h. With a compiler that emits code for static inline functions even when they're unreferenced, that leads to unresolved external references in the new static-inline version of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done for similar issues elsewhere. Per buildfarm member pademelon.	2016-04-15 10:44:28 -04:00
Magnus Hagander	ba8fe38f58	Fix typo in comment	2016-04-15 13:32:54 +02:00
Tom Lane	f0e766bd7f	Fix memory leak in GIN index scans. The code had a query-lifespan memory leak when encountering GIN entries that have posting lists (rather than posting trees, ie, there are a relatively small number of heap tuples containing this index key value). With a suitable data distribution this could add up to a lot of leakage. Problem seems to have been introduced by commit `36a35c550`, so back-patch to 9.4. Julien Rouhaud	2016-04-15 00:02:26 -04:00
Andres Freund	4b74c6a40e	Make init_spin_delay() C89 compliant #2 . My previous attempt at doing so, in `80abbeba23`, was not sufficient. While that fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant expressions in the struct initializer, because the file/line/function information comes from the caller of s_lock(). Give up on using a macro, and use a static inline instead. Discussion: 4369.1460435533@sss.pgh.pa.us	2016-04-14 19:26:13 -07:00
Andres Freund	533cd2303a	Remove trailing commas in enums. These aren't valid C89. Found thanks to gcc's -Wc90-c99-compat. These exist in differing places in most supported branches.	2016-04-14 19:25:16 -07:00
Andres Freund	7b16781228	Fix trivial typo.	2016-04-14 19:25:16 -07:00
Tom Lane	6a3d3965d6	Fix core dump in ReorderBufferRestoreChange on alignment-picky platforms. When re-reading an update involving both an old tuple and a new tuple from disk, reorderbuffer.c was careless about whether the new tuple is suitably aligned for direct access --- in general, it isn't. We'd missed seeing this in the buildfarm because the contrib/test_decoding tests exercise this code path only a few times, and by chance all of those cases have old tuples with length a multiple of 4, which is usually enough to make the access to the new tuple's t_len safe. For some still-not-entirely-clear reason, however, Debian's sparc build gets a bus error, as reported by Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer? The lack of previous field reports is probably because you need all of these conditions to trigger a crash: an alignment-picky platform (not Intel), a transaction large enough to spill to disk, an update within that xact that changes a primary-key field and has an odd-length old tuple, and of course logical decoding tracing the transaction. Avoid the alignment assumption by using memcpy instead of fetching t_len directly, and add a test case that exposes the crash on picky platforms. Back-patch to 9.4 where the bug was introduced. Discussion: <20160413094117.GC21485@msg.credativ.de>	2016-04-14 19:42:21 -04:00
Tom Lane	c2dc194bdb	Adjust signature of walrcv_receive hook. Commit `314cbfc5da` redefined the signature of this hook as typedef int (walrcv_receive_type) (char buffer, int wait_fd); But in fact the type of the "wait_fd" variable ought to be pgsocket, which is what WaitLatchOrSocket expects, and which is necessary if we want to be able to assign PGINVALID_SOCKET to it on Windows. So fix that.	2016-04-14 13:49:37 -04:00
Tom Lane	994f112573	Adjust datatype of ReplicationState.acquired_by. It was declared as "pid_t", which would be fine except that none of the places that printed it in error messages took any thought for the possibility that it's not equivalent to "int". This leads to warnings on some buildfarm members, and could possibly lead to actually wrong error messages on those platforms. There doesn't seem to be any very good reason not to just make it "int"; it's only ever assigned from MyProcPid, which is int. If we want to cope with PIDs that are wider than int, this is not the place to start. Also, fix the comment, which seems to perhaps be a leftover from a time when the field was only a bool? Per buildfarm. Back-patch to 9.5 which has same issue.	2016-04-14 12:18:09 -04:00
Tom Lane	22989a8e34	Fix prototype of pgwin32_bind(). I (tgl) had copied-and-pasted this from pgwin32_accept(), failing to notice that the third parameter should be "int" not "int *". David Rowley	2016-04-14 09:44:21 -04:00
Tom Lane	92a30a7eb0	Fix broken dependency-mongering for index operator classes/families. For a long time, opclasscmds.c explained that "we do not create a dependency link to the AM [for an opclass or opfamily], because we don't currently support DROP ACCESS METHOD". Commit `473b932870` invented DROP ACCESS METHOD, but it batted only 1 for 2 on adding the dependency links, and 0 for 2 on updating the comments about the topic. In passing, undo the same commit's entirely inappropriate decision to blow away an existing index as a side-effect of create_am.sql.	2016-04-13 23:33:31 -04:00
Stephen Frost	bfed4ab824	Disallow SET SESSION AUTHORIZATION pg_* As part of reserving the pg_* namespace for default roles and in line with SET ROLE and other previous efforts, disallow settings the role to a default/reserved role using SET SESSION AUTHORIZATION. These checks and restrictions on what is allowed regarding default / reserved roles are under debate, but it seems prudent to ensure that the existing checks at least cover the intended cases while the debate rages on. On me to clean it up if the consensus decision is to remove these checks.	2016-04-13 21:31:24 -04:00
Andres Freund	be65eddd80	Add required database and origin filtering for logical messages. Logical messages, added in `3fe3511d05`, during decoding failed to filter messages emitted in other databases and messages emitted "under" a replication origin the output plugin isn't interested in. Add tests to verify that both types of filtering actually work. While touching message.sql remove hunk obsoleted by `d25379e`. Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because `3fe3511d05` had omitted doing so. `3fe3511d05` additionally didn't bump catversion, but `7a542700d` has done so since. Author: Petr Jelinek Reported-By: Andres Freund Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de	2016-04-13 17:38:54 -07:00
Andres Freund	80abbeba23	Make init_spin_delay() C89 compliant and change stuck spinlock reporting. The current definition of init_spin_delay (introduced recently in `48354581a`) wasn't C89 compliant. It's not legal to refer to refer to non-constant expressions, and the ptr argument was one. This, as reported by Tom, lead to a failure on buildfarm animal pademelon. The pointer, especially on system systems with ASLR, isn't super helpful anyway, though. So instead of making init_spin_delay into an inline function, make s_lock_stuck() report the function name in addition to file:line and change init_spin_delay() accordingly. While not a direct replacement, the function name is likely more useful anyway (line numbers are often hard to interpret in third party reports). This also fixes what file/line number is reported for waits via s_lock(). As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h. Reported-By: Tom Lane Discussion: 4369.1460435533@sss.pgh.pa.us	2016-04-13 17:00:53 -07:00
Andres Freund	6b93fcd149	Avoid atomic operation in MarkLocalBufferDirty(). The recent patch to make Pin/UnpinBuffer lockfree in the hot path (`48354581a`), accidentally used pg_atomic_fetch_or_u32() in MarkLocalBufferDirty(). Other code operating on local buffers was careful to only use pg_atomic_read/write_u32 which just read/write from memory; to avoid unnecessary overhead. On its own that'd just make MarkLocalBufferDirty() slightly less efficient, but in addition InitLocalBuffers() doesn't call pg_atomic_init_u32() - thus the spinlock fallback for the atomic operations isn't initialized. That in turn caused, as reported by Tom, buildfarm animal gaur to fail. As those errors are actually useful against this type of error, continue to omit - intentionally this time - initialization of the atomic variable. In addition, add an explicit note about only using pg_atomic_read/write on local buffers's state to BufferDesc's description. Reported-By: Tom Lane Discussion: 1881.1460431476@sss.pgh.pa.us	2016-04-13 15:28:29 -07:00
Tom Lane	95ef43c430	Widen amount-to-flush arguments of FileWriteback and callers. It's silly to define these counts as narrower than they might someday need to be. Also, I believe that the BLCKSZ * nflush calculation in mdwriteback was capable of overflowing an int.	2016-04-13 18:12:06 -04:00
Tom Lane	fa11a09fed	Fix assorted portability issues with using msync() for data flushing. Commit `428b1d6b29` introduced the use of msync() for flushing dirty data from the kernel's file buffers. Several portability issues were overlooked, though: * Not all implementations of mmap() think that nbytes == 0 means "map the whole file". To fix, use lseek() to find out the true length. Fix callers of pg_flush_data to be aware that nbytes == 0 may result in trashing the file's seek position. * Not all implementations of mmap() will accept partial-page mmap requests. To fix, round down the length request to whatever sysconf() says the page size is. (I think this is OK from a portability standpoint, because sysconf() is required by SUS v2, and we aren't trying to compile this part on Windows anyway. Buildfarm should let us know if not.) * On 32-bit machines, the file size might exceed the available free address space, or even exceed what will fit in size_t. Check for the latter explicitly to avoid passing a false request size to mmap(). If mmap fails, silently fall through to the next implementation method, rather than bleating to the postmaster log and giving up. * mmap'ing directories fails on some platforms, and even if it works, msync'ing the directory is quite unlikely to help, as for that matter are the other flush implementations. In pre_sync_fname(), just skip flush attempts on directories. In passing, copy-edit the comments a bit. Stas Kelvich and myself	2016-04-13 17:17:51 -04:00
Robert Haas	cbb2a812d7	Use PG_INT32_MIN instead of reiterating the constant. Makes no difference, but it's cleaner this way. Michael Paquier	2016-04-13 07:54:45 -04:00
Tom Lane	d1b7d4877b	Provide errno-translation wrappers around bind() and listen() on Windows. I've seen one too many "could not bind IPv4 socket: No error" log entries from the Windows buildfarm members. Per previous discussion, this is likely caused by the fact that we're doing nothing to translate WSAGetLastError() to errno. Put in a wrapper layer to do that. If this works as expected, it should get back-patched, but let's see what happens in the buildfarm first. Discussion: <4065.1452450340@sss.pgh.pa.us>	2016-04-12 19:52:21 -04:00
Robert Haas	deb71fa971	Fix costing for parallel aggregation. The original patch kind of ignored the fact that we were doing something different from a costing point of view, but nobody noticed. This patch fixes that oversight. David Rowley	2016-04-12 16:25:55 -04:00
Fujii Masao	46d73e0d65	Remove unused function GetOldestWALSendPointer from walsender code. That unused function was introduced as a sample because synchronous replication or replication monitoring tools might need it in the future. Recently commit `989be08` added the function SyncRepGetOldestSyncRecPtr which provides almost the same functionality for multiple synchronous standbys feature. So it's time to remove that unused sample function. This commit does that.	2016-04-13 04:36:29 +09:00
Tom Lane	f1f01de145	Redefine create_upper_paths_hook as being invoked once per upper relation. Per discussion, this gives potential users of the hook more flexibility, because they can build custom Paths that implement only one stage of upper processing atop core-provided Paths for earlier stages.	2016-04-12 15:23:14 -04:00
Kevin Grittner	2201d801b0	Avoid extra locks in GetSnapshotData if old_snapshot_threshold < 0 On a big NUMA machine with 1000 connections in saturation load there was a performance regression due to spinlock contention, for acquiring values which were never used. Just fill with dummy values if we're not going to use them. This patch has not been benchmarked yet on a big NUMA machine, but it seems like a good idea on general principle, and it seemed to prevent an apparent 2.2% regression on a single-socket i7 box running 200 connections at saturation load.	2016-04-12 11:48:02 -05:00
Tom Lane	5713f03973	Improve API of GenericXLogRegister(). Rename this function to GenericXLogRegisterBuffer() to make it clearer what it does, and leave room for other sorts of "register" actions in future. Also, replace its "bool isNew" argument with an integer flags argument, so as to allow adding more flags in future without an API break. Alexander Korotkov, adjusted slightly by me	2016-04-12 11:42:06 -04:00
Tom Lane	bdf7db8192	In generic WAL application and replay, ensure page "hole" is always zero. The previous coding could allow the contents of the "hole" between pd_lower and pd_upper to diverge during replay from what it had been when the update was originally applied. This would pose a problem if checksums were in use, and in any case would complicate forensic comparisons between master and slave servers. So force the "hole" to contain zeroes, both at initial application of a generically-logged action, and at replay. Alexander Korotkov, adjusted slightly by me	2016-04-12 11:14:00 -04:00
Stephen Frost	cd13471f2e	Correct copyright for newly added genericdesc.c It's 2016 these days (no, not entirely sure how we got here either). Pointed out by Amit Langote	2016-04-12 08:45:09 -04:00
Peter Eisentraut	70715e6a60	Fix whitespace	2016-04-11 20:59:04 -04:00
Tom Lane	39c283e498	Fix _SPI_execute_plan() for CREATE TABLE IF NOT EXISTS foo AS ... When IF NOT EXISTS was added to CREATE TABLE AS, this logic didn't get the memo, possibly resulting in an Assert failure. It looks like there would have been no ill effects in a non-Assert build, though. Back-patch to 9.5 where the IF NOT EXISTS option was added. Stas Kelvich	2016-04-11 20:07:17 -04:00
Kevin Grittner	a6f6b78196	Use static inline function for BufferGetPage() I was initially concerned that the some of the hundreds of references to BufferGetPage() where the literal BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a macro, leading to some hard-to-find performance regressions in corner cases. Inspection of disassembled code has shown identical code at all inspected locations, and the size difference doesn't amount to even one byte per such call. So make it readable. Per gripes from Álvaro Herrera and Tom Lane	2016-04-11 16:47:50 -05:00
Kevin Grittner	80647bf65a	Make oldSnapshotControl a pointer to a volatile structure It was incorrectly declared as a volatile pointer to a non-volatile structure. Eliminate the OldSnapshotControl struct definition; it is really not needed. Pointed out by Tom Lane. While at it, add OldSnapshotControlData to pgindent's list of structures.	2016-04-11 15:43:52 -05:00
Peter Eisentraut	d8ed83cd7f	Fix whitespace	2016-04-11 14:44:51 -04:00
Fujii Masao	0038c1e218	Use ereport(ERROR) instead of Assert() to emit syncrep_parser error. The existing code would either Assert or generate an invalid SyncRepConfig variable, neither of which is desirable. A regular error should be thrown instead. This commit silences compiler warning in non assertion-enabled builds. Per report from Jeff Janes. Suggested fix by Tom Lane.	2016-04-11 15:52:27 +09:00
Tom Lane	1630f5b92a	Add comment about intentional fallthrough in switch. Coverity complained about an apparent missing "break" in a switch added by `bb140506df`. The human-readable comments are pretty clear that this is intentional, but add a standard /* FALL THRU */ comment to make it clear to tools too.	2016-04-10 23:52:34 -04:00
Tom Lane	5306df2831	Clean up foreign-key caching code in planner. Coverity complained that the code added by `015e88942a` lacked an error check for SearchSysCache1 failures, which it should have. But the code was pretty duff in other ways too, including failure to think about whether it could really cope with arrays of different lengths.	2016-04-10 23:47:30 -04:00
Andres Freund	008608b9d5	Avoid the use of a separate spinlock to protect a LWLock's wait queue. Previously we used a spinlock, in adition to the atomically manipulated ->state field, to protect the wait queue. But it's pretty simple to instead perform the locking using a flag in state. Due to `6150a1b0` BufferDescs, on platforms (like PPC) with > 1 byte spinlocks, increased their size above 64byte. As 64 bytes are the size we pad allocated BufferDescs to, this can increase false sharing; causing performance problems in turn. Together with the previous commit this reduces the size to <= 64 bytes on all common platforms. Author: Andres Freund Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com 20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de	2016-04-10 20:12:32 -07:00
Andres Freund	48354581a4	Allow Pin/UnpinBuffer to operate in a lockfree manner. Pinning/Unpinning a buffer is a very frequent operation; especially in read-mostly cache resident workloads. Benchmarking shows that in various scenarios the spinlock protecting a buffer header's state becomes a significant bottleneck. The problem can be reproduced with pgbench -S on larger machines, but can be considerably worse for queries which touch the same buffers over and over at a high frequency (e.g. nested loops over a small inner table). To allow atomic operations to be used, cram BufferDesc's flags, usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable; that allows to manipulate them together using 32bit compare-and-swap operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could be lifted by using a 64bit field, but it's not a realistic configuration atm). As not all operations can easily implemented in a lockfree manner, implement the previous buf_hdr_lock via a flag bit in the atomic variable. That way we can continue to lock the header in places where it's needed, but can get away without acquiring it in the more frequent hot-paths. There's some additional operations which can be done without the lock, but aren't in this patch; but the most important places are covered. As bufmgr.c now essentially re-implements spinlocks, abstract the delay logic from s_lock.c into something more generic. It now has already two users, and more are coming up; there's a follupw patch for lwlock.c at least. This patch is based on a proof-of-concept written by me, which Alexander Korotkov made into a fully working patch; the committed version is again revised by me. Benchmarking and testing has, amongst others, been provided by Dilip Kumar, Alexander Korotkov, Robert Haas. On a large x86 system improvements for readonly pgbench, with a high client count, of a factor of 8 have been observed. Author: Alexander Korotkov and Andres Freund Discussion: 2400449.GjM57CE0Yg@dinodell	2016-04-10 20:12:32 -07:00
Alvaro Herrera	bd905a0d04	Fix possible NULL dereference in ExecAlterObjectDependsStmt I used the wrong variable here. Doesn't make a difference today because the only plausible caller passes a non-NULL variable, but someday it will be wrong, and even today's correctness is subtle: the caller that does pass a NULL is never invoked because of object type constraints. Surely not a condition to rely on. Noted by Coverity	2016-04-10 11:03:35 -03:00
Tom Lane	660d5fb856	Further minor improvement in generic_xlog.c: always say REGBUF_STANDARD. Since we're requiring pages handled by generic_xlog.c to be standard format, specify REGBUF_STANDARD when doing a full-page image, so that xloginsert.c can compress out the "hole" between pd_lower and pd_upper. Given the current API in which this path will be taken only for a newly initialized page, the hole is likely to be particularly large in such cases, so that this oversight could easily be performance-significant. I don't notice any particular change in the runtime of contrib/bloom's regression test, though.	2016-04-10 00:24:28 -04:00
Tom Lane	68689c66ef	Micro-optimize GenericXLogFinish(). Make the inner comparison loops of computeDelta() as tight as possible by pulling considerations of valid and invalid ranges out of the inner loops, and extending a match or non-match detection as far as possible before deciding what to do next. To keep this tractable, give up the possibility of merging fragments across the pd_lower to pd_upper gap. The fraction of pages where that could happen (ie, there are 4 or fewer bytes in the gap, and data changes immediately adjacent to it on both sides) is too small to be worth spending cycles on. Also, avoid two BLCKSZ-length memcpy()s by computing the delta before moving data into the target buffer, instead of after. This doesn't save nearly as many cycles as being tenser about computeDelta(), but it still seems worth doing. On my machine, this patch cuts a full 40% off the runtime of contrib/bloom's regression test.	2016-04-09 19:30:56 -04:00
Tom Lane	08e785436f	Get rid of GenericXLogUnregister(). This routine is unsafe as implemented, because it invalidates the page image pointers returned by previous GenericXLogRegister() calls. Rather than complicate the API or the implementation to avoid that, let's just get rid of it; the use-case for having it seems much too thin to justify a lot of work here. While at it, do some wordsmithing on the SGML docs for generic WAL.	2016-04-09 16:39:30 -04:00
Tom Lane	db03cf375d	Code review/prettification for generic_xlog.c. Improve commentary, use more specific names for the delta fields, const-ify pointer arguments where possible, avoid assuming that initializing only the first element of a local array will guarantee that the remaining elements end up as we need them. (I think that code in generic_redo actually worked, but only because InvalidBuffer is zero; this is a particularly ugly way of depending on that ...)	2016-04-09 15:02:19 -04:00
Tom Lane	2dd318d277	Run pgindent on generic_xlog.c. This code desperately needs some micro-optimization, and I'd like it to be formatted a bit more nicely while I work on it.	2016-04-09 13:33:33 -04:00
Andres Freund	c1ddd2361f	Expose more out/readfuncs support functions. Previously `bcac23d` exposed a subset of support functions, namely the ones Kaigai found useful. In 20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de I mentioned that there's some functions missing to use the facility in an external project. To avoid having to add functions piecemeal, add all the functions which are used to define READ_* and WRITE_* macros; users of the extensible node functionality are likely to need these. Additionally expose outDatum(), which doesn't have it's own WRITE_ macro, as it needs information from the embedding struct. Discussion: 20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de	2016-04-08 14:26:36 -07:00
Stephen Frost	7a542700df	Create default roles This creates an initial set of default roles which administrators may use to grant access to, historically, superuser-only functions. Using these roles instead of granting superuser access reduces the number of superuser roles required for a system. Documention for each of the default roles has been added to user-manag.sgml. Bump catversion to 201604082, as we had a commit that bumped it to 201604081 and another that set it back to 201604071... Reviews by José Luis Tallón and Robert Haas	2016-04-08 16:56:27 -04:00
Stephen Frost	293007898d	Reserve the "pg_" namespace for roles This will prevent users from creating roles which begin with "pg_" and will check for those roles before allowing an upgrade using pg_upgrade. This will allow for default roles to be provided at initdb time. Reviews by José Luis Tallón and Robert Haas	2016-04-08 16:56:27 -04:00
Kevin Grittner	848ef42bb8	Add the "snapshot too old" feature This feature is controlled by a new old_snapshot_threshold GUC. A value of -1 disables the feature, and that is the default. The value of 0 is just intended for testing. Above that it is the number of minutes a snapshot can reach before pruning and vacuum are allowed to remove dead tuples which the snapshot would otherwise protect. The xmin associated with a transaction ID does still protect dead tuples. A connection which is using an "old" snapshot does not get an error unless it accesses a page modified recently enough that it might not be able to produce accurate results. This is similar to the Oracle feature, and we use the same SQLSTATE and error message for compatibility.	2016-04-08 14:36:30 -05:00
Kevin Grittner	8b65cf4c5e	Modify BufferGetPage() to prepare for "snapshot too old" feature This patch is a no-op patch which is intended to reduce the chances of failures of omission once the functional part of the "snapshot too old" patch goes in. It adds parameters for snapshot, relation, and an enum to specify whether the snapshot age check needs to be done for the page at this point. This initial patch passes NULL for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the third. The follow-on patch will change the places where the test needs to be made.	2016-04-08 14:30:10 -05:00
Teodor Sigaev	8b99edefca	Revert CREATE INDEX ... INCLUDING ... It's not ready yet, revert two commits `690c543550` - unstable test output `386e3d7609` - patch itself	2016-04-08 21:52:13 +03:00
Magnus Hagander	35e2e357cb	Add authentication parameters compat_realm and upn_usename for SSPI These parameters are available for SSPI authentication only, to make it possible to make it behave more like "normal gssapi", while making it possible to maintain compatibility. compat_realm is on by default, but can be turned off to make the authentication use the full Kerberos realm instead of the NetBIOS name. upn_username is off by default, and can be turned on to return the users Kerberos UPN rather than the SAM-compatible name (a user in Active Directory can have both a legacy SAM-compatible username and a new Kerberos one. Normally they are the same, but not always) Author: Christian Ullrich Reviewed by: Robbie Harwood, Alvaro Herrera, me	2016-04-08 20:28:38 +02:00
Teodor Sigaev	cb0c8cbf31	Fix possible use of uninitialised value in ts_headline() Found during investigation of failure of skink buildfarm member and its valgrind report. Backpatch to all supported branches	2016-04-08 21:25:14 +03:00
Peter Eisentraut	7c7d4fddab	Distrust external OpenSSL clients; clear err queue OpenSSL has an unfortunate tendency to mix per-session state error handling with per-thread error handling. This can cause problems when programs that link to libpq with OpenSSL enabled have some other use of OpenSSL; without care, one caller of OpenSSL may cause problems for the other caller. Backend code might similarly be affected, for example when a third party extension independently uses OpenSSL without taking the appropriate precautions. To fix, don't trust other users of OpenSSL to clear the per-thread error queue. Instead, clear the entire per-thread queue ahead of certain I/O operations when it appears that there might be trouble (these I/O operations mostly need to call SSL_get_error() to check for success, which relies on the queue being empty). This is slightly aggressive, but it's pretty clear that the other callers have a very dubious claim to ownership of the per-thread queue. Do this is both frontend and backend code. Finally, be more careful about clearing our own error queue, so as to not cause these problems ourself. It's possibly that control previously did not always reach SSLerrmessage(), where ERR_get_error() was supposed to be called to clear the queue's earliest code. Make sure ERR_get_error() is always called, so as to spare other users of OpenSSL the possibility of similar problems caused by libpq (as opposed to problems caused by a third party OpenSSL library like PHP's OpenSSL extension). Again, do this is both frontend and backend code. See bug #12799 and https://bugs.php.net/bug.php?id=68276 Based on patches by Dave Vitek and Peter Eisentraut. From: Peter Geoghegan <pg@bowt.ie>	2016-04-08 14:11:56 -04:00
Tom Lane	34c33a1f00	Add BSD authentication method. Create a "bsd" auth method that works the same as "password" so far as clients are concerned, but calls the BSD Authentication service to check the password. This is currently only available on OpenBSD. Marisa Emerson, reviewed by Thomas Munro	2016-04-08 13:52:06 -04:00
Robert Haas	af025eed53	Add combine functions for various floating-point aggregates. This allows parallel aggregation to use them. It may seem surprising that we use float8_combine for both float4_accum and float8_accum transition functions, but that's because those functions differ only in the type of the non-transition-state argument. Haribabu Kommi, reviewed by David Rowley and Tomas Vondra	2016-04-08 13:47:06 -04:00
Teodor Sigaev	1ec4c7c055	Restore original tsquery operation numbering. As noticed by Tom Lane changing operation's number in commit `bb140506df` causes on-disk format incompatibility. Revert to previous numbering, that is reason to add special array to store priorities of operation. Also it reverts order of tsquery to previous. Author: Dmitry Ivanov	2016-04-08 20:11:30 +03:00
Teodor Sigaev	386e3d7609	CREATE INDEX ... INCLUDING (column[, ...]) Now indexes (but only B-tree for now) can contain "extra" column(s) which doesn't participate in index structure, they are just stored in leaf tuples. It allows to use index only scan by using single index instead of two or more indexes. Author: Anastasia Lubennikova with minor editorializing by me Reviewers: David Rowley, Peter Geoghegan, Jeff Janes	2016-04-08 19:45:59 +03:00
Peter Eisentraut	339025c68f	Replace printf format %i by %d see also `ce8d7bb644`	2016-04-08 12:42:58 -04:00
Tom Lane	93c301fc4f	Fix multiple bugs in tablespace symlink removal. Don't try to examine S_ISLNK(st.st_mode) after a failed lstat(). It's undefined. Also, if the lstat() reported ENOENT, we do not wish that to be a hard error, but the code might nonetheless treat it as one (giving an entirely misleading error message, too) depending on luck-of-the-draw as to what S_ISLNK() returned. Don't throw error for ENOENT from rmdir(), either. (We're not really expecting ENOENT because we just stat'd the file successfully; but if we're going to allow ENOENT in the symlink code path, surely the directory code path should too.) Generate an appropriate errcode for its-the-wrong-type-of-file complaints. (ERRCODE_SYSTEM_ERROR doesn't seem appropriate, and failing to write errcode() around it certainly doesn't work, and not writing an errcode at all is not per project policy.) Valgrind noticed the undefined S_ISLNK result; the other problems emerged while reading the code in the area. All of this appears to have been introduced in `8f15f74a44`. Back-patch to 9.5 where that commit appeared.	2016-04-08 12:31:53 -04:00
Andres Freund	5364b357fb	Increase maximum number of clog buffers. Benchmarking has shown that the current number of clog buffers limits scalability. We've previously increased the number in `33aaa139`, but that's not sufficient with a large number of clients. We've benchmarked the cost of increasing the limit by benchmarking worst case scenarios; testing showed that 128 buffers don't cause a regression, even in contrived scenarios, whereas 256 does There are a number of more complex patches flying around to address various clog scalability problems, but this is simple enough that we can get it into 9.6; and is beneficial even after those patches have been applied. It is a bit unsatisfactory to increase this in small steps every few releases, but a better solution seems to require a rewrite of slru.c; not something done quickly. Author: Amit Kapila and Andres Freund Discussion: CAA4eK1+-=18HOrdqtLXqOMwZDbC_15WTyHiFruz7BvVArZPaAw@mail.gmail.com	2016-04-08 08:25:59 -07:00
Robert Haas	25fe8b5f1a	Add a 'parallel_degree' reloption. The code that estimates what parallel degree should be uesd for the scan of a relation is currently rather stupid, so add a parallel_degree reloption that can be used to override the planner's rather limited judgement. Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila, and me. Some further hacking by me.	2016-04-08 11:14:56 -04:00
Robert Haas	b0b64f6505	Attempt to fix breakage due to declaration following code. Per Tom Lane and the buildfarm.	2016-04-08 10:52:56 -04:00
Peter Eisentraut	2f1d2b7a75	Set PAM_RHOST item for PAM authentication The PAM_RHOST item is set to the remote IP address or host name and can be used by PAM modules. A pg_hba.conf option is provided to choose between IP address and resolved host name. From: Grzegorz Sampolski <grzsmp@gmail.com> Reviewed-by: Haribabu Kommi <kommi.haribabu@gmail.com>	2016-04-08 10:48:44 -04:00
Teodor Sigaev	4e55b3f033	Rename comparePos() to compareWordEntryPos() Rename comparePos() to compareWordEntryPos() to prevent export of too generic name. Per gripe from Tom Lane.	2016-04-08 12:04:15 +03:00
Robert Haas	0711803775	Use quicksort, not replacement selection, for external sorting. We still use replacement selection for the first run of the sort only and only when the number of tuples is relatively small. Otherwise, the first run, and subsequent runs in all cases, are produced using quicksort. This tends to be faster except perhaps for very small amounts of working memory. Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy, Greg Stark, and me.	2016-04-08 02:36:26 -04:00
Robert Haas	719c84c1be	Extend relations multiple blocks at a time to improve scalability. Contention on the relation extension lock can become quite fierce when multiple processes are inserting data into the same relation at the same time at a high rate. Experimentation shows the extending the relation multiple blocks at a time improves scalability. Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.	2016-04-08 02:04:46 -04:00
Simon Riggs	137805f89a	Use Foreign Key relationships to infer multi-column join selectivity In cases where joins use multiple columns we currently assess each join separately causing gross mis-estimates for join cardinality. This patch adds use of FK information for the first time into the planner. When FKs are present and we have multi-column join information, plan estimates will be drastically improved. Cases with multiple FKs are handled, though partial matches are ignored currently. Net effect is substantial performance improvements for joins in many common cases. Additional planning time is isolated to cases that are currently performing poorly, measured at 0.08 - 0.15 ms. Please watch for planner performance regressions; circumstances seem unlikely but the law of unintended consequences may apply somewhen. Additional complex tests welcome to prove this before release. Tests can be performed using SET enable_fkey_estimates = on \| off using scripts provided during Hackers discussions, message id: 552335D9.3090707@2ndquadrant.com Authors: Tomas Vondra and David Rowley Reviewed and tested by Simon Riggs, adding comments only	2016-04-08 02:51:09 +01:00
Teodor Sigaev	3308467905	Zeroing unused parts ducring tsquery construction. Per investigation failure skink buildfarm member and RANDOMIZE_ALLOCATED_MEMORY help	2016-04-07 20:45:24 +03:00
Tom Lane	f338dd7585	Refactor join_is_removable() to separate out distinctness-proving logic. Extracted from pending unique-join patch, since this is a rather large delta but it's simply moving code out into separately-accessible subroutines. I (tgl) did choose to add a bit more logic to rel_supports_distinctness, so that it verifies that there's at least one potentially usable unique index rather than just checking indexlist != NIL. Otherwise there's no functional change here. David Rowley	2016-04-07 13:12:31 -04:00
Kevin Grittner	fcff8a5751	Detect SSI conflicts before reporting constraint violations While prior to this patch the user-visible effect on the database of any set of successfully committed serializable transactions was always consistent with some one-at-a-time order of execution of those transactions, the presence of declarative constraints could allow errors to occur which were not possible in any such ordering, and developers had no good workarounds to prevent user-facing errors where they were not necessary or desired. This patch adds a check for serialization failure ahead of duplicate key checking so that if a developer explicitly (redundantly) checks for the pre-existing value they will get the desired serialization failure where the problem is caused by a concurrent serializable transaction; otherwise they will get a duplicate key error. While it would be better if the reads performed by the constraints could count as part of the work of the transaction for serialization failure checking, and we will hopefully get there some day, this patch allows a clean and reliable way for developers to work around the issue. In many cases existing code will already be doing the right thing for this to "just work". Author: Thomas Munro, with minor editing of docs by me Reviewed-by: Marko Tiikkaja, Kevin Grittner	2016-04-07 11:12:35 -05:00
Teodor Sigaev	bb140506df	Phrase full text search. Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery. On-disk and binary in/out format of tsquery are backward compatible. It has two side effect: - change order for tsquery, so, users, who has a btree index over tsquery, should reindex it - less number of parenthesis in tsquery output, and tsquery becomes more readable Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov Reviewers: Alexander Korotkov, Artur Zakirov	2016-04-07 18:44:18 +03:00
Simon Riggs	015e88942a	Load FK defs into relcache for use by planner Fastpath ignores this if no triggers defined. Author: Tomas Vondra, with fastpath and comments added by me Reviewers: David Rowley, Simon Riggs	2016-04-07 12:08:33 +01:00
Noah Misch	f2b1b3079c	Standardize GetTokenInformation() error reporting. Commit `c22650cd64` sparked a discussion about diverse interpretations of "token user" in error messages. Expel old and new specimens of that phrase by making all GetTokenInformation() callers report errors the way GetTokenUser() has been reporting them. These error conditions almost can't happen, so users are unlikely to observe this change. Reviewed by Tom Lane and Stephen Frost.	2016-04-06 23:41:43 -04:00
Stephen Frost	1574783b4c	Use GRANT system to manage access to sensitive functions Now that pg_dump will properly dump out any ACL changes made to functions which exist in pg_catalog, switch to using the GRANT system to manage access to those functions. This means removing 'if (!superuser()) ereport()' checks from the functions themselves and then REVOKEing EXECUTE right from 'public' for these functions in system_views.sql. Reviews by Alexander Korotkov, Jose Luis Tallon	2016-04-06 21:45:32 -04:00
Stephen Frost	23f34fa4ba	In pg_dump, include pg_catalog and extension ACLs, if changed Now that all of the infrastructure exists, add in the ability to dump out the ACLs of the objects inside of pg_catalog or the ACLs for objects which are members of extensions, but only if they have been changed from their original values. The original values are tracked in pg_init_privs. When pg_dump'ing 9.6-and-above databases, we will dump out the ACLs for all objects in pg_catalog and the ACLs for all extension members, where the ACL has been changed from the original value which was set during either initdb or CREATE EXTENSION. This should not change dumps against pre-9.6 databases. Reviews by Alexander Korotkov, Jose Luis Tallon	2016-04-06 21:45:32 -04:00
Stephen Frost	6c268df127	Add new catalog called pg_init_privs This new catalog holds the privileges which the system was initialized with at initdb time, along with any permissions set by extensions at CREATE EXTENSION time. This allows pg_dump (and any other similar use-cases) to detect when the privileges set on initdb-created or extension-created objects have been changed from what they were set to at initdb/extension-creation time and handle those changes appropriately. Reviews by Alexander Korotkov, Jose Luis Tallon	2016-04-06 21:45:32 -04:00
Teodor Sigaev	0b62fd036e	Add jsonb_insert It inserts a new value into an jsonb array at arbitrary position or a new key to jsonb object. Author: Dmitry Dolgov Reviewers: Petr Jelinek, Vitaly Burovoy, Andrew Dunstan	2016-04-06 19:25:00 +03:00
Tom Lane	de94e2af18	Run pgindent on a batch of (mostly-planner-related) source files. Getting annoyed at the amount of unrelated chatter I get from pgindent'ing Rowley's unique-joins patch. Re-indent all the files it touches.	2016-04-06 11:34:02 -04:00
Fujii Masao	ead9963c47	Use proper format specifier %X/%X for LSN, again. Commit `cee31f5` fixed this problem, but commit `989be08` accidentally reverted the fix. Thomas Munro	2016-04-06 22:20:52 +09:00
Simon Riggs	cac0e36682	Revert `bf08f2292f` Remove recent changes to logging XLOG_RUNNING_XACTS by request.	2016-04-06 14:03:46 +01:00
Simon Riggs	3fe3511d05	Generic Messages for Logical Decoding API and mechanism to allow generic messages to be inserted into WAL that are intended to be read by logical decoding plugins. This commit adds an optional new callback to the logical decoding API. Messages are either text or bytea. Messages can be transactional, or not, and are identified by a prefix to allow multiple concurrent decoding plugins. (Not to be confused with Generic WAL records, which are intended to allow crash recovery of extensible objects.) Author: Petr Jelinek and Andres Freund Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs Discussion: 5685F999.6010202@2ndquadrant.com	2016-04-06 10:05:41 +01:00
Fujii Masao	989be0810d	Support multiple synchronous standby servers. Previously synchronous replication offered only the ability to confirm that all changes made by a transaction had been transferred to at most one synchronous standby server. This commit extends synchronous replication so that it supports multiple synchronous standby servers. It enables users to consider one or more standby servers as synchronous, and increase the level of transaction durability by ensuring that transaction commits wait for replies from all of those synchronous standbys. Multiple synchronous standby servers are configured in synchronous_standby_names which is extended to support new syntax of 'num_sync ( standby_name [ , ... ] )', where num_sync specifies the number of synchronous standbys that transaction commits need to wait for replies from and standby_name is the name of a standby server. The syntax of 'standby_name [ , ... ]' which was used in 9.5 or before is also still supported. It's the same as new syntax with num_sync=1. This commit doesn't include "quorum commit" feature which was discussed in pgsql-hackers. Synchronous standbys are chosen based on their priorities. synchronous_standby_names determines the priority of each standby for being chosen as a synchronous standby. The standbys whose names appear earlier in the list are given higher priority and will be considered as synchronous. Other standby servers appearing later in this list represent potential synchronous standbys. The regression test for multiple synchronous standbys is not included in this commit. It should come later. Authors: Sawada Masahiko, Beena Emerson, Michael Paquier, Fujii Masao Reviewed-By: Kyotaro Horiguchi, Amit Kapila, Robert Haas, Simon Riggs, Amit Langote, Thomas Munro, Sameer Thakur, Suraj Kharage, Abhijit Menon-Sen, Rajeev Rastogi Many thanks to the various individuals who were involved in discussing and developing this feature.	2016-04-06 17:18:25 +09:00
Alvaro Herrera	f2fcad27d5	Support ALTER THING .. DEPENDS ON EXTENSION This introduces a new dependency type which marks an object as depending on an extension, such that if the extension is dropped, the object automatically goes away; and also, if the database is dumped, the object is included in the dump output. Currently the grammar supports this for indexes, triggers, materialized views and functions only, although the utility code is generic so adding support for more object types is a matter of touching the parser rules only. Author: Abhijit Menon-Sen Reviewed-by: Alexander Korotkov, Álvaro Herrera Discussion: http://www.postgresql.org/message-id/20160115062649.GA5068@toroid.org	2016-04-05 18:38:54 -03:00
Robert Haas	41ea0c2376	Fix parallel-safety code for parallel aggregation. has_parallel_hazard() was ignoring the proparallel markings for aggregates, which is no good. Fix that. There was no way to mark an aggregate as actually being parallel-safe, either, so add a PARALLEL option to CREATE AGGREGATE. Patch by me, reviewed by David Rowley.	2016-04-05 16:06:15 -04:00
Robert Haas	09adc9a8c0	Align all shared memory allocations to cache line boundaries. Experimentation shows this only costs about 6kB, which seems well worth it given the major performance effects that can be caused by insufficient alignment, especially on larger systems. Discussion: 14166.1458924422@sss.pgh.pa.us	2016-04-05 15:47:49 -04:00
Robert Haas	11c8669c0c	Add parallel query support functions for assorted aggregates. This lets us use parallel aggregate for a variety of useful cases that didn't work before, like sum(int8), sum(numeric), several versions of avg(), and various other functions. Add some regression tests, as well, testing the general sanity of these and future catalog entries. David Rowley, reviewed by Tomas Vondra, with a few further changes by me.	2016-04-05 14:32:53 -04:00
Magnus Hagander	7117685461	Implement backup API functions for non-exclusive backups Previously non-exclusive backups had to be done using the replication protocol and pg_basebackup. With this commit it's now possible to make them using pg_start_backup/pg_stop_backup as well, as long as the backup program can maintain a persistent connection to the database. Doing this, backup_label and tablespace_map are returned as results from pg_stop_backup() instead of being written to the data directory. This makes the server safe from a crash during an ongoing backup, which can be a problem with exclusive backups. The old syntax of the functions remain and work exactly as before, but since the new syntax is safer this should eventually be deprecated and removed. Only reference documentation is included. The main section on backup still needs to be rewritten to cover this, but since that is already scheduled for a separate large rewrite, it's not included in this patch. Reviewed by David Steele and Amit Kapila	2016-04-05 20:03:49 +02:00
Peter Eisentraut	4dcd4da98c	Fix error message from wal_level value renaming found by Ian Barwick	2016-04-04 21:17:54 -04:00
Tom Lane	99f3b5613b	Disallow newlines in parameter values to be set in ALTER SYSTEM. As noted by Julian Schauder in bug #14063, the configuration-file parser doesn't support embedded newlines in string literals. While there might someday be a good reason to remove that restriction, there doesn't seem to be one right now. However, ALTER SYSTEM SET could accept strings containing newlines, since many of the variable-specific value-checking routines would just see a newline as whitespace. This led to writing a postgresql.auto.conf file that was broken and had to be removed manually. Pending a reason to work harder, just throw an error if someone tries this. In passing, fix several places in the ALTER SYSTEM logic that failed to provide an errcode() for an ereport(), and thus would falsely log the failure as an internal XX000 error. Back-patch to 9.4 where ALTER SYSTEM was introduced.	2016-04-04 18:05:23 -04:00
Alvaro Herrera	890614d2b3	Display WAL pointer in rm_redo error callback This makes it easier to identify the source of a recovery problem in case of a bug or data corruption.	2016-04-04 18:12:12 -03:00
Tom Lane	3c69b33f45	Add a few comments about ANALYZE's strategy for collecting MCVs. Alex Shulgin complained that the underlying strategy wasn't all that apparent, particularly not the fact that we intentionally have two code paths depending on whether we think the column has a limited set of possible values or not. Try to make it clearer.	2016-04-04 17:06:33 -04:00
Tom Lane	391159e03a	Partially revert commit `3d3bf62f30`. On reflection, the pre-existing logic in ANALYZE is specifically meant to compare the frequency of a candidate MCV against the estimated frequency of a random distinct value across the whole table. The change to compare it against the average frequency of values actually seen in the sample doesn't seem very principled, and if anything it would make us less likely not more likely to consider a value an MCV. So revert that, but keep the aspect of considering only nonnull values, which definitely is correct. In passing, rename the local variables in these stanzas to "ndistinct_table", to avoid confusion with the "ndistinct" that appears at an outer scope in compute_scalar_stats.	2016-04-04 16:48:13 -04:00
Alvaro Herrera	c9ff752a85	Silence compiler warning Reported by Peter Eisentraut to occur on 32bit systems	2016-04-04 17:07:23 -03:00
Tom Lane	66229ac004	Introduce a LOG_SERVER_ONLY ereport level, which is never sent to client. This elevel is useful for logging audit messages and similar information that should not be passed to the client. It's equivalent to LOG in terms of decisions about logging priority in the postmaster log, but messages with this elevel will never be sent to the client. In the current implementation, it's just an alias for the longstanding COMMERROR elevel (or more accurately, we've made COMMERROR an alias for this). At some point it might be interesting to allow a LOG_ONLY flag to be attached to any elevel, but that would be considerably more complicated, and it's not clear there's enough use-cases to justify the extra work. For now, let's just take the easy 90% solution. David Steele, reviewed by Fabien Coelho, Petr Jelínek, and myself	2016-04-04 12:32:42 -04:00
Tom Lane	58666ed28a	Fix latent portability issue in pgwin32_dispatch_queued_signals(). The first iteration of the signal-checking loop would compute sigmask(0) which expands to 1<<(-1) which is undefined behavior according to the C standard. The lack of field reports of trouble suggest that it evaluates to 0 on all existing Windows compilers, but that's hardly something to rely on. Since signal 0 isn't a queueable signal anyway, we can just make the loop iterate from 1 instead, and save a few cycles as well as avoiding the undefined behavior. In passing, avoid evaluating the volatile expression UNBLOCKED_SIGNAL_QUEUE twice in a row; there's no reason to waste cycles like that. Noted by Aleksander Alekseev, though this isn't his proposed fix. Back-patch to all supported branches.	2016-04-04 11:13:17 -04:00
Dean Rasheed	84f9a35e39	Improve estimate of distinct values in estimate_num_groups(). When adjusting the estimate for the number of distinct values from a rel in a grouped query to take into account the selectivity of the rel's restrictions, use a formula that is less likely to produce under-estimates. The old formula simply multiplied the number of distinct values in the rel by the restriction selectivity, which would be correct if the restrictions were fully correlated with the grouping expressions, but can produce significant under-estimates in cases where they are not well correlated. The new formula is based on the random selection probability, and so assumes that the restrictions are not correlated with the grouping expressions. This is guaranteed to produce larger estimates, and of course risks over-estimating in cases where the restrictions are correlated, but that has less severe consequences than under-estimating, which might lead to a HashAgg that consumes an excessive amount of memory. This could possibly be improved upon in the future by identifying correlated restrictions and using a hybrid of the old and new formulae. Author: Tomas Vondra, with some hacking be me Reviewed-by: Mark Dilger, Alexander Korotkov, Dean Rasheed and Tom Lane Discussion: http://www.postgresql.org/message-id/flat/56CD0381.5060502@2ndquadrant.com	2016-04-04 12:41:56 +01:00
Simon Riggs	bf08f2292f	Avoid archiving XLOG_RUNNING_XACTS on idle server If archive_timeout > 0 we should avoid logging XLOG_RUNNING_XACTS if idle. Bug 13685 reported by Laurence Rowe, investigated in detail by Michael Paquier, though this is not his proposed fix. 20151016203031.3019.72930@wrigleys.postgresql.org Simple non-invasive patch to allow later backpatch to 9.4 and 9.5	2016-04-04 07:18:05 +01:00
Simon Riggs	3e4b7d8798	Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require complex interlocking that matched the requirements on the master. This required an O(N) operation that became a significant problem with large indexes, causing replication delays of seconds or in some cases minutes while the XLOG_BTREE_VACUUM was replayed. This commit skips the pin scan that was previously required, by observing in detail when and how it is safe to do so, with full documentation. The pin scan is skipped only in replay; the VACUUM code path on master is not touched here and WAL is identical. The current commit applies in all cases, effectively replacing commit `687f2cd7a0`.	2016-04-03 17:46:09 +01:00
Tom Lane	a1953f3a60	Make all the declarations of WaitEventSetWaitBlock be marked "inline". The inconsistency here triggered compiler warnings on some buildfarm members, and it's surely pretty pointless.	2016-04-02 13:55:44 -04:00
Noah Misch	c22650cd64	Refer to a TOKEN_USER payload as a "token user," not as a "user token". This corrects messages for can't-happen errors. The corresponding "user token" appears in the HANDLE argument of GetTokenInformation().	2016-04-01 21:53:18 -04:00
Noah Misch	4ad6f13500	Copyedit comments and documentation.	2016-04-01 21:53:10 -04:00
Tom Lane	3d3bf62f30	Omit null rows when setting the threshold for what's a most-common value. As with the previous patch, large numbers of null rows could skew this calculation unfavorably, causing us to discard values that have a legitimate claim to be MCVs, since our definition of MCV is that it's most common among the non-null population of the column. Hence, make the numerator of avgcount be the number of non-null sample values not the number of sample rows; likewise for maxmincount in the compute_scalar_stats variant. Also, make the denominator be the number of distinct values actually observed in the sample, rather than reversing it back out of the computed stadistinct. This avoids depending on the accuracy of the Haas-Stokes approximation, and really it's what we want anyway; the threshold should depend only on what we see in the sample, not on what we extrapolate about the contents of the whole column. Alex Shulgin, reviewed by Tomas Vondra and myself	2016-04-01 17:03:27 -04:00
Tom Lane	be4b4dc759	Omit null rows when applying the Haas-Stokes estimator for ndistinct. Previously, we included null rows in the values of n and N that went into the formula, which amounts to considering null as a value in its own right; but the d and f1 values do not include nulls. This is inconsistent, and it contributes to significant underestimation of ndistinct when the column is mostly nulls. In any case stadistinct is defined as the number of distinct non-null values, so we should exclude nulls when doing this computation. This is an aboriginal bug in our application of the Haas-Stokes formula, but we'll refrain from back-patching for fear of destabilizing plan choices in released branches. While at it, make the code a bit more readable by omitting unnecessary casts and intermediate variables. Observation and original patch by Tomas Vondra, adjusted to fix both uses of the formula by Alex Shulgin, cosmetic improvements by me	2016-04-01 15:48:24 -04:00
Alvaro Herrera	f402b99501	Type names should not be quoted Our actual convention, contrary to what I said in `59a2111b23`, is not to quote type names, as evidenced by unquoted use of format_type_be() result value in error messages. Remove quotes from recently tweaked messages accordingly. Per note from Tom Lane	2016-04-01 13:35:48 -03:00
Teodor Sigaev	65578341af	Add Generic WAL interface This interface is designed to give an access to WAL for extensions which could implement new access method, for example. Previously it was impossible because restoring from custom WAL would need to access system catalog to find a redo custom function. This patch suggests generic way to describe changes on page with standart layout. Bump XLOG_PAGE_MAGIC because of new record type. Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and minor editorization by my Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby, Michael Paquier	2016-04-01 12:21:48 +03:00
Tom Lane	f9aefcb91f	Support using index-only scans with partial indexes in more cases. Previously, the planner would reject an index-only scan if any restriction clause for its table used a column not available from the index, even if that restriction clause would later be dropped from the plan entirely because it's implied by the index's predicate. This is a fairly common situation for partial indexes because predicates using columns not included in the index are often the most useful kind of predicate, and we have to duplicate (or at least imply) the predicate in the WHERE clause in order to get the index to be considered at all. So index-only scans were essentially unavailable with such partial indexes. To fix, we have to do detection of implied-by-predicate clauses much earlier in the planner. This patch puts it in check_index_predicates (nee check_partial_indexes), meaning it gets done for every partial index, whereas we previously only considered this issue at createplan time, so that the work was only done for an index actually selected for use. That could result in a noticeable planning slowdown for queries against tables with many partial indexes. However, testing suggested that there isn't really a significant cost, especially not with reasonable numbers of partial indexes. We do get a small additional benefit, which is that cost_index is more accurate since it correctly discounts the evaluation cost of clauses that will be removed. We can also avoid considering such clauses as potential indexquals, which saves useless matching cycles in the case where the predicate columns aren't in the index, and prevents generating bogus plans that double-count the clause's selectivity when the columns are in the index. Tomas Vondra and Kyotaro Horiguchi, reviewed by Kevin Grittner and Konstantin Knizhnik, and whacked around a little by me	2016-03-31 14:49:10 -04:00
Alvaro Herrera	3501f71c21	Fix broken variable declaration Author: Konstantin Knizhnik	2016-03-30 23:39:15 -03:00
Fujii Masao	cee31f5fee	Use proper format specifier %X/%X for LSN.	2016-03-31 11:03:40 +09:00
Alvaro Herrera	24c5f1a103	Enable logical slots to follow timeline switches When decoding from a logical slot, it's necessary for xlog reading to be able to read xlog from historical (i.e. not current) timelines; otherwise, decoding fails after failover, because the archives are in the historical timeline. This is required to make "failover logical slots" possible; it currently has no other use, although theoretically it could be used by an extension that creates a slot on a standby and continues to replay from the slot when the standby is promoted. This commit includes a module in src/test/modules with functions to manipulate the slots (which is not otherwise possible in SQL code) in order to enable testing, and a new test in src/test/recovery to ensure that the behavior is as expected. Author: Craig Ringer Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek	2016-03-30 20:07:05 -03:00
Alvaro Herrera	3b02ea4f07	XLogReader general code cleanup Some minor tweaks and comment additions, for cleanliness sake and to avoid having the upcoming timeline-following patch be polluted with unrelated cleanup. Extracted from a larger patch by Craig Ringer, reviewed by Andres Freund, with some additions by myself.	2016-03-30 18:56:13 -03:00
Tom Lane	50861cd683	Improve portability of I/O behavior for the geometric types. Formerly, the geometric I/O routines such as box_in and point_out relied directly on strtod() and sprintf() for conversion of the float8 component values of their data types. However, the behavior of those functions is pretty platform-dependent, especially for edge-case values such as infinities and NaNs. This was exposed by commit `acdf2a8b37`, which added test cases involving boxes with infinity endpoints, and immediately failed on Windows and AIX buildfarm members. We solved these problems years ago in the main float8in and float8out functions, so let's fix it by making the geometric types use that code instead of depending directly on the platform-supplied functions. To do this, refactor the float8in code so that it can be used to parse just part of a string, and as a convenience make the guts of float8out usable without going through DirectFunctionCall. While at it, get rid of geo_ops.c's fairly shaky assumptions about the maximum output string length for a double, by having it build results in StringInfo buffers instead of fixed-length strings. In passing, convert all the "invalid input syntax for type foo" messages in this area of the code into "invalid input syntax for type %s" to reduce the number of distinct translatable strings, per recent discussion. We would have needed a fair number of the latter anyway for code-sharing reasons, so we might as well just go whole hog. Note: this patch is by no means intended to guarantee that the geometric types uniformly behave sanely for infinity or NaN component values. But any bugs we have in that line were there all along, they were just harder to reach in a platform-independent way.	2016-03-30 17:25:03 -04:00
Tom Lane	818e593736	Suppress uninitialized-variable warnings. My compiler doesn't like the lack of initialization of "flag", and I think it's right: if there were zero keys we'd have an undefined result. The AND of zero items is TRUE, so initialize to TRUE.	2016-03-30 13:36:18 -04:00
Teodor Sigaev	acdf2a8b37	Introduce SP-GiST operator class over box. Patch implements quad-tree over boxes, naive approach of 2D quad tree will not work for any non-point objects because splitting space on node is not efficient. The idea of pathc is treating 2D boxes as 4D points, so, object will not overlap (in 4D space). The performance tests reveal that this technique especially beneficial with too much overlapping objects, so called "spaghetti data". Author: Alexander Lebedev with editorization by Emre Hasegeli and me	2016-03-30 18:42:36 +03:00
Teodor Sigaev	87545f5412	Use traversalValue in SP-GiST range opclass. Author: Alexander Lebedev	2016-03-30 18:38:53 +03:00
Teodor Sigaev	ccd6eb49a4	Introduce traversalValue for SP-GiST scan During scan sometimes it would be very helpful to know some information about parent node or all ancestor nodes. Right now reconstructedValue could be used but it's not a right usage of it (range opclass uses that). traversalValue is arbitrary piece of memory in separate MemoryContext while reconstructedVale should have the same type as indexed column. Subsequent patches for range opclass and quad4d tree will use it. Author: Alexander Lebedev, Teodor Sigaev	2016-03-30 18:29:28 +03:00
Robert Haas	314cbfc5da	Add new replication mode synchronous_commit = 'remote_apply'. In this mode, the master waits for the transaction to be applied on the remote side, not just written to disk. That means that you can count on a transaction started on the standby to see all commits previously acknowledged by the master. To make this work, the standby sends a reply after replaying each commit record generated with synchronous_commit >= 'remote_apply'. This introduces a small inefficiency: the extra replies will be sent even by standbys that aren't the current synchronous standby. But previously-existing synchronous_commit levels make no attempt at all to optimize which replies are sent based on what the primary cares about, so this is no worse, and at least avoids any extra replies for people not using the feature at all. Thomas Munro, reviewed by Michael Paquier and by me. Some additional tweaks by me.	2016-03-29 21:29:49 -04:00
Tom Lane	a898b409f6	Fix interval_mul() to not produce insane results. interval_mul() attempts to prevent its calculations from producing silly results, but it forgot that zero times infinity yields NaN in IEEE arithmetic. Hence, a case like '1 second'::interval * 'infinity'::float8 produced a NaN for the months product, which didn't trigger the range check, resulting in bogus and possibly platform-dependent output. This isn't terribly obvious to the naked eye because if you try that exact case, you get "interval out of range" which is what you expect --- but if you look closer, the error is coming from interval_out not interval_mul. interval_mul has allowed a bogus value into the system. Fix by adding isnan tests. Noted while testing Vitaly Burovoy's fix for infinity input to to_timestamp(). Given the lack of field complaints, I doubt this is worth a back-patch.	2016-03-29 17:21:12 -04:00
Tom Lane	e511d878f3	Allow to_timestamp(float8) to convert float infinity to timestamp infinity. With the original SQL-function implementation, such cases failed because we don't support infinite intervals. Converting the function to C lets us bypass the interval representation, which should be a bit faster as well as more flexible. Vitaly Burovoy, reviewed by Anastasia Lubennikova	2016-03-29 17:09:29 -04:00
Robert Haas	96f8373cad	Fix bug in aggregate (de)serialization commit. resulttypeLen and resulttypeByVal must be set correctly when serializing aggregates, not just when finalizing them. This was in David's final patch but I downloaded the wrong version by mistake and failed to spot the error. David Rowley	2016-03-29 15:21:57 -04:00
Robert Haas	5fe5a2cee9	Allow aggregate transition states to be serialized and deserialized. This is necessary infrastructure for supporting parallel aggregation for aggregates whose transition type is "internal". Such values can't be passed between cooperating processes, because they are just pointers. David Rowley, reviewed by Tomas Vondra and by me.	2016-03-29 15:04:05 -04:00
Tom Lane	7abc157165	Avoid possibly-unsafe use of Windows' FormatMessage() function. Whenever this function is used with the FORMAT_MESSAGE_FROM_SYSTEM flag, it's good practice to include FORMAT_MESSAGE_IGNORE_INSERTS as well. Otherwise, if the message contains any %n insertion markers, the function will try to fetch argument strings to substitute --- which we are not passing, possibly leading to a crash. This is exactly analogous to the rule about not giving printf() a format string you're not in control of. Noted and patched by Christian Ullrich. Back-patch to all supported branches.	2016-03-29 11:55:19 -04:00
Teodor Sigaev	61d66c44f1	Fix support of digits in email/hostnames. When tsearch was implemented I did several mistakes in hostname/email definition rules: 1) allow underscore in hostname what prohibited by RFC 2) forget to allow leading digits separated by hyphen (like 123-x.com) in hostname 3) do no allow underscore/hyphen after leading digits in localpart of email Artur's patch resolves two last issues, but by the way allows hosts name like 123_x.com together with 123-x.com. RFC forbids underscore usage in hostname but pg allows that since initial tsearch version in core, although only for non-digits. Patch syncs support digits and nondigits in both hostname and email. Forbidding underscore in hostname may break existsing usage of tsearch and, anyhow, it should be done by separate patch. Author: Artur Zakirov BUG: #13964	2016-03-29 18:28:49 +03:00
Robert Haas	f9143d102f	Rework custom scans to work more like the new extensible node stuff. Per discussion, the new extensible node framework is thought to be better designed than the custom path/scan/scanstate stuff we added in PostgreSQL 9.5. Rework the latter to be more like the former. This is not backward-compatible, but we generally don't promise that for C APIs, and there probably aren't many people using this yet anyway. KaiGai Kohei, reviewed by Petr Jelinek and me. Some further cosmetic changes by me.	2016-03-29 11:28:04 -04:00
Robert Haas	5d4171d1c7	Don't require a user mapping for FDWs to work. Commit `fbe5a3fb73` accidentally changed this behavior; put things back the way they were, and add some regression tests. Report by Andres Freund; patch by Ashutosh Bapat, with a bit of kibitzing by me.	2016-03-28 21:50:28 -04:00
Robert Haas	bd0f206f55	Fix typo in comment. Thomas Munro	2016-03-28 20:55:15 -04:00
Tom Lane	e5a4dea80f	Document errhidecontext() where it ought to be documented. Seems to have been missed when this function was added. Noted while looking at David Steele's proposal to add another similar function.	2016-03-28 14:18:14 -04:00
Alvaro Herrera	59a2111b23	Improve internationalization of messages involving type names Change the slightly different variations of the message function FOO must return type BAR to a single wording, removing the variability in type name so that they all create a single translation entry; since the type name is not to be translated, there's no point in it being part of the message anyway. Also, change them all to use the same quoting convention, namely that the function name is not to be quoted but the type name is. (I'm not quite sure why this is so, but it's the clear majority.) Some similar messages such as "encoding conversion function FOO must ..." are also changed.	2016-03-28 14:24:37 -03:00
Stephen Frost	86ebf30fd6	Reset plan->row_security_env and planUserId In the plancache, we check if the environment we planned the query under has changed in a way which requires us to re-plan, such as when the user for whom the plan was prepared changes and RLS is being used (and, therefore, there may be different policies to apply). Unfortunately, while those values were set and checked, they were not being reset when the query was re-planned and therefore, in cases where we change role, re-plan, and then change role again, we weren't re-planning again. This leads to potentially incorrect policies being applied in cases where role-specific policies are used and a given query is planned under one role and then executed under other roles, which could happen under security definer functions or when a common user and query is planned initially and then re-used across multiple SET ROLEs. Further, extensions which made use of CopyCachedPlan() may suffer from similar issues as the RLS-related fields were not properly copied as part of the plan and therefore RevalidateCachedQuery() would copy in the current settings without invalidating the query. Fix by using the same approach used for 'search_path', where we set the correct values in CompleteCachedPlan(), check them early on in RevalidateCachedQuery() and then properly reset them if re-planning. Also, copy through the values during CopyCachedPlan(). Pointed out by Ashutosh Bapat. Reviewed by Michael Paquier. Back-patch to 9.5 where RLS was introduced. Security: CVE-2016-2193	2016-03-28 09:03:20 -04:00
Tom Lane	d12e5bb79b	Code and docs review for commit `3187d6de0e`. Fix up check for high-bit-set characters, which provoked "comparison is always true due to limited range of data type" warnings on some compilers, and was unlike the way we do it elsewhere anyway. Fix omission of "$" from the set of valid identifier continuation characters. Get rid of sanitize_text(), which was utterly inconsistent with any other error report anywhere in the system, and wasn't even well designed on its own terms (double-quoting the result string without escaping contained double quotes doesn't seem very well thought out). Fix up error messages, which didn't follow the message style guidelines very well, and were overly specific in situations where the actual mistake might not be what they said. Improve documentation. (I started out just intending to fix the compiler warning, but the more I looked at the patch the less I liked it.)	2016-03-28 01:00:30 -04:00
Tom Lane	d65b665d52	Guard against zero vardata.rel->tuples in estimate_hash_bucketsize(). If the referenced rel was proven empty, we'd compute 0/0 here, which results in the function returning NaN. That's a bit more serious than the other zero-divide case. Still, it only seems to be possible in HEAD, so no back-patch. Per report from Piotr Stefaniak. I looked through the rest of selfuncs.c and found no other likely trouble spots.	2016-03-27 18:21:03 -04:00
Tom Lane	fa09f89351	Clamp adjusted ndistinct to positive integer in estimate_hash_bucketsize(). This avoids a possible divide-by-zero in the following calculation, and rounding the number to an integer seems like saner behavior anyway. Assuming IEEE math, the division would yield +Infinity which would get replaced by 1.0 at the bottom of the function, so nothing really interesting would ensue; but avoiding divide-by-zero seems like a good idea on general principles. Per report from Piotr Stefaniak. No back-patch since this seems mostly cosmetic.	2016-03-27 18:07:16 -04:00
Andres Freund	9f7c527af3	Fix LWLockReportWaitEnd() parameter list to be (void). Previously it was an "old style" function declaration.	2016-03-27 22:53:31 +02:00
Andres Freund	1a7a43672b	Don't use !! but != 0/NULL to force boolean evaluation. I introduced several uses of !! to force bit arithmetic to be boolean, but per discussion the project prefers != 0/NULL. Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com	2016-03-27 18:10:19 +02:00
Tom Lane	76281aa964	Avoid a couple of zero-divide scenarios in the planner. cost_subplan() supposed that the given subplan must have plan_rows > 0, which as far as I can tell was true until recent refactoring of the code in createplan.c; but now that code allows the Result for a provably empty subquery to have plan_rows = 0. Rather than undo that change, put in a clamp to prevent zero divide. get_cheapest_fractional_path() likewise supposed that best_path->rows > 0. This assumption has been wrong for longer. It's actually harmless given IEEE float math, because a positive value divided by zero gives +Infinity and compare_fractional_path_costs() will do the right thing with that. Still, best not to assume that. final_cost_nestloop() also seems to have some risks in this area, so borrow the clamping logic already present in the mergejoin cost functions. Lastly, remove unnecessary clamp_row_est() in planner.c's calls to get_number_of_groups(). The only thing that function does with path_rows is pass it to estimate_num_groups() which already has an internal clamp, so we don't need the extra call; and if we did, the callers are arguably the wrong place for it anyway. First two items reported by Piotr Stefaniak, the others are products of my nosing around for similar problems. No back-patch since there's no evidence that problems arise in the back branches.	2016-03-26 12:03:12 -04:00
Tom Lane	cd37bb7859	Improve PL/Tcl errorCode facility by providing decoded name for SQLSTATE. We don't really want to encourage people to write numeric SQLSTATEs in programs; that's unreadable and error-prone. Copy plpgsql's infrastructure for converting between SQLSTATEs and exception names shown in Appendix A, and modify examples in tests and documentation to do it that way.	2016-03-25 16:54:52 -04:00
Tom Lane	c94959d411	Fix DROP OPERATOR to reset oprcom/oprnegate links to the dropped operator. This avoids leaving dangling links in pg_operator; which while fairly harmless are also unsightly. While we're at it, simplify OperatorUpd, which went through heap_modify_tuple for no very good reason considering it had already made a tuple copy it could just scribble on. Roma Sokolov, reviewed by Tomas Vondra, additional hacking by Robert Haas and myself.	2016-03-25 12:33:16 -04:00
Tom Lane	d543170f2f	Don't split up SRFs when choosing to postpone SELECT output expressions. In commit `9118d03a8c` we taught the planner to postpone evaluation of set-returning functions in a SELECT's targetlist until after any sort done to satisfy ORDER BY. However, if we postpone some SRFs this way while others do not get postponed (because they're sort or group key columns) we will break the traditional behavior by which all SRFs in the tlist run in-step during ExecTargetList(), so that you get the least common multiple of their periods not the product. Fix make_sort_input_target() so it will not split up SRF evaluation in such cases. There is still a hazard of similar odd behavior if there's a SRF in a grouping column and another one that isn't, but that was true before and we're just trying to preserve bug-compatibility with the traditional behavior. This whole area is overdue to be rethought and reimplemented, but we'll try to avoid changing behavior until then. Per report from Regina Obe.	2016-03-25 11:19:51 -04:00
Tom Lane	c1156411ad	Move psql's psqlscan.l into src/fe_utils. This completes (at least for now) the project of getting rid of ad-hoc linkages among the src/bin/ subdirectories. Everything they share is now in src/fe_utils/ and is included from a static library at link time. A side benefit is that we can restore the FLEX_NO_BACKUP check for psqlscanslash.l. We might need to think of another way to do that check if we ever need to build two lexers with that property in the same source directory, but there's no foreseeable reason to need that.	2016-03-24 20:28:47 -04:00
Tom Lane	a376960c8f	Suppress compiler warning for get_am_type_string(). Compilers that don't know that elog(ERROR) doesn't return complained that this function might fail to return a value. Per buildfarm. While at it, const-ify the function's declaration, since the intent is evidently to always return a constant string.	2016-03-24 17:22:24 -04:00
Robert Haas	59a02815e2	Use correct GetDatum function. Oops.	2016-03-24 08:57:48 -04:00
Alvaro Herrera	473b932870	Support CREATE ACCESS METHOD This enables external code to create access methods. This is useful so that extensions can add their own access methods which can be formally tracked for dependencies, so that DROP operates correctly. Also, having explicit support makes pg_dump work correctly. Currently only index AMs are supported, but we expect different types to be added in the future. Authors: Alexander Korotkov, Petr Jelínek Reviewed-By: Teodor Sigaev, Petr Jelínek, Jim Nasby Commitfest-URL: https://commitfest.postgresql.org/9/353/ Discussion: https://www.postgresql.org/message-id/CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com	2016-03-23 23:01:35 -03:00
Tom Lane	2c6af4f442	Move keywords.c/kwlookup.c into src/common/. Now that we have src/common/ for code shared between frontend and backend, we can get rid of (most of) the klugy ways that the keyword table and keyword lookup code were formerly shared between different uses. This is a first step towards a more general plan of getting rid of special-purpose kluges for sharing code in src/bin/. I chose to merge kwlookup.c back into keywords.c, as it once was, and always has been so far as keywords.h is concerned. We could have kept them separate, but there is noplace that uses ScanKeywordLookup without also wanting access to the backend's keyword list, so there seems little point. ecpg is still a bit weird, but at least now the trickiness is documented. I think that the MSVC build script should require no adjustments beyond what's done here ... but we'll soon find out.	2016-03-23 20:22:08 -04:00
Robert Haas	3df9c374e2	Disable abbreviated keys for string-sorting in non-C locales. Unfortunately, every version of glibc thus far tested has bugs whereby strcoll() ordering does not match strxfrm() ordering as required by the standard. This can result in, for example, corrupted indexes. Disabling abbreviated keys in these cases slows down non-C-collation string sorting considerably, but there seems to be no practical alternative. Users who are confident that their libc implementations are solid in this regard can re-enable the optimization by compiling with TRUST_STRXFRM. Users who have built indexes using PostgreSQL 9.5 or PostgreSQL 9.5.1 should REINDEX if there is a possibility that they may have been affected by this problem. Report by Marc-Olaf Jaschke. Investigation mostly by Tom Lane, with help from Peter Geoghegan, Noah Misch, Stephen Frost, and me. Patch by me, reviewed by Peter Geoghegan and Tom Lane.	2016-03-23 16:03:13 -04:00
Robert Haas	44ca4022f3	Partition the freelist for shared dynahash tables. Without this, contention on the freelist can become a pretty serious problem on large servers. Aleksander Alekseev, reviewed by Anastasia Lubennikova, Dilip Kumar, and me.	2016-03-23 11:00:54 -04:00
Tom Lane	ea4b8bd618	Code review for error reports in jsonb_set(). User-facing (even tested by regression tests) error conditions were thrown with elog(), hence had wrong SQLSTATE and were untranslatable. And the error message texts weren't up to project style, either.	2016-03-23 11:00:39 -04:00
Tom Lane	384dfbde19	Fix unsafe use of strtol() on a non-null-terminated Text datum. jsonb_set() could produce wrong answers or incorrect error reports, or in the worst case even crash, when trying to convert a path-array element into an integer for use as an array subscript. Per report from Vitaly Burovoy. Back-patch to 9.5 where the faulty code was introduced (in commit `c6947010ce`). Michael Paquier	2016-03-23 10:43:13 -04:00
Simon Riggs	8320c625d9	Change comment to describe correct lock level used	2016-03-23 11:32:34 +00:00
Tom Lane	71404af2a2	Fix EvalPlanQual bug when query contains both locked and not-locked rels. In commit `afb9249d06`, we (probably I) made ExecLockRows assign null test tuples to all relations of the query while setting up to do an EvalPlanQual recheck for a newly-updated locked row. This was sheerest brain fade: we should only set test tuples for relations that are lockable by the LockRows node, and in particular empty test tuples are only sensible for inheritance child relations that weren't the source of the current tuple from their inheritance tree. Setting a null test tuple for an unrelated table causes it to return NULLs when it should not, as exhibited in bug #14034 from Bronislav Houdek. To add insult to injury, doing it the wrong way required two loops where one would suffice; so the corrected code is even a bit shorter and faster. Add a regression test case based on his example, and back-patch to 9.5 where the bug was introduced.	2016-03-22 17:56:20 -04:00
Robert Haas	ae507d9222	Make max_parallel_degree PGC_USERSET. It was intended to be this way all along, just like other planner GUCs such as work_mem. But I goofed.	2016-03-21 10:54:36 -04:00
Robert Haas	e06a38965b	Support parallel aggregation. Parallel workers can now partially aggregate the data and pass the transition values back to the leader, which can combine the partial results to produce the final answer. David Rowley, based on earlier work by Haribabu Kommi. Reviewed by Álvaro Herrera, Tomas Vondra, Amit Kapila, James Sewell, and me.	2016-03-21 09:30:18 -04:00
Andres Freund	7fa0064092	Properly declare FeBeWaitSet. Surprising that this worked on a number of systems. Reported by buildfarm member longfin.	2016-03-21 12:58:18 +01:00
Andres Freund	98a64d0bd7	Introduce WaitEventSet API. Commit `ac1d794` ("Make idle backends exit if the postmaster dies.") introduced a regression on, at least, large linux systems. Constantly adding the same postmaster_alive_fds to the OSs internal datastructures for implementing poll/select can cause significant contention; leading to a performance regression of nearly 3x in one example. This can be avoided by using e.g. linux' epoll, which avoids having to add/remove file descriptors to the wait datastructures at a high rate. Unfortunately the current latch interface makes it hard to allocate any persistent per-backend resources. Replace, with a backward compatibility layer, WaitLatchOrSocket with a new WaitEventSet API. Users can allocate such a Set across multiple calls, and add more than one file-descriptor to wait on. The latter has been added because there's upcoming postgres features where that will be helpful. In addition to the previously existing poll(2), select(2), WaitForMultipleObjects() implementations also provide an epoll_wait(2) based implementation to address the aforementioned performance problem. Epoll is only available on linux, but that is the most likely OS for machines large enough (four sockets) to reproduce the problem. To actually address the aforementioned regression, create and use a long-lived WaitEventSet for FE/BE communication. There are additional places that would benefit from a long-lived set, but that's a task for another day. Thanks to Amit Kapila, who helped make the windows code I blindly wrote actually work. Reported-By: Dmitry Vasilyev Discussion: CAB-SwXZh44_2ybvS5Z67p_CDz=XFn4hNAD=CnMEF+QqkXwFrGg@mail.gmail.com 20160114143931.GG10941@awork2.anarazel.de	2016-03-21 12:22:54 +01:00
Andres Freund	72e2d21c12	Combine win32 and unix latch implementations. Previously latches for windows and unix had been implemented in different files. A later patch introduce an expanded wait infrastructure, keeping the implementation separate would introduce too much duplication. This basically just moves the functions, without too much change. The reason to keep this separate is that it allows blame to continue working a little less badly; and to make review a tiny bit easier. Discussion: 20160114143931.GG10941@awork2.anarazel.de	2016-03-21 11:03:26 +01:00
Andrew Dunstan	5d03201056	Remove dependency on psed for MSVC builds. Modern Perl has removed psed from its core distribution, so it might not be readily available on some build platforms. We therefore replace its use with a Perl script generated by s2p, which is equivalent to the sed script. The latter is retained for non-MSVC builds to avoid creating a new hard dependency on Perl for non-Windows tarball builds. Backpatch to all live branches. Michael Paquier and me.	2016-03-19 18:36:35 -04:00
Tom Lane	21c8ee7946	Sync backend/parser/scan.l with bin/psql/psqlscan.l. Make some minor formatting adjustments to make it easier to diff these files and see that they indeed implement the same flex rules (at least to the extent that we want them to be the same). (Someday it'd be nice to make ecpg's pgc.l more easily diff'able too, but today is not that day.) Also run relevant parts of these files and psqlscanslash.l through pgindent. No actual behavioral changes here, just obsessive neatnik-ism.	2016-03-19 14:36:22 -04:00
Tom Lane	72b1e3a21f	Build backend/parser/scan.l and interfaces/ecpg/preproc/pgc.l standalone. Now that we know about the %top{} trick, we can revert to building flex lexers as separate .o files. This is worth doing for a couple of reasons besides sheer cleanliness. We can narrow the scope of the -Wno-error flag that's forced on scan.c. Also, since these grammar and lexer files are so large, splitting them into separate build targets should have some advantages in build speed, particularly in parallel or ccache'd builds. We have quite a few other .l files that could be changed likewise, but the above arguments don't apply to them, so the benefit of fixing them seems pretty minimal. Leave the rest for some other day.	2016-03-19 12:07:24 -04:00
Peter Eisentraut	9a83564c58	Allow SSL server key file to have group read access if owned by root We used to require the server key file to have permissions 0600 or less for best security. But some systems (such as Debian) have certificate and key files managed by the operating system that can be shared with other services. In those cases, the "postgres" user is made a member of a special group that has access to those files, and the server key file has permissions 0640. To accommodate that kind of setup, also allow the key file to have permissions 0640 but only if owned by root. From: Christoph Berg <myon@debian.org> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>	2016-03-19 11:03:22 +01:00
Andres Freund	6eb2be15b5	Fix stupid omission in `c4901a1e`. Reported-By: Jeff Janes Discussion: CAMkU=1zGxREwoyaCrp_CHadEB+dPgpVyKBysCJ+6xP9gCOvAuw@mail.gmail.com	2016-03-18 22:37:59 -07:00
Tom Lane	07aed46a6b	Fix missed update in _readForeignScan(). Blatant fail in `0bf3ae88af`. Caught by buildfarm member mandrill.	2016-03-19 01:20:34 -04:00
Peter Eisentraut	b555ed8102	Merge wal_level "archive" and "hot_standby" into new name "replica" The distinction between "archive" and "hot_standby" existed only because at the time "hot_standby" was added, there was some uncertainty about stability. This is now a long time ago. We would like to move forward with simplifying the replication configuration, but this distinction is in the way, because a primary server cannot tell (without asking a standby or predicting the future) which one of these would be the appropriate level. Pick a new name for the combined setting to make it clearer that it covers all (non-logical) backup and replication uses. The old values are still accepted but are converted internally. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reviewed-by: David Steele <david@pgmasters.net>	2016-03-18 23:56:03 +01:00
Robert Haas	08a6d36dcb	Use INT64_FORMAT instead of %ld for int64. Commit `0011c0091e` introduced this mistake. Patch by me. Reported by Andres Freund, who also reviewed the patch.	2016-03-18 14:54:09 -04:00
Andres Freund	c4901a1e03	Only clear latch self-pipe/event if there is a pending notification. This avoids a good number of, individually quite fast, system calls in scenarios with many quick queries. Besides the aesthetic benefit of seing fewer superflous system calls with strace, it also improves performance by ~2% measured by pgbench -M prepared -c 96 -j 8 -S (scale 100). Without having benchmarked it, this patch also adjust the windows code, as that makes it easier to unify the unix/windows codepaths in a later patch. There's little reason to diverge in behaviour between the platforms. Discussion: CA+TgmoYc1Zm+Szoc_Qbzi92z2c1vRHZmjhfPn5uC=w8bXv6Avg@mail.gmail.com Reviewed-By: Robert Haas	2016-03-18 11:47:05 -07:00
Andres Freund	c17966201c	Make it easier to choose the used waiting primitive in unix_latch.c. This allows for easier testing of the different primitives; in preparation for adding a new primitive. Discussion: 20160114143931.GG10941@awork2.anarazel.de Reviewed-By: Robert Haas	2016-03-18 11:46:54 -07:00
Andres Freund	6bc4d95fcc	Error out if waiting on socket readiness without a specified socket. Previously we just ignored such an attempt, but that seems to serve no purpose but making things harder to debug. Discussion: 20160114143931.GG10941@awork2.anarazel.de 20151230173734.hx7jj2fnwyljfqek@alap3.anarazel.de Reviewed-By: Robert Haas	2016-03-18 11:46:45 -07:00
Robert Haas	0bf3ae88af	Directly modify foreign tables. postgres_fdw can now sent an UPDATE or DELETE statement directly to the foreign server in simple cases, rather than sending a SELECT FOR UPDATE statement and then updating or deleting rows one-by-one. Etsuro Fujita, reviewed by Rushabh Lathia, Shigeru Hanada, Kyotaro Horiguchi, Albe Laurenz, Thom Brown, and me.	2016-03-18 13:55:52 -04:00
Teodor Sigaev	3187d6de0e	Introduce parse_ident() SQL-layer function to split qualified identifier into array parts. Author: Pavel Stehule with minor editorization by me and Jim Nasby	2016-03-18 18:16:14 +03:00
Robert Haas	992b5ba30d	Push scan/join target list beneath Gather when possible. This means that, for example, "SELECT expensive_func(a) FROM bigtab WHERE something" can compute expensive_func(a) in the workers rather than the leader if it happens to be parallel-safe, which figures to be a big win in some practical cases. Currently, we can only do this if the entire target list is parallel-safe. If we worked harder, we might be able to evaluate parallel-safe targets in the worker and any parallel-restricted targets in the leader, but that would be more complicated, and there aren't that many parallel-restricted functions that people are likely to use in queries anyway. I think. So just do the simple thing for the moment. Robert Haas, Amit Kapila, and Tom Lane	2016-03-18 09:50:05 -04:00
Robert Haas	2d8a1e22b1	Various minor corrections of and improvements to comments. Aleksander Alekseev	2016-03-18 09:38:59 -04:00
Tom Lane	bd0ab28912	Remove useless double calls of make_parsestate(). Aleksander Alekseev	2016-03-17 16:46:35 -04:00
Robert Haas	c27033ff7c	Update tuplesort.c comments for memory mangement improvements. I'm committing these changes separately so that it's clear what is Peter's original work versus what I changed. This is a followup to commit `0011c0091e`, and these changes are all by me.	2016-03-17 16:11:14 -04:00
Robert Haas	0011c0091e	Improve memory management for external sorts. Introduce a new memory context which stores tuple data, and reset it at the end of each merge pass; this helps avoid memory fragmentation and, consequently, overallocation. Also, for the final merge patch, eliminate memory context chunk header overhead entirely by allocating all of the memory used for buffering tuples during the merge in a single chunk. Since this modestly increases the number of tuples we can store, grow the memtuples array a bit so that we're less likely to run short of slots there. Peter Geoghegan. Review and testing of patches in this series by Jeff Janes, Greg Stark, Mithun Cy, and me.	2016-03-17 16:10:41 -04:00
Tom Lane	55c3a04d60	Fix assorted breakage in to_char()'s OF format option. In HEAD, fix incorrect field width for hours part of OF when tm_gmtoff is negative. This was introduced by commit `2d87eedc1d` as a result of falsely applying a pattern that's correct when + signs are omitted, which is not the case for OF. In 9.4, fix missing abs() call that allowed a sign to be attached to the minutes part of OF. This was fixed in 9.5 by `9b43d73b3f`, but for inscrutable reasons not back-patched. In all three versions, ensure that the sign of tm_gmtoff is correctly reported even when the GMT offset is less than 1 hour. Add regression tests, which evidently we desperately need here. Thomas Munro and Tom Lane, per report from David Fetter	2016-03-17 15:50:33 -04:00
Teodor Sigaev	f4ceed6ceb	Improve support of Hunspell - allow to use non-ascii characters as affix flag. Non-numeric affix flags now are stored as string instead of numeric value of character. - allow to use 0 as affix flag in numeric encoded affixes That adds support for arabian, hungarian, turkish and brazilian portuguese languages. Author: Artur Zakirov with heavy editorization by me	2016-03-17 17:23:38 +03:00
Robert Haas	0218e8b3fa	Fix typos. Jim Nasby	2016-03-17 07:26:20 -04:00
Peter Eisentraut	fc201dfd95	Add syslog_split_messages parameter Reviewed-by: Andreas Karlsson <andreas@proxel.se>	2016-03-16 23:21:44 -04:00
Peter Eisentraut	f4c454e9ba	Add syslog_sequence_numbers parameter Reviewed-by: Andreas Karlsson <andreas@proxel.se>	2016-03-16 23:21:44 -04:00
Tom Lane	5db5146431	Fix j2day() to behave sanely for negative Julian dates. Somebody had apparently once figured that casting to unsigned int would produce the right output for negative inputs, but that would only be true if 2^32 were a multiple of 7, which of course it ain't. We need to use a signed division and then correct the sign of the remainder. AFAICT, the only case where this would arise currently is when doing ISO-week calculations for dates in 4714BC, where we'd compute a negative Julian date representing 4714-01-04BC and then do some arithmetic with it. Since we don't even really document support for such dates, this is not of much consequence. But we may as well get it right. Per report from Vitaly Burovoy.	2016-03-16 20:57:45 -04:00
Tom Lane	a70e13a39e	Be more careful about out-of-range dates and timestamps. Tighten the semantics of boundary-case timestamptz so that we allow timestamps >= '4714-11-24 00:00+00 BC' and < 'ENDYEAR-01-01 00:00+00 AD' exactly, no more and no less, but it is allowed to enter timestamps within that range using non-GMT timezone offsets (which could make the nominal date 4714-11-23 BC or ENDYEAR-01-01 AD). This eliminates dump/reload failure conditions for timestamps near the endpoints. To do this, separate checking of the inputs for date2j() from the final range check, and allow the Julian date code to handle a range slightly wider than the nominal range of the datatypes. Also add a bunch of checks to detect out-of-range dates and timestamps that formerly could be returned by operations such as date-plus-integer. All C-level functions that return date, timestamp, or timestamptz should now be proof against returning a value that doesn't pass IS_VALID_DATE() or IS_VALID_TIMESTAMP(). Vitaly Burovoy, reviewed by Anastasia Lubennikova, and substantially whacked around by me	2016-03-16 19:09:28 -04:00
Robert Haas	f2b74b01d4	Another comment update. I thought this was in my last commit, but I goofed.	2016-03-16 14:28:25 -04:00
Robert Haas	bc55cc0b6a	Fix problems in commit `c16dc1aca5`. Vinayak Pokale provided a patch for a copy-and-paste error in a comment. I noticed that I'd use the word "automatically" nearby where I meant to talk about things being "atomic". Rahila Syed spotted a misplaced counter update. Fix all that stuff.	2016-03-16 13:54:04 -04:00
Robert Haas	c6dda1f48e	Add idle_in_transaction_session_timeout. Vik Fearing, reviewed by Stéphane Schildknecht and me, and revised slightly by me.	2016-03-16 11:30:45 -04:00
Peter Eisentraut	f9e5ed61ed	UCS_to_EUC_JIS_2004.pl: Turn off "test" mode by default It produces debugging output files that are of no further use, so we don't need that by default.	2016-03-16 10:43:05 -04:00
Peter Eisentraut	9dbcb500ca	Make spacing and punctuation consistent	2016-03-16 10:43:05 -04:00
Robert Haas	3aff33aa68	Fix typos. Oskari Saarenmaa	2016-03-15 18:06:11 -04:00
Stephen Frost	fd658dbb30	Avoid incorrectly indicating exclusion constraint wait INSERT ... ON CONFLICT's precheck may have to wait on the outcome of another insertion, which may or may not itself be a speculative insertion. This wait is not necessarily associated with an exclusion constraint, but was always reported that way in log messages if the wait happened to involve a tuple that had no speculative token. Initially discovered through use of ON CONFLICT DO NOTHING, where spurious references to exclusion constraints in log messages were more likely. Patch by Peter Geoghegan. Reviewed by Julien Rouhaud. Back-patch to 9.5 where INSERT ... ON CONFLICT was added.	2016-03-15 18:04:39 -04:00
Alvaro Herrera	5bcc413f80	Fix typos in comments	2016-03-15 17:57:17 -03:00
Robert Haas	c16dc1aca5	Add simple VACUUM progress reporting. There's a lot more that could be done here yet - in particular, this reports only very coarse-grained information about the index vacuuming phase - but even as it stands, the new pg_stat_progress_vacuum can tell you quite a bit about what a long-running vacuum is actually doing. Amit Langote and Robert Haas, based on earlier work by Vinayak Pokale and Rahila Syed.	2016-03-15 13:32:56 -04:00
Tom Lane	101fd9349e	Add a GetForeignUpperPaths callback function for FDWs. This is basically like the just-added create_upper_paths_hook, but control is funneled only to the FDW responsible for all the baserels of the current query; so providing such a callback is much less likely to add useless overhead than using the hook function is. The documentation is a bit sketchy. We'll likely want to improve it, and/or adjust the call conventions, when we get some experience with actually using this callback. Hopefully somebody will find time to experiment with it before 9.6 feature freeze.	2016-03-14 20:04:48 -04:00
Robert Haas	270b7daf5c	Fix EXPLAIN ANALYZE SELECT INTO not to choose a parallel plan. We don't support any parallel write operations at present, so choosing a parallel plan causes us to error out. Also, add a new regression test that uses EXPLAIN ANALYZE SELECT INTO; if we'd had this previously, force_parallel_mode testing would have caught this issue. Mithun Cy and Robert Haas	2016-03-14 19:48:46 -04:00
Tom Lane	5864d6a4b6	Provide a planner hook at a suitable place for creating upper-rel Paths. In the initial revision of the upper-planner pathification work, the only available way for an FDW or custom-scan provider to inject Paths representing post-scan-join processing was to insert them during scan-level GetForeignPaths or similar processing. While that's not impossible, it'd require quite a lot of duplicative processing to look forward and see if the extension would be capable of implementing the whole query. To improve matters for custom-scan providers, provide a hook function at the point where the core code is about to start filling in upperrel Paths. At this point Paths are available for the whole scan/join tree, which should reduce the amount of redundant effort considerably. (An alternative design that was suggested was to provide a separate hook for each post-scan-join processing step, but that seems messy and not clearly more useful.) Following our time-honored tradition, there's no documentation for this hook outside the source code. As-is, this hook is only meant for custom scan providers, which we can't assume very much about. A followon patch will implement an FDW callback to let FDWs do the same thing in a somewhat more structured fashion.	2016-03-14 19:23:29 -04:00
Tom Lane	28048cbaa2	Allow callers of create_foreignscan_path to specify nondefault PathTarget. Although the default choice of rel->reltarget should typically be sufficient for scan or join paths, it's not at all sufficient for the purposes PathTargets were invented for; in particular not for upper-relation Paths. So break API compatibility by adding a PathTarget argument to create_foreignscan_path(). To ease updating of existing code, accept a NULL value of the argument as selecting rel->reltarget.	2016-03-14 17:31:28 -04:00
Tom Lane	307c78852f	Rethink representation of PathTargets. In commit `19a541143a` I did not make PathTarget a subtype of Node, and embedded a RelOptInfo's reltarget directly into it rather than having a separately-allocated Node. In hindsight that was misguided micro-optimization, enabled by the fact that at that point we didn't have any Paths with custom PathTargets. Now that PathTarget processing has been fleshed out some more, it's easier to see that it's better to have PathTarget as an indepedent Node type, even if it does cost us one more palloc to create a RelOptInfo. So change it while we still can. This commit just changes the representation, without doing anything more interesting than that.	2016-03-14 16:59:59 -04:00
Robert Haas	6be84eeb8d	Update more comments for `96198d94cb`. Etsuro Fujita, reviewed (though not completely endorsed) by Ashutosh Bapat, and slightly expanded by me.	2016-03-14 14:29:12 -04:00
Tom Lane	74a379b984	Use repalloc_huge() to enlarge a SPITupleTable's tuple pointer array. Commit `23a27b039d` widened the rows-stored counters to uint64, but that's academic unless we allow the tuple pointer array to exceed 1GB. (It might be a good idea to provide some other limit on how much storage a SPITupleTable can eat. On the other hand, there are plenty of other ways to drive a backend into swap hell.) Dagfinn Ilmari Mannsåker	2016-03-14 14:22:34 -04:00
Robert Haas	3adf9ced17	Improve check for overly-long extensible node name. The old code is bad for two reasons. First, it has an off-by-one error. Second, it won't help if you aren't running with assertions enabled. Per discussion, we want a check here in that case too. Author: KaiGai Kohei, adjusted by me. Reviewed-by: Petr Jelinek Discussion: 56E0D547.1030101@2ndquadrant.com	2016-03-14 13:52:52 -04:00
Tom Lane	ab4ff2889d	Fix memory leak in repeated GIN index searches. Commit `d88976cfa1` removed this code from ginFreeScanKeys(): - if (entry->list) - pfree(entry->list); evidently in the belief that that ItemPointer array is allocated in the keyCtx and so would be reclaimed by the following MemoryContextReset. Unfortunately, it isn't and it won't. It'd likely be a good idea for that to become so, but as a simple and back-patchable fix in the meantime, restore this code to ginFreeScanKeys(). Also, add a similar pfree to where startScanEntry() is about to zero out entry->list. I am not sure if there are any code paths where this change prevents a leak today, but it seems like cheap future-proofing. In passing, make the initial allocation of so->entries[] use palloc not palloc0. The code doesn't depend on unused entries being zero; if it did, the array-enlargement code in ginFillScanEntry() would be wrong. So using palloc0 initially can only serve to confuse readers about what the invariant is. Per report from Felipe de Jesús Molina Bravo, via Jaime Casanova in <CAJGNTeMR1ndMU2Thpr8GPDUfiHTV7idELJRFusA5UXUGY1y-eA@mail.gmail.com>	2016-03-13 16:44:31 -04:00
Peter Eisentraut	96adb14d93	Fix whitespace and remove obsolete gitattributes entry	2016-03-13 16:03:13 -04:00
Tom Lane	4b980167cb	Report memory context stats upon out-of-memory in repalloc[_huge]. This longstanding functionality evidently got lost in commit `3d6d1b5855`. Noted while studying an OOM report from Jaime Casanova. Backpatch to 9.5 where the bug was introduced.	2016-03-13 00:21:07 -05:00
Tom Lane	ab737f6ba9	Fix Windows portability issue in `23a27b039d`. _strtoui64() is available in MSVC builds, but apparently not with other Windows toolchains. Thanks to Petr Jelinek for the diagnosis.	2016-03-12 22:34:47 -05:00
Tom Lane	23a27b039d	Widen query numbers-of-tuples-processed counters to uint64. This patch widens SPI_processed, EState's es_processed field, PortalData's portalPos field, FuncCallContext's call_cntr and max_calls fields, ExecutorRun's count argument, PortalRunFetch's result, and the max number of rows in a SPITupleTable to uint64, and deals with (I hope) all the ensuing fallout. Some of these values were declared uint32 before, and others "long". I also removed PortalData's posOverflow field, since that logic seems pretty useless given that portalPos is now always 64 bits. The user-visible results are that command tags for SELECT etc will correctly report tuple counts larger than 4G, as will plpgsql's GET GET DIAGNOSTICS ... ROW_COUNT command. Queries processing more tuples than that are still not exactly the norm, but they're becoming more common. Most values associated with FETCH/MOVE distances, such as PortalRun's count argument and the count argument of most SPI functions that have one, remain declared as "long". It's not clear whether it would be worth promoting those to int64; but it would definitely be a large dollop of additional API churn on top of this, and it would only help 32-bit platforms which seem relatively less likely to see any benefit. Andreas Scherbaum, reviewed by Christian Ullrich, additional hacking by me	2016-03-12 16:05:29 -05:00
Andres Freund	e01157500f	Include portability/mem.h into fd.c for MAP_FAILED. Buildfarm members gaur and pademelon are old enough not to know about MAP_FAILED; which is used in `428b1d6`. Include portability/mem.h to fix; as already done in a bunch of other places.	2016-03-12 12:16:48 -08:00
Tom Lane	570be1f73f	Re-export a few of createplan.c's make_xxx() functions. CitusDB is using these and don't wish to redesign their code right now. I am not on board with this being a good idea, or a good precedent, but I lack the energy to fight about it.	2016-03-12 12:12:59 -05:00
Tom Lane	9118d03a8c	When appropriate, postpone SELECT output expressions till after ORDER BY. It is frequently useful for volatile, set-returning, or expensive functions in a SELECT's targetlist to be postponed till after ORDER BY and LIMIT are done. Otherwise, the functions might be executed for every row of the table despite the presence of LIMIT, and/or be executed in an unexpected order. For example, in SELECT x, nextval('seq') FROM tab ORDER BY x LIMIT 10; it's probably desirable that the nextval() values are ordered the same as x, and that nextval() is not run more than 10 times. In the past, Postgres was inconsistent in this area: you would get the desirable behavior if the ordering were performed via an indexscan, but not if it had to be done by an explicit sort step. Getting the desired behavior reliably required contortions like SELECT x, nextval('seq') FROM (SELECT x FROM tab ORDER BY x) ss LIMIT 10; This patch conditionally postpones evaluation of pure-output target expressions (that is, those that are not used as DISTINCT, ORDER BY, or GROUP BY columns) so that they effectively occur after sorting, even if an explicit sort step is necessary. Volatile expressions and set-returning expressions are always postponed, so as to provide consistent semantics. Expensive expressions (costing more than 10 times typical operator cost, which by default would include any user-defined function) are postponed if there is a LIMIT or if there are expressions that must be postponed. We could be more aggressive and postpone any nontrivial expression, but there are costs associated with doing so: it requires an extra Result plan node which adds some overhead, and postponement changes the volume of data going through the sort step, perhaps for the worse. Since we tend not to have very good estimates of the output width of nontrivial expressions, it's hard to have much confidence in our ability to predict whether postponement would increase or decrease the cost of the sort; therefore this patch doesn't attempt to make decisions conditionally on that. Between these factors and a general desire not to change query behavior when there's not a demonstrable benefit, it seems best to be conservative about applying postponement. We might tweak the decision rules in the future, though. Konstantin Knizhnik, heavily rewritten by me	2016-03-11 12:27:50 -05:00
Teodor Sigaev	b1fdc727c3	Fix Windows build broken in `6943a946c7` Also it fixes dynamic array allocation disallowed by ANSI-C. Author: Stas Kelvich	2016-03-11 20:10:20 +03:00
Teodor Sigaev	8829af47ef	Fix merge affixes for numeric ones Some dictionaries have duplicated base words with different affix set, we just merge that sets into one set. But previously merging of sets of affixes was actually a concatenation of strings but it's wrong for numeric representation of affixes because such representation uses comma to separate affixes. Author: Artur Zakirov	2016-03-11 19:47:50 +03:00
Teodor Sigaev	6943a946c7	Tsvector editing functions Adds several tsvector editting function: convert tsvector to/from text array, set weight for given lexemes, delete lexeme(s), unnest, filter lexemes with given weights Author: Stas Kelvich with some editorization by me Reviewers: Tomas Vondram, Teodor Sigaev	2016-03-11 19:22:36 +03:00
Tom Lane	49635d7b3e	Minor additional refactoring of planner.c's PathTarget handling. Teach make_group_input_target() and make_window_input_target() to work entirely with the PathTarget representation of tlists, rather than constructing a tlist and immediately deconstructing it into PathTarget format. In itself this only saves a few palloc's; the bigger picture is that it opens the door for sharing cost_qual_eval work across all of planner.c's constructions of PathTargets. I'll come back to that later. In support of this, flesh out tlist.c's infrastructure for PathTargets a bit more.	2016-03-11 10:24:55 -05:00
Robert Haas	481c76abf4	Fix a typo, and remove unnecessary pgstat_report_wait_end(). Per Amit Kapila.	2016-03-11 07:34:00 -05:00
Simon Riggs	73e7e49da3	Allow emit_log_hook to see original message text emit_log_hook could only see the translated text, making it harder to identify which message was being sent. Pass original text to allow the exact message to be identified, whichever language is used for logging. Discussion: 20160216.184755.59721141.horiguchi.kyotaro@lab.ntt.co.jp Author: Kyotaro Horiguchi	2016-03-11 09:53:06 +00:00
Robert Haas	a414d96ad2	Simplify GetLockNameFromTagType. The old code is wrong, because it returns a pointer to an automatic variable. And it's also more clever than we really need to be considering that the case it's worrying about should never happen.	2016-03-10 21:37:22 -05:00
Andres Freund	c94f0c29ce	Blindly try to fix dtrace enabled builds, broken in `9cd00c45`. Reported-By: Peter Eisentraut Discussion: 56E2239E.1050607@gmx.net	2016-03-10 17:51:03 -08:00
Andres Freund	9cd00c457e	Checkpoint sorting and balancing. Up to now checkpoints were written in the order they're in the BufferDescriptors. That's nearly random in a lot of cases, which performs badly on rotating media, but even on SSDs it causes slowdowns. To avoid that, sort checkpoints before writing them out. We currently sort by tablespace, relfilenode, fork and block number. One of the major reasons that previously wasn't done, was fear of imbalance between tablespaces. To address that balance writes between tablespaces. The other prime concern was that the relatively large allocation to sort the buffers in might fail, preventing checkpoints from happening. Thus pre-allocate the required memory in shared memory, at server startup. This particularly makes it more efficient to have checkpoint flushing enabled, because that'll often result in a lot of writes that can be coalesced into one flush. Discussion: alpine.DEB.2.10.1506011320000.28433@sto Author: Fabien Coelho and Andres Freund	2016-03-10 17:05:09 -08:00
Andres Freund	428b1d6b29	Allow to trigger kernel writeback after a configurable number of writes. Currently writes to the main data files of postgres all go through the OS page cache. This means that some operating systems can end up collecting a large number of dirty buffers in their respective page caches. When these dirty buffers are flushed to storage rapidly, be it because of fsync(), timeouts, or dirty ratios, latency for other reads and writes can increase massively. This is the primary reason for regular massive stalls observed in real world scenarios and artificial benchmarks; on rotating disks stalls on the order of hundreds of seconds have been observed. On linux it is possible to control this by reducing the global dirty limits significantly, reducing the above problem. But global configuration is rather problematic because it'll affect other applications; also PostgreSQL itself doesn't always generally want this behavior, e.g. for temporary files it's undesirable. Several operating systems allow some control over the kernel page cache. Linux has sync_file_range(2), several posix systems have msync(2) and posix_fadvise(2). sync_file_range(2) is preferable because it requires no special setup, whereas msync() requires the to-be-flushed range to be mmap'ed. For the purpose of flushing dirty data posix_fadvise(2) is the worst alternative, as flushing dirty data is just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages from the page cache. Thus the feature is enabled by default only on linux, but can be enabled on all systems that have any of the above APIs. While desirable and likely possible this patch does not contain an implementation for windows. With the infrastructure added, writes made via checkpointer, bgwriter and normal user backends can be flushed after a configurable number of writes. Each of these sources of writes controlled by a separate GUC, checkpointer_flush_after, bgwriter_flush_after and backend_flush_after respectively; they're separate because the number of flushes that are good are separate, and because the performance considerations of controlled flushing for each of these are different. A later patch will add checkpoint sorting - after that flushes from the ckeckpoint will almost always be desirable. Bgwriter flushes are most of the time going to be random, which are slow on lots of storage hardware. Flushing in backends works well if the storage and bgwriter can keep up, but if not it can have negative consequences. This patch is likely to have negative performance consequences without checkpoint sorting, but unfortunately so has sorting without flush control. Discussion: alpine.DEB.2.10.1506011320000.28433@sto Author: Fabien Coelho and Andres Freund	2016-03-10 17:04:34 -08:00
Tom Lane	c82c92b111	Give pull_var_clause() reject/recurse/return behavior for WindowFuncs too. All along, this function should have treated WindowFuncs in a manner similar to Aggrefs, ie with an option whether or not to recurse into them. By not considering the case, it was always recursing, which is OK for most callers (although I suspect that the case in prepare_sort_from_pathkeys might represent a bug). But now we need return-without-recursing behavior as well. There are also more than a few callers that should never see a WindowFunc, and now we'll get some error checking on that.	2016-03-10 16:23:52 -05:00
Robert Haas	fd31cd2651	Don't vacuum all-frozen pages. Commit `a892234f83` gave us enough infrastructure to avoid vacuuming pages where every tuple on the page is already frozen. So, replace the notion of a scan_all or whole-table vacuum with the less onerous notion of an "aggressive" vacuum, which will pages that are all-visible, but still skip those that are all-frozen. This should greatly reduce the cost of anti-wraparound vacuuming on large clusters where the majority of data is never touched between one cycle and the next, because we'll no longer have to read all of those pages only to find out that we don't need to do anything with them. Patch by me, reviewed by Masahiko Sawada.	2016-03-10 16:14:42 -05:00
Tom Lane	364a9f47ab	Refactor pull_var_clause's API to make it less tedious to extend. In commit `1d97c19a0f` and later `c1d9579dd8`, we extended pull_var_clause's API by adding enum-type arguments. That's sort of a pain to maintain, though, because it means every time we add a new behavior we must touch every last one of the call sites, even if there's a reasonable default behavior that most of them could use. Let's switch over to using a bitmask of flags, instead; that seems more maintainable and might save a nanosecond or two as well. This commit changes no behavior in itself, though I'm going to follow it up with one that does add a new behavior. In passing, remove flatten_tlist(), which has not been used since 9.1 and would otherwise need the same API changes. Removing these enums means that optimizer/tlist.h no longer needs to depend on optimizer/var.h. Changing that caused a number of C files to need addition of #include "optimizer/var.h" (probably we can thank old runs of pgrminclude for that); but on balance it seems like a good change anyway.	2016-03-10 15:53:07 -05:00
Simon Riggs	37c54863cf	Rework wait for AccessExclusiveLocks on Hot Standby Earlier version committed in 9.0 caused spurious waits in some cases. New infrastructure for lock waits in 9.3 used to correct and improve this. Jeff Janes based upon a proposal by Simon Riggs, who also reviewed Additional review comments from Amit Kapila	2016-03-10 19:26:24 +00:00
Robert Haas	53be0b1add	Provide much better wait information in pg_stat_activity. When a process is waiting for a heavyweight lock, we will now indicate the type of heavyweight lock for which it is waiting. Also, you can now see when a process is waiting for a lightweight lock - in which case we will indicate the individual lock name or the tranche, as appropriate - or for a buffer pin. Amit Kapila, Ildus Kurbangaliev, reviewed by me. Lots of helpful discussion and suggestions by many others, including Alexander Korotkov, Vladimir Borodin, and many others.	2016-03-10 12:44:09 -05:00
Magnus Hagander	9d90388247	Avoid crash on old Windows with AVX2-capable CPU for VS2013 builds The Visual Studio 2013 CRT generates invalid code when it makes a 64-bit build that is later used on a CPU that supports AVX2 instructions using a version of Windows before 7SP1/2008R2SP1. Detect this combination, and in those cases turn off the generation of FMA3, per recommendation from the Visual Studio team. The bug is actually in the CRT shipping with Visual Studio 2013, but Microsoft have stated they're only fixing it in newer major versions. The fix is therefor conditioned specifically on being built with this version of Visual Studio, and not previous or later versions. Author: Christian Ullrich	2016-03-10 14:10:18 +01:00
Simon Riggs	e0694cf9c7	Reduce size of two phase file header Previously 2PC header was fixed at 200 bytes, which in most cases wasted WAL space for a workload using 2PC heavily. Pavan Deolasee, reviewed by Petr Jelinek	2016-03-10 12:51:46 +00:00
Simon Riggs	fcb4bfddb6	Reduce lock level for altering fillfactor Fabrízio de Royes Mello and Simon Riggs	2016-03-10 12:07:33 +00:00
Robert Haas	090b287fc5	Code review for `b6fb6471f6`. Reports by Tomas Vondra, Vinayak Pokale, and Aleksander Alekseev. Patch by Amit Langote.	2016-03-10 06:07:57 -05:00
Tom Lane	cc402116ca	Remove a couple of useless pstrdup() calls. There's no point in pstrdup'ing the result of TextDatumGetCString, since that's necessarily already a freshly-palloc'd C string. These particular calls are unlikely to be of any consequence performance-wise, but still they're a bad precedent that can confuse future patch authors. Noted by Chapman Flack.	2016-03-09 23:29:05 -05:00
Andres Freund	1d4a0ab19a	Avoid unlikely data-loss scenarios due to rename() without fsync. Renaming a file using rename(2) is not guaranteed to be durable in face of crashes. Use the previously added durable_rename()/durable_link_or_rename() in various places where we previously just renamed files. Most of the changed call sites are arguably not critical, but it seems better to err on the side of too much durability. The most prominent known case where the previously missing fsyncs could cause data loss is crashes at the end of a checkpoint. After the actual checkpoint has been performed, old WAL files are recycled. When they're filled, their contents are fdatasynced, but we did not fsync the containing directory. An OS/hardware crash in an unfortunate moment could then end up leaving that file with its old name, but new content; WAL replay would thus not replay it. Reported-By: Tomas Vondra Author: Michael Paquier, Tomas Vondra, Andres Freund Discussion: 56583BDD.9060302@2ndquadrant.com Backpatch: All supported branches	2016-03-09 18:53:53 -08:00
Andres Freund	606e0f9841	Introduce durable_rename() and durable_link_or_rename(). Renaming a file using rename(2) is not guaranteed to be durable in face of crashes; especially on filesystems like xfs and ext4 when mounted with data=writeback. To be certain that a rename() atomically replaces the previous file contents in the face of crashes and different filesystems, one has to fsync the old filename, rename the file, fsync the new filename, fsync the containing directory. This sequence is not generally adhered to currently; which exposes us to data loss risks. To avoid having to repeat this arduous sequence, introduce durable_rename(), which wraps all that. Also add durable_link_or_rename(). Several places use link() (with a fallback to rename()) to rename a file, trying to avoid replacing the target file out of paranoia. Some of those rename sequences need to be durable as well. There seems little reason extend several copies of the same logic, so centralize the link() callers. This commit does not yet make use of the new functions; they're used in a followup commit. Author: Michael Paquier, Andres Freund Discussion: 56583BDD.9060302@2ndquadrant.com Backpatch: All supported branches	2016-03-09 18:53:53 -08:00
Tom Lane	a298a1e06f	Fix incorrect handling of NULL index entries in indexed ROW() comparisons. An index search using a row comparison such as ROW(a, b) > ROW('x', 'y') would stop upon reaching a NULL entry in the "b" column, ignoring the fact that there might be non-NULL "b" values associated with later values of "a". This happens because _bt_mark_scankey_required() marks the subsidiary scankey for "b" as required, which is just wrong: it's for a column after the one with the first inequality key (namely "a"), and thus can't be considered a required match. This bit of brain fade dates back to the very beginnings of our support for indexed ROW() comparisons, in 2006. Kind of astonishing that no one came across it before Glen Takahashi, in bug #14010. Back-patch to all supported versions. Note: the given test case doesn't actually fail in unpatched 9.1, evidently because the fix for bug #6278 (i.e., stopping at nulls in either scan direction) is required to make it fail. I'm sure I could devise a case that fails in 9.1 as well, perhaps with something involving making a cursor back up; but it doesn't seem worth the trouble.	2016-03-09 14:51:22 -05:00
Robert Haas	be060cbcd4	Re-pgindent vacuumlazy.c.	2016-03-09 13:51:11 -05:00
Robert Haas	b6fb6471f6	Add a generic command progress reporting facility. Using this facility, any utility command can report the target relation upon which it is operating, if there is one, and up to 10 64-bit counters; the intent of this is that users should be able to figure out what a utility command is doing without having to resort to ugly hacks like attaching strace to a backend. As a demonstration, this adds very crude reporting to lazy vacuum; we just report the target relation and nothing else. A forthcoming patch will make VACUUM report a bunch of additional data that will make this much more interesting. But this gets the basic framework in place. Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao, and Masanori Oyama.	2016-03-09 12:08:58 -05:00
Tom Lane	8776c15c85	Fix incorrect tlist generation in create_gather_plan(). This function is written as though Gather doesn't project; but it does. Even if it did not project, though, we must use build_path_tlist to ensure that the output columns receive correct sortgroupref labeling. Per report from Amit Kapila.	2016-03-09 10:56:46 -05:00
Tom Lane	d31f20e2b5	Fix copy-and-pasteo in comment. Wensheng Zhang	2016-03-09 10:29:14 -05:00
Tom Lane	51c0f63e4d	Improve handling of pathtargets in planner.c. Refactor so that the internal APIs in planner.c deal in PathTargets not targetlists, and establish a more regular structure for deriving the targets needed for successive steps. There is more that could be done here; calculating the eval costs of each successive target independently is both inefficient and wrong in detail, since we won't actually recompute values available from the input node's tlist. But it's no worse than what happened before the pathification rewrite. In any case this seems like a good starting point for considering how to handle Konstantin Knizhnik's function-evaluation-postponement patch.	2016-03-09 01:12:16 -05:00
Tom Lane	9e8b99420f	Improve handling of group-column indexes in GroupingSetsPath. Instead of having planner.c compute a groupColIdx array and store it in GroupingSetsPaths, make create_groupingsets_plan() find the grouping columns by searching in the child plan node's tlist. Although that's probably a bit slower for create_groupingsets_plan(), it's more like the way every other plan node type does this, and it provides positive confirmation that we know which child output columns we're supposed to be grouping on. (Indeed, looking at this now, I'm not at all sure that it wasn't broken before, because create_groupingsets_plan() isn't demanding an exact tlist match from its child node.) Also, this allows substantial simplification in planner.c, because it no longer needs to compute the groupColIdx array at all; no other cases were using it. I'd intended to put off this refactoring until later (like 9.7), but in view of the likely bug fix and the need to rationalize planner.c's tlist handling so we can do something sane with Konstantin Knizhnik's function-evaluation-postponement patch, I think it can't wait.	2016-03-08 22:32:11 -05:00
Peter Eisentraut	a40814d7aa	Handle invalid libpq sockets in more places Also, make error messages consistent. From: Michael Paquier <michael.paquier@gmail.com>	2016-03-08 21:10:33 -05:00
Tom Lane	61fd218930	Fix minor thinko in pathification code. I passed the wrong "root" struct to create_pathtarget in build_minmax_path. Since the subroot is a clone of the outer root, this would not cause any serious problems, but it would waste some cycles because set_pathtarget_cost_width would not have access to Var width estimates set up while running query_planner on the subroot.	2016-03-08 16:50:44 -05:00
Tom Lane	8c314b9853	Finish refactoring make_foo() functions in createplan.c. This patch removes some redundant cost calculations that I left for later cleanup in commit `3fc6e2d7f5`. There's now a uniform policy that the make_foo() convenience functions don't do any cost calculations. Most of their callers copy costs from the source Path node, and for those that don't, the calculation in the make_foo() function wasn't necessarily right anyhow. (make_result() was particularly a mess, as it was serving multiple callers using cost calcs designed for only the first one or two that had ever existed.) Aside from saving a few cycles, this ensures that what EXPLAIN prints matches the costs we used for planning purposes. It does not change any planner decisions, since the decisions are already made.	2016-03-08 16:28:34 -05:00
Robert Haas	7400559a3f	Comment update for fdw_recheck_quals. Commit `5fc4c26db5` could've done a better job updating these comments. Etsuro Fujita	2016-03-08 14:40:55 -05:00
Robert Haas	734f86d50d	Add new flags argument for xl_heap_visible to heap2_desc. Masahiko Sawada	2016-03-08 13:28:22 -05:00
Robert Haas	dcfecaae9e	Fix parallel query on standby servers. Without this fix, it inevitably bombs out with "ERROR: failed to initialize transaction_read_only to 0". Repair. Ashutosh Sharma; comments adjusted by me.	2016-03-08 10:27:03 -05:00
Robert Haas	070140ee48	Add some functions to fd.c for the convenience of extensions. For example, if you want to perform an ioctl() on a file descriptor opened through the fd.c routines, there's no way to do that without being able to get at the underlying fd. KaiGai Kohei	2016-03-08 10:09:50 -05:00
Robert Haas	77a1d1e798	Department of second thoughts: remove PD_ALL_FROZEN. Commit `a892234f83` added a second bit per page to the visibility map, which still seems like a good idea, but it also added a second page-level bit alongside PD_ALL_VISIBLE to track whether the visibility map bit was set. That no longer seems like a clever plan, because we don't really need that bit for anything. We always clear both bits when the page is modified anyway. Patch by me, reviewed by Kyotaro Horiguchi and Masahiko Sawada.	2016-03-08 08:46:48 -05:00
Tom Lane	cf8e7b16a5	Spell "parallel" correctly. Per David Rowley.	2016-03-07 21:48:17 -05:00
Peter Eisentraut	1c2db8c305	Fix uninstall target in tsearch Makefile Artur Zakirov	2016-03-07 20:36:59 -05:00
Andres Freund	b63bea5fd3	Further improvements to `c8f621c43`. Coverity and inspection for the issue addressed in `fd45d16f` found some questionable code. Specifically coverity noticed that the wrong length was added in ReorderBufferSerializeChange() - without immediate negative consequences as the variable isn't used afterwards. During code-review and testing I noticed that a bit of space was wasted when allocating tuple bufs in several places. Thirdly, the debug memset()s in ReorderBufferGetTupleBuf() reduce the error checking valgrind can do. Backpatch: 9.4, like `c8f621c43`.	2016-03-07 14:24:03 -08:00
Tom Lane	3fc6e2d7f5	Make the upper part of the planner work by generating and comparing Paths. I've been saying we needed to do this for more than five years, and here it finally is. This patch removes the ever-growing tangle of spaghetti logic that grouping_planner() used to use to try to identify the best plan for post-scan/join query steps. Now, there is (nearly) independent consideration of each execution step, and entirely separate construction of Paths to represent each of the possible ways to do that step. We choose the best Path or set of Paths using the same add_path() logic that's been used inside query_planner() for years. In addition, this patch removes the old restriction that subquery_planner() could return only a single Plan. It now returns a RelOptInfo containing a set of Paths, just as query_planner() does, and the parent query level can use each of those Paths as the basis of a SubqueryScanPath at its level. This allows finding some optimizations that we missed before, wherein a subquery was capable of returning presorted data and thereby avoiding a sort in the parent level, making the overall cost cheaper even though delivering sorted output was not the cheapest plan for the subquery in isolation. (A couple of regression test outputs change in consequence of that. However, there is very little change in visible planner behavior overall, because the point of this patch is not to get immediate planning benefits but to create the infrastructure for future improvements.) There is a great deal left to do here. This patch unblocks a lot of planner work that was basically impractical in the old code structure, such as allowing FDWs to implement remote aggregation, or rewriting plan_set_operations() to allow consideration of multiple implementation orders for set operations. (The latter will likely require a full rewrite of plan_set_operations(); what I've done here is only to fix it to return Paths not Plans.) I have also left unfinished some localized refactoring in createplan.c and planner.c, because it was not necessary to get this patch to a working state. Thanks to Robert Haas, David Rowley, and Amit Kapila for review.	2016-03-07 15:58:22 -05:00
Andres Freund	fd45d16f62	Fix wrong allocation size in `c8f621c43`. In `c8f621c43` I forgot to account for MAXALIGN when allocating a new tuplebuf in ReorderBufferGetTupleBuf(). That happens to currently not cause active problems on a number of platforms because the affected pointer is already aligned, but others, like ppc and hppa, trigger this in the regression test, due to a debug memset clearing memory. Fix that. Backpatch: 9.4, like the previous commit.	2016-03-06 16:27:20 -08:00
Tom Lane	b3e05097e5	Fix not-terribly-safe coding in NIImportOOAffixes() and NIImportAffixes(). There were two places in spell.c that supposed that they could search for a location in a string produced by lowerstr() and then transpose the offset into the original string. But this fails completely if lowerstr() transforms any characters into characters of different byte length, as can happen in Turkish UTF8 for instance. We'd added some comments about this coding in commit `51e78ab4ff`, but failed to realize that it was not merely confusing but wrong. Coverity complained about this code years ago, but in such an opaque fashion that nobody understood what it was on about. I'm not entirely sure that this issue is what it's on about, actually, but perhaps this patch will shut it up -- and in any case the problem is clear. Back-patch to all supported branches.	2016-03-06 19:20:55 -05:00
Tom Lane	cb0ca0c995	Fix unportable usage of <ctype.h> functions. isdigit(), isspace(), etc are likely to give surprising results if passed a signed char. We should always cast the argument to unsigned char to avoid that. Error in commit `d78a7d9c7f`, found by buildfarm member gaur.	2016-03-06 18:23:53 -05:00
Andres Freund	c8f621c43a	logical decoding: Fix handling of large old tuples with replica identity full. When decoding the old version of an UPDATE or DELETE change, and if that tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or failed in more subtle ways in non-assert builds. Normally individual tuples aren't bigger than MaxHeapTupleSize, with big datums toasted. But that's not the case for the old version of a tuple for logical decoding; the replica identity is logged as one piece. With the default replica identity btree limits that to small tuples, but that's not the case for FULL. Change the tuple buffer infrastructure to separate allocate over-large tuples, instead of always going through the slab cache. This unfortunately requires changing the ReorderBufferTupleBuf definition, we need to store the allocated size someplace. To avoid requiring output plugins to recompile, don't store HeapTupleHeaderData directly after HeapTupleData, but point to it via t_data; that leaves rooms for the allocated size. As there's no reason for an output plugin to look at ReorderBufferTupleBuf->t_data.header, remove the field. It was just a minor convenience having it directly accessible. Reported-By: Adam Dratwiński Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com	2016-03-05 18:02:20 -08:00
Andres Freund	0bda14d54c	logical decoding: old/newtuple in spooled UPDATE changes was switched around. Somehow I managed to flip the order of restoring old & new tuples when de-spooling a change in a large transaction from disk. This happens to only take effect when a change is spooled to disk which has old/new versions of the tuple. That only is the case for UPDATEs where he primary key changed or where replica identity is changed to FULL. The tests didn't catch this because either spooled updates, or updates that changed primary keys, were tested; not both at the same time. Found while adding tests for the following commit. Backpatch: 9.4, where logical decoding was added	2016-03-05 18:02:20 -08:00
Andres Freund	d9e903f3cb	logical decoding: Tell reorderbuffer about all xids. Logical decoding's reorderbuffer keeps transactions in an LSN ordered list for efficiency. To make that's efficiently possible upper-level xids are forced to be logged before nested subtransaction xids. That only works though if these records are all looked at: Unfortunately we didn't do so for e.g. row level locks, which are otherwise uninteresting for logical decoding. This could lead to errors like: "ERROR: subxact logged without previous toplevel record". It's not sufficient to just look at row locking records, the xid could appear first due to a lot of other types of records (which will trigger the transaction to be marked logged with MarkCurrentTransactionIdLoggedIfAny). So invent infrastructure to tell reorderbuffer about xids seen, when they'd otherwise not pass through reorderbuffer.c. Reported-By: Jarred Ward Bug: #13844 Discussion: 20160105033249.1087.66040@wrigleys.postgresql.org Backpatch: 9.4, where logical decoding was added	2016-03-05 18:02:20 -08:00
Joe Conway	dc7d70ea05	Expose control file data via SQL accessible functions. Add four new SQL accessible functions: pg_control_system(), pg_control_checkpoint(), pg_control_recovery(), and pg_control_init() which expose a subset of the control file data. Along the way move the code to read and validate the control file to src/common, where it can be shared by the new backend functions and the original pg_controldata frontend program. Patch by me, significant input, testing, and review by Michael Paquier.	2016-03-05 11:10:19 -08:00
Fujii Masao	d34794f7d5	Ignore recovery_min_apply_delay until recovery has reached consistent state Previously recovery_min_apply_delay was applied even before recovery had reached consistency. This could cause us to wait a long time unexpectedly for read-only connections to be allowed. It's problematic because the standby was useless during that wait time. This patch changes recovery_min_apply_delay so that it's applied once the database has reached the consistent state. That is, even if the delay is set, the standby tries to replay WAL records as fast as possible until it has reached consistency. Author: Michael Paquier Reviewed-By: Julien Rouhaud Reported-By: Greg Clough Backpatch: 9.4, where recovery_min_apply_delay was added Bug: #13770 Discussion: http://www.postgresql.org/message-id/20151111155006.2644.84564@wrigleys.postgresql.org	2016-03-06 02:29:04 +09:00
Robert Haas	708020eb7b	Fix typo in comment. Thomas Munro	2016-03-04 15:46:30 -05:00
Robert Haas	6fcde8a5c8	Minor improvements to transaction manager README. A simple SELECT is handled by PortalRunSelect, not ProcessQuery. Also, the previous indentation was unclear: change it so that a deeper level of indentation indicates that the outer function calls the inner one. Stas Kelvich	2016-03-04 14:12:28 -05:00
Robert Haas	17b124d303	Fix SerializeSnapshot not to overrun the allocated space. Rushabh Lathia	2016-03-04 13:48:36 -05:00
Robert Haas	df4685fb0c	Minor optimizations based on ParallelContext having nworkers_launched. Originally, we didn't have nworkers_launched, so code that used parallel contexts had to be preprared for the possibility that not all of the workers requested actually got launched. But now we can count on knowing the number of workers that were successfully launched, which can shave off a few cycles and simplify some code slightly. Amit Kapila, reviewed by Haribabu Kommi, per a suggestion from Peter Geoghegan.	2016-03-04 12:59:10 -05:00
Robert Haas	546cd0d766	Fix InitializeSessionUserId not to deference NULL rolename pointer. Dmitriy Sarafannikov, reviewed by Michael Paquier and Haribabu Kommi, with a minor fix by me.	2016-03-04 12:28:09 -05:00
Teodor Sigaev	d78a7d9c7f	Improve support of Hunspell in ispell dictionary. Now it's possible to load recent version of Hunspell for several languages. To handle these dictionaries Hunspell patch adds support for: * FLAG long - sets the double extended ASCII character flag type * FLAG num - sets the decimal number flag type (from 1 to 65535) * AF parameter - alias for flag's set Also it moves test dictionaries into separate directory. Author: Artur Zakirov with editorization by me	2016-03-04 20:08:47 +03:00
Peter Eisentraut	1fa2a6b1d4	Add prerequisite for KOI8-U.TXT This was missed when the encoding was added.	2016-03-03 20:44:47 -05:00
Peter Eisentraut	b497abc602	Make some adjustments in variable assignments These variables aren't really used for anything interesting, but it seems the existing grouping was somewhat nonsensical.	2016-03-03 20:44:47 -05:00
Peter Eisentraut	7a4a813c99	Add missing rules related to EUC_JIS_2004 and SHIFT_JIS_2004 encodings This was apparently forgotten in commit `75c6519ff6`.	2016-03-03 20:44:47 -05:00
Simon Riggs	c7111d11b1	Revert buggy optimization of index scans `606c0123d6` attempted to reduce cost of index scans using > and < strategies, though got that completely wrong in a few complex cases. Revert whole patch until we find a safe optimization.	2016-03-03 09:53:43 +00:00
Andres Freund	7c17aac69d	logical decoding: fix decoding of a commit's commit time. When adding replication origins in `5aa235042`, I somehow managed to set the timestamp of decoded transactions to InvalidXLogRecptr when decoding one made without a replication origin. Fix that, and the wrong type of the new commit_time variable. This didn't trigger a regression test failure because we explicitly don't show commit timestamps in the regression tests, as they obviously are variable. Add a test that checks that a decoded commit's timestamp is within minutes of NOW() from before the commit. Reported-By: Weiping Qu Diagnosed-By: Artur Zakirov Discussion: 56D4197E.9050706@informatik.uni-kl.de, 56D42918.1010108@postgrespro.ru Backpatch: 9.5, where `5aa235042` originates.	2016-03-02 23:42:21 -08:00
Tom Lane	a9d199f6d3	Fix json_to_record() bug with nested objects. A thinko concerning nesting depth caused json_to_record() to produce bogus output if a field of its input object contained a sub-object with a field name matching one of the requested output column names. Per bug #13996 from Johann Visagie. I added a regression test case based on his example, plus parallel tests for json_to_recordset, jsonb_to_record, jsonb_to_recordset. The latter three do not exhibit the same bug (which suggests that we may be missing some opportunities to share code...) but testing seems like a good idea in any case. Back-patch to 9.4 where these functions were introduced.	2016-03-02 23:31:39 -05:00
Tom Lane	eb43e851d6	Create stub functions to support pg_upgrade of old contrib/tsearch2. Commits `9ff60273e3` and `dbe2328959` adjusted the declarations of some core functions referenced by contrib/tsearch2's install script, forgetting that in a pg_upgrade situation, we'll be trying to restore operator class definitions that reference the old signatures. We've hit this problem before; solve it in the same way as before, namely by installing stub functions that have the expected signature and just invoke the correct function. Per report from Jeff Janes. (Someday we ought to stop supporting contrib/tsearch2, but I'm not sure today is that day.)	2016-03-02 17:37:54 -05:00
Robert Haas	a892234f83	Change the format of the VM fork to add a second bit per page. The new bit indicates whether every tuple on the page is already frozen. It is cleared only when the all-visible bit is cleared, and it can be set only when we vacuum a page and find that every tuple on that page is both visible to every transaction and in no need of any future vacuuming. A future commit will use this new bit to optimize away full-table scans that would otherwise be triggered by XID wraparound considerations. A page which is merely all-visible must still be scanned in that case, but a page which is all-frozen need not be. This commit does not attempt that optimization, although that optimization is the goal here. It seems better to get the basic infrastructure in place first. Per discussion, it's very desirable for pg_upgrade to automatically migrate existing VM forks from the old format to the new format. That, too, will be handled in a follow-on patch. Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit Kapila, Simon Riggs, Andres Freund, and others, and substantially revised by me.	2016-03-01 21:49:41 -05:00
Robert Haas	aec64e8f45	Fix mistake in extensible node code. I believe that I (rhaas) introduced this bug while editing the patch that became `bcac23de73`. Report and patch from KaiGai Kohei.	2016-03-01 13:17:09 -05:00
Peter Eisentraut	bd6cf3f237	Add Unicode map generation scripts as rule prerequisites That way, the rules will trigger when the scripts change.	2016-02-29 21:19:28 -05:00
Peter Eisentraut	cc074bf6c1	Fix comments Some of these comments were copied and pasted without updating them, some of them were duplicates.	2016-02-29 21:19:24 -05:00
Peter Eisentraut	9a3e06baa2	UCS_to_most.pl: Make executable, for consistency with other scripts	2016-02-29 21:19:17 -05:00
Tom Lane	8d8ff5f7db	Improve error message for rejecting RETURNING clauses with dropped columns. This error message was written with only ON SELECT rules in mind, but since then we also made RETURNING-clause targetlists go through the same logic. This means that you got a rather off-topic error message if you tried to add a rule with RETURNING to a table having dropped columns. Ideally we'd just support that, but some preliminary investigation says that it might be a significant amount of work. Seeing that Nicklas Avén's complaint is the first one we've gotten about this in the ten years or so that the code's been like that, I'm unwilling to put much time into it. Instead, improve the error report by issuing a different message for RETURNING cases, and revise the associated comment based on this investigation. Discussion: 1456176604.17219.9.camel@jordogskog.no	2016-02-29 19:11:38 -05:00
Alvaro Herrera	10b4852215	Fix typos Author: Amit Langote	2016-02-29 18:11:58 -03:00
Tom Lane	c110678a47	Remove useless unary plus. It's harmless, but might confuse readers. Seems to have been introduced in `6bc8ef0b7f`. Back-patch, just to avoid cosmetic cross-branch differences. Amit Langote	2016-02-29 10:48:40 -05:00
Tom Lane	05893712cc	Fix build under OPTIMIZER_DEBUG. In commit `19a541143a` I replaced RelOptInfo.width with RelOptInfo.reltarget.width, but I missed updating debug_print_rel() for that because it's not compiled by default. Reported by Salvador Fandino, patch by Michael Paquier.	2016-02-29 10:14:12 -05:00
Dean Rasheed	41fedc2462	Fix incorrect varlevelsup in security_barrier_replace_vars(). When converting an RTE with securityQuals into a security barrier subquery RTE, ensure that the Vars in the new subquery's targetlist all have varlevelsup = 0 so that they correctly refer to the underlying base relation being wrapped. The original code was creating new Vars by copying them from existing Vars referencing the base relation found elsewhere in the query, but failed to account for the fact that such Vars could come from sublink subqueries, and hence have varlevelsup > 0. In practice it looks like this could only happen with nested security barrier views, where the outer view has a WHERE clause containing a correlated subquery, due to the order in which the Vars are processed. Bug: #13988 Reported-by: Adam Guthrie Backpatch-to: 9.4, where updatable SB views were introduced	2016-02-29 12:28:06 +00:00
Tom Lane	907e4dd2b1	Avoid multiple free_struct_lconv() calls on same data. A failure partway through PGLC_localeconv() led to a situation where the next call would call free_struct_lconv() a second time, leading to free() on already-freed strings, typically leading to a core dump. Add a flag to remember whether we need to do that. Per report from Thom Brown. His example case only provokes the failure as far back as 9.4, but nonetheless this code is obviously broken, so back-patch to all supported branches.	2016-02-28 23:39:20 -05:00
Robert Haas	7bea19d0a9	On second thought, disable parallelism for prepared statements. CREATE TABLE .. AS EXECUTE can turn an apparently read-only query into a write operation, which parallel query can't handle. It's a bit of a shame that requires us to avoid parallel query for queries prepared via PREPARE in all cases, but for right now it does.	2016-02-26 16:33:37 +05:30
Robert Haas	35746bc348	Add new FDW API to test for parallel-safety. This is basically a bug fix; the old code assumes that a ForeignScan is always parallel-safe, but for postgres_fdw, for example, this is definitely false. It should be true for file_fdw, though, since a worker can read a file from the filesystem just as well as any other backend process. Original patch by Thomas Munro. Documentation, and changes to the comments, by me.	2016-02-26 16:14:46 +05:30
Robert Haas	57a6a72b6b	Enable parallelism for prepared statements and extended query protocol. Parallel query can't handle running a query only partially rather than to completion. However, there seems to be no way to run a statement prepared via SQL PREPARE other than to completion, so we can enable it there without a problem. The situation is more complicated for the extend query protocol. libpq seems to provide no way to send an Execute message with a non-zero rowcount, but some other client might. If that happens, and a parallel plan was chosen, we'll execute the parallel plan without using any workers, which may be somewhat inefficient but should still work. Hopefully this won't be a problem; users can always set max_parallel_degree=0 to avoid choosing parallel plans in the first place. Amit Kapila, reviewed by me.	2016-02-25 13:02:18 +05:30
Tom Lane	52f5d578d6	Create a function to reliably identify which sessions block which others. This patch introduces "pg_blocking_pids(int) returns int[]", which returns the PIDs of any sessions that are blocking the session with the given PID. Historically people have obtained such information using a self-join on the pg_locks view, but it's unreasonably tedious to do it that way with any modicum of correctness, and the addition of parallel queries has pretty much broken that approach altogether. (Given some more columns in the view than there are today, you could imagine handling parallel-query cases with a 4-way join; but ugh.) The new function has the following behaviors that are painful or impossible to get right via pg_locks: 1. Correctly understands which lock modes block which other ones. 2. In soft-block situations (two processes both waiting for conflicting lock modes), only the one that's in front in the wait queue is reported to block the other. 3. In parallel-query cases, reports all sessions blocking any member of the given PID's lock group, and reports a session by naming its leader process's PID, which will be the pg_backend_pid() value visible to clients. The motivation for doing this right now is mostly to fix the isolation tests. Commit `38f8bdcac4` lobotomized isolationtester's is-it-waiting query by removing its ability to recognize nonconflicting lock modes, as a crude workaround for the inability to handle soft-block situations properly. But even without the lock mode tests, the old query was excessively slow, particularly in CLOBBER_CACHE_ALWAYS builds; some of our buildfarm animals fail the new deadlock-hard test because the deadlock timeout elapses before they can probe the waiting status of all eight sessions. Replacing the pg_locks self-join with use of pg_blocking_pids() is not only much more correct, but a lot faster: I measure it at about 9X faster in a typical dev build with Asserts, and 3X faster in CLOBBER_CACHE_ALWAYS builds. That should provide enough headroom for the slower CLOBBER_CACHE_ALWAYS animals to pass the test, without having to lengthen deadlock_timeout yet more and thus slow down the test for everyone else.	2016-02-22 14:31:43 -05:00
Tom Lane	73bf8715aa	Remove redundant PGPROC.lockGroupLeaderIdentifier field. We don't really need this field, because it's either zero or redundant with PGPROC.pid. The use of zero to mark "not a group leader" is not necessary since we can just as well test whether lockGroupLeader is NULL. This does not save very much, either as to code or data, but the simplification seems worthwhile anyway.	2016-02-22 11:20:35 -05:00
Andres Freund	ea56b06cf7	Fix wrong keysize in PrivateRefCountHash creation. In `4b4b680c3` I accidentally used sizeof(PrivateRefCountArray) instead of sizeof(PrivateRefCountEntry) when creating the refcount overflow hashtable. As the former is bigger than the latter, this luckily only resulted in a slightly increased memory usage when many buffers are pinned in a backend. Reported-By: Takashi Horikawa Discussion: 73FA3881462C614096F815F75628AFCD035A48C3@BPXM01GP.gisp.nec.co.jp Backpatch: 9.5, where thew new ref count infrastructure was introduced	2016-02-21 22:48:44 -08:00
Andrew Dunstan	94c745eb18	Fix two-argument jsonb_object when called with empty arrays Some over-eager copy-and-pasting on my part resulted in a nonsense result being returned in this case. I have adopted the same pattern for handling this case as is used in the one argument form of the function, i.e. we just skip over the code that adds values to the object. Diagnosis and patch from Michael Paquier, although not quite his solution. Fixes bug #13936. Backpatch to 9.5 where jsonb_object was introduced.	2016-02-21 10:30:49 -05:00
Robert Haas	88aca5662d	Fix incorrect decision about which lock to take. Spotted by Tom Lane.	2016-02-21 17:06:41 +05:30
Robert Haas	d91a4a6c85	Cosmetic improvements to group locking. Reflow text in lock manager README so that it fits within 80 columns. Correct some mistakes. Expand the README to explain not only why group locking exists but also the data structures that support it. Improve comments related to group locking several files. Change the name of a macro argument for improved clarity. Most of these problems were reported by Tom Lane, but I found a few of them myself. Robert Haas and Tom Lane	2016-02-21 15:42:02 +05:30
Dean Rasheed	740d71842b	Further fixing to make pg_size_bytes() portable. Not all compilers support "long long" and the "LL" integer literal suffix, so use a cast to int64 instead.	2016-02-20 15:49:26 +00:00
Dean Rasheed	ad7cc1c554	Fix pg_size_bytes() to be more portable. Commit `53874c5228` broke various 32-bit buildfarm machines because it incorrectly used an 'L' suffix for what needed to be a 64-bit literal. Thanks to Michael Paquier for helping to diagnose this.	2016-02-20 11:03:04 +00:00
Dean Rasheed	53874c5228	Add pg_size_bytes() to parse human-readable size strings. This will parse strings in the format produced by pg_size_pretty() and return sizes in bytes. This allows queries to be written with clauses like "pg_total_relation_size(oid) > pg_size_bytes('10 GB')". Author: Pavel Stehule with various improvements by Vitaly Burovoy Discussion: http://www.postgresql.org/message-id/CAFj8pRD-tGoDKnxdYgECzA4On01_uRqPrwF-8LdkSE-6bDHp0w@mail.gmail.com Reviewed-by: Vitaly Burovoy, Oleksandr Shulgin, Kyotaro Horiguchi, Michael Paquier and Robert Haas	2016-02-20 09:57:27 +00:00
Simon Riggs	481725c0ba	Correct StartupSUBTRANS for page wraparound StartupSUBTRANS() incorrectly handled cases near the max pageid in the subtrans data structure, which in some cases could lead to errors in startup for Hot Standby. This patch wraps the pageids correctly, avoiding any such errors. Identified by exhaustive crash testing by Jeff Janes. Jeff Janes	2016-02-19 08:31:12 +00:00
Tom Lane	19a541143a	Add an explicit representation of the output targetlist to Paths. Up to now, there's been an assumption that all Paths for a given relation compute the same output column set (targetlist). However, there are good reasons to remove that assumption. For example, an indexscan on an expression index might be able to return the value of an expensive function "for free". While we have the ability to generate such a plan today in simple cases, we don't have a way to model that it's cheaper than a plan that computes the function from scratch, nor a way to create such a plan in join cases (where the function computation would normally happen at the topmost join node). Also, we need this so that we can have Paths representing post-scan/join steps, where the targetlist may well change from one step to the next. Therefore, invent a "struct PathTarget" representing the columns we expect a plan step to emit. It's convenient to include the output tuple width and tlist evaluation cost in this struct, and there will likely be additional fields in future. While Path nodes that actually do have custom outputs will need their own PathTargets, it will still be true that most Paths for a given relation will compute the same tlist. To reduce the overhead added by this patch, keep a "default PathTarget" in RelOptInfo, and allow Paths that compute that column set to just point to their parent RelOptInfo's reltarget. (In the patch as committed, actually every Path is like that, since we do not yet have any cases of custom PathTargets.) I took this opportunity to provide some more-honest costing of PlaceHolderVar evaluation. Up to now, the assumption that "scan/join reltargetlists have cost zero" was applied not only to Vars, where it's reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval cost of a PlaceHolderVar's expression to the first plan level where it can be computed, by including it in the PathTarget cost field and adding that to the cost estimates for Paths. This isn't perfect yet but it's much better than before, and there is a way forward to improve it more. This costing change affects the join order chosen for a couple of the regression tests, changing expected row ordering.	2016-02-18 20:02:03 -05:00
Peter Eisentraut	18777c38e9	Improve error message about active replication slot The old phrasing was awkward if a replication slot is activated and deactivated repeatedly.	2016-02-17 21:23:28 -05:00
Joe Conway	a5c43b8869	Add new system view, pg_config Move and refactor the underlying code for the pg_config client application to src/common in support of sharing it with a new system information SRF called pg_config() which makes the same information available via SQL. Additionally wrap the SRF with a new system view, as called pg_config. Patch by me with extensive input and review by Michael Paquier and additional review by Alvaro Herrera.	2016-02-17 09:12:06 -08:00
Robert Haas	f1f5ec1efa	Reuse abbreviated keys in ordered [set] aggregates. When processing ordered aggregates following a sort that could make use of the abbreviated key optimization, only call the equality operator to compare successive pairs of tuples when their abbreviated keys were not equal. Peter Geoghegan, reviewd by Andreas Karlsson and by me.	2016-02-17 15:40:00 +05:30
Andres Freund	7975c5e0a9	Allow the WAL writer to flush WAL at a reduced rate. Commit `4de82f7d7` increased the WAL flush rate, mainly to increase the likelihood that hint bits can be set quickly. More quickly set hint bits can reduce contention around the clog et al. But unfortunately the increased flush rate can have a significant negative performance impact, I have measured up to a factor of ~4. The reason for this slowdown is that if there are independent writes to the underlying devices, for example because shared buffers is a lot smaller than the hot data set, or because a checkpoint is ongoing, the fdatasync() calls force cache flushes to be emitted to the storage. This is achieved by flushing WAL only if the last flush was longer than wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC) unflushed blocks are pending. Based on some tests the default for wal_writer_delay is 1MB, which seems to work well both on SSD and rotational media. To avoid negative performance impact due to `4de82f7d7` an earlier commit (`db76b1e`) made SetHintBits() more likely to succeed; preventing performance regressions in the pgbench tests I performed. Discussion: 20160118163908.GW10941@awork2.anarazel.de	2016-02-16 00:56:34 +01:00
Tom Lane	8c95ae81fa	Suppress compiler warnings about useless comparison of unsigned to zero. Reportedly, some compilers warn about tests like "c < 0" if c is unsigned, and hence complain about the character range checks I added in commit `3bb3f42f37`. This is a bit of a pain since the regex library doesn't really want to assume that chr is unsigned. However, since any such reconfiguration would involve manual edits of regcustom.h anyway, we can put it on the shoulders of whoever wants to do that to adjust this new range-checking macro correctly. Per gripes from Coverity and Andres.	2016-02-15 17:12:16 -05:00
Andres Freund	db76b1efbb	Allow SetHintBits() to succeed if the buffer's LSN is new enough. Previously we only allowed SetHintBits() to succeed if the commit LSN of the last transaction touching the page has already been flushed to disk. We can't generally change the LSN of the page, because we don't necessarily have the required locks on the page. But the required LSN interlock does not mean the commit record has to be flushed immediately, it just requires that the commit record will be flushed before the page is written out. Therefore if the buffer LSN is newer than the commit LSN, the hint bit can be safely set. In a number of scenarios (e.g. pgbench) this noticeably increases the number of hint bits are set. But more importantly it also keeps the success rate up when flushing WAL less frequently. That was the original reason for commit `4de82f7d7`, which has negative performance consequences in a number of scenarios. This will allow a followup commit to reduce the flush rate. Discussion: 20160118163908.GW10941@awork2.anarazel.de	2016-02-15 22:48:51 +01:00
Fujii Masao	31b6606c48	Make concurrent refresh check early that there is a unique index on matview. In REFRESH MATERIALIZED VIEW command, CONCURRENTLY option is only allowed if there is at least one unique index with no WHERE clause on one or more columns of the matview. Previously, concurrent refresh checked the existence of a unique index on the matview after filling the data to new snapshot, i.e., after calling refresh_matview_datafill(). So, when there was no unique index, we could need to wait a long time before we detected that and got the error. It was a waste of time. To eliminate such wasting time, this commit changes concurrent refresh so that it checks the existence of a unique index at the beginning of the refresh operation, i.e., before starting any time-consuming jobs. If CONCURRENTLY option is not allowed due to lack of a unique index, concurrent refresh can immediately detect it and emit an error. Author: Masahiko Sawada Reviewed-by: Michael Paquier, Fujii Masao	2016-02-16 02:15:44 +09:00
Tom Lane	9b92e76f7b	Make GetLockStatusData's header comment resemble reality. The API spec for this function was changed completely (and for the better) by commit `3cba8999b3`, but it didn't bother with anything as mundane as updating the comments.	2016-02-13 15:42:31 -05:00
Joe Conway	59a884e985	Change delimiter used for display of NextXID NextXID has been rendered in the form of a pg_lsn even though it really is not. This can cause confusion, so change the format from %u/%u to %u:%u, per discussion on hackers. Complaint by me, patch by me and Bruce, reviewed by Michael Paquier and Alvaro. Applied to HEAD only. Author: Joe Conway, Bruce Momjian Reviewed-by: Michael Paquier, Alvaro Herrera Backpatch-through: master	2016-02-12 14:23:59 -08:00
Robert Haas	bcac23de73	Introduce extensible node types. An extensible node is always tagged T_Extensible, but the extnodename field identifies it more specifically; it may also include arbitrary private data. Extensible nodes can be copied, tested for equality, serialized, and deserialized, but the core system doesn't know anything about them otherwise. Some extensions may find it useful to include these nodes in fdw_private or custom_private lists in lieu of arm-wrestling their data into a format that the core code can understand. Along the way, so as not to burden the authors of such extensible node types too much, expose the functions for writing serialized tokens, and for serializing and deserializing bitmapsets. KaiGai Kohei, per a design suggested by me. Reviewed by Andres Freund and by me, and further edited by me.	2016-02-12 09:38:11 -05:00
Robert Haas	63461a63f9	Make builtin lwlock tranche names consistent. Previously, we had a mix of styles. Amit Kapila	2016-02-12 08:07:11 -05:00
Tom Lane	f144f73242	Refactor check_functional_grouping() to use get_primary_key_attnos(). If we ever get around to allowing functional dependency to be proven from other things besides simple primary keys, this code will need to be rethought, but that was true anyway. In the meantime, we might as well not have two very-similar routines for scanning pg_constraint. David Rowley, reviewed by Julien Rouhaud	2016-02-11 17:52:03 -05:00
Tom Lane	d4c3a156cb	Remove GROUP BY columns that are functionally dependent on other columns. If a GROUP BY clause includes all columns of a non-deferred primary key, as well as other columns of the same relation, those other columns are redundant and can be dropped from the grouping; the pkey is enough to ensure that each row of the table corresponds to a separate group. Getting rid of the excess columns will reduce the cost of the sorting or hashing needed to implement GROUP BY, and can indeed remove the need for a sort step altogether. This seems worth testing for since many query authors are not aware of the GROUP-BY-primary-key exception to the rule about queries not being allowed to reference non-grouped-by columns in their targetlists or HAVING clauses. Thus, redundant GROUP BY items are not uncommon. Also, we can make the test pretty cheap in most queries where it won't help by not looking up a rel's primary key until we've found that at least two of its columns are in GROUP BY. David Rowley, reviewed by Julien Rouhaud	2016-02-11 17:34:59 -05:00
Tom Lane	72eee410d4	Move pg_constraint.h function declarations to new file pg_constraint_fn.h. A pending patch requires exporting a function returning Bitmapset from catalog/pg_constraint.c. As things stand, that would mean including nodes/bitmapset.h in pg_constraint.h, which might be hazardous for the client-side includability of that header. It's not entirely clear whether any client-side code needs to include pg_constraint.h, but it seems prudent to assume that there is some such code somewhere. Therefore, split off the function definitions into a new file pg_constraint_fn.h, similarly to what we've done for some other catalog header files.	2016-02-11 15:51:28 -05:00
Tom Lane	2564be360a	Fix typo in comment.	2016-02-11 15:20:14 -05:00
Tom Lane	d18643c4a6	Shift the responsibility for emitting "database system is shut down". Historically this message has been emitted at the end of ShutdownXLOG(). That's not an insane place for it in a standalone backend, but in the postmaster environment we've grown a fair amount of stuff that happens later, including archiver/walsender shutdown, stats collector shutdown, etc. Recent buildfarm experimentation showed that on slower machines there could be many seconds' delay between finishing ShutdownXLOG() and actual postmaster exit. That's fairly confusing, both for testing purposes and for DBAs. Hence, move the code that prints this message into UnlinkLockFiles(), so that it comes out just after we remove the postmaster's pidfile. That is a more appropriate definition of "is shut down" from the point of view of "pg_ctl stop", for example. In general, removing the pidfile should be the last externally-visible action of either a postmaster or a standalone backend; compare commit `d73d14c271` for instance. So this seems like a reasonably future-proof approach.	2016-02-11 14:14:22 -05:00
Robert Haas	c319991bca	Use separate lwlock tranches for buffer, lock, and predicate lock managers. This finishes the work - spread across many commits over the last several months - of putting each type of lock other than the named individual locks into a separate tranche. Amit Kapila	2016-02-11 14:07:33 -05:00
Teodor Sigaev	07d25a964b	Improve error reporting in format() Clarify invalid format conversion type error message and add hint. Author: Jim Nasby	2016-02-11 18:11:11 +03:00
Robert Haas	a455878d99	Rename PGPROC fields related to group XID clearing again. Commit `0e141c0fbb` introduced a new facility to reduce ProcArrayLock contention by clearing several XIDs from the ProcArray under a single lock acquisition. The names initially chosen were deemed not to be very good choices, so commit `4aec49899e` renamed them. But now it seems like we still didn't get it right. A pending patch wants to add similar infrastructure for batching CLOG updates, so the names need to be clear enough to allow a new set of structure members with a related purpose. Amit Kapila	2016-02-11 08:55:24 -05:00
Tom Lane	51e78ab4ff	Avoid use of sscanf() to parse ispell dictionary files. It turns out that on FreeBSD-derived platforms (including OS X), the scanf() family of functions is pretty much brain-dead about multibyte characters. In particular it will apply isspace() to individual bytes of input even when those bytes are part of a multibyte character, thus allowing false recognition of a field-terminating space. We appear to have little alternative other than instituting a coding rule that scanf() is not to be used if the input string might contain multibyte characters. (There was some discussion of relying on "%ls", but that probably just moves the portability problem somewhere else, and besides it doesn't fully prevent BSD *scanf() from using isspace().) This patch is a down payment on that: it gets rid of use of sscanf() to parse ispell dictionary files, which are certainly at great risk of having a problem. The code is cleaner this way anyway, though a bit longer. In passing, improve a few comments. Report and patch by Artur Zakirov, reviewed and somewhat tweaked by me. Back-patch to all supported branches.	2016-02-10 19:30:11 -05:00
Tom Lane	c5e9b77127	Revert "Temporarily make pg_ctl and server shutdown a whole lot chattier." This reverts commit `3971f64843` and a couple of followon debugging commits; I think we've learned what we can from them.	2016-02-10 16:01:04 -05:00
Robert Haas	79a7ff0fe5	Code cleanup in the wake of recent LWLock refactoring. As of commit `c1772ad922`, there's no longer any way of requesting additional LWLocks in the main tranche, so we don't need NumLWLocks() or LWLockAssign() any more. Also, some of the allocation counters that we had previously aren't needed any more either. Amit Kapila	2016-02-10 09:58:09 -05:00
Tom Lane	41d505a7ff	Add still more chattiness in server shutdown. Further investigation says that there may be some slow operations after we've finished ShutdownXLOG(), so add some more log messages to try to isolate that. This is all temporary code too.	2016-02-09 19:36:30 -05:00
Tom Lane	7351e18286	Add more chattiness in server shutdown. Early returns from the buildfarm show that there's a bit of a gap in the logging I added in 3971f64843b02e4a: the portion of CreateCheckPoint() after CheckPointGuts() can take a fair amount of time. Add a few more log messages in that section of code. This too shall be reverted later.	2016-02-09 11:21:46 -05:00
Tom Lane	3971f64843	Temporarily make pg_ctl and server shutdown a whole lot chattier. This is a quick hack, due to be reverted when its purpose has been served, to try to gather information about why some of the buildfarm critters regularly fail with "postmaster does not shut down" complaints. Maybe they are just really overloaded, but maybe something else is going on. Hence, instrument pg_ctl to print the current time when it starts waiting for postmaster shutdown and when it gives up, and add a lot of logging of the current time in the server's checkpoint and shutdown code paths. No attempt has been made to make this pretty. I'm not even totally sure if it will build on Windows, but we'll soon find out.	2016-02-08 18:43:11 -05:00
Tom Lane	0231f83856	Re-pgindent varlena.c. Just to make sure previous commit worked ...	2016-02-08 15:17:40 -05:00
Tom Lane	58e797216f	Rename typedef "string" to "VarString". Since pgindent treats typedef names as global, the original coding of `b47b4dbf68` would have had rather nasty effects on the formatting of other files in which "string" is used as a variable or field name. Use a less generic name for this typedef, and rename some other identifiers to match. Peter Geoghegan, per gripe from me	2016-02-08 15:15:56 -05:00
Tom Lane	3bb3f42f37	Fix some regex issues with out-of-range characters and large char ranges. Previously, our regex code defined CHR_MAX as 0xfffffffe, which is a bad choice because it is outside the range of type "celt" (int32). Characters approaching that limit could lead to infinite loops in logic such as "for (c = a; c <= b; c++)" where c is of type celt but the range bounds are chr. Such loops will work safely only if CHR_MAX+1 is representable in celt, since c must advance to beyond b before the loop will exit. Fortunately, there seems no reason not to restrict CHR_MAX to 0x7ffffffe. It's highly unlikely that Unicode will ever assign codes that high, and none of our other backend encodings need characters beyond that either. In addition to modifying the macro, we have to explicitly enforce character range restrictions on the values of \u, \U, and \x escape sequences, else the limit is trivially bypassed. Also, the code for expanding case-independent character ranges in bracket expressions had a potential integer overflow in its calculation of the number of characters it could generate, which could lead to allocating too small a character vector and then overwriting memory. An attacker with the ability to supply arbitrary regex patterns could easily cause transient DOS via server crashes, and the possibility for privilege escalation has not been ruled out. Quite aside from the integer-overflow problem, the range expansion code was unnecessarily inefficient in that it always produced a result consisting of individual characters, abandoning the knowledge that we had a range to start with. If the input range is large, this requires excessive memory. Change it so that the original range is reported as-is, and then we add on any case-equivalent characters that are outside that range. With this approach, we can bound the number of individual characters allowed without sacrificing much. This patch allows at most 100000 individual characters, which I believe to be more than the number of case pairs existing in Unicode, so that the restriction will never be hit in practice. It's still possible for range() to take awhile given a large character code range, so also add statement-cancel detection to its loop. The downstream function dovec() also lacked cancel detection, and could take a long time given a large output from range(). Per fuzz testing by Greg Stark. Back-patch to all supported branches. Security: CVE-2016-0773	2016-02-08 10:25:40 -05:00
Andres Freund	a6897efab9	Fix overeager pushdown of HAVING clauses when grouping sets are used. In `61444bfb` we started to allow HAVING clauses to be fully pushed down into WHERE, even when grouping sets are in use. That turns out not to work correctly, because grouping sets can "produce" NULLs, meaning that filtering in WHERE and HAVING can have different results, even when no aggregates or volatile functions are involved. Instead only allow pushdown of empty grouping sets. It'd be nice to do better, but the exact mechanics of deciding which cases are safe are still being debated. It's important to give correct results till we find a good solution, and such a solution might not be appropriate for backpatching anyway. Bug: #13863 Reported-By: 'wrb' Diagnosed-By: Dean Rasheed Author: Andrew Gierth Reviewed-By: Dean Rasheed and Andres Freund Discussion: 20160113183558.12989.56904@wrigleys.postgresql.org Backpatch: 9.5, where grouping sets were introduced	2016-02-08 11:03:31 +01:00
Tom Lane	cc2ca9319a	Fix deparsing of ON CONFLICT arbiter WHERE clauses. The parser doesn't allow qualification of column names appearing in these clauses, but ruleutils.c would sometimes qualify them, leading to dump/reload failures. Per bug #13891 from Onder Kalaci. (In passing, make stanzas in ruleutils.c that save/restore varprefix more consistent.) Peter Geoghegan	2016-02-07 14:57:24 -05:00
Tom Lane	f867ce5518	ExecHashRemoveNextSkewBucket must physically copy tuples to main hashtable. Commit `45f6240a8f` added an assumption in ExecHashIncreaseNumBatches and ExecHashIncreaseNumBuckets that they could find all tuples in the main hash table by iterating over the "dense storage" introduced by that patch. However, ExecHashRemoveNextSkewBucket continued its old practice of simply re-linking deleted skew tuples into the main table's hashchains. Hence, such tuples got lost during any subsequent increase in nbatch or nbuckets, and would never get joined, as reported in bug #13908 from Seth P. I (tgl) think that the aforesaid commit has got multiple design issues and should be reworked rather completely; but there is no time for that right now, so band-aid the problem by making ExecHashRemoveNextSkewBucket physically copy deleted skew tuples into the "dense storage" arena. The added test case is able to exhibit the problem by means of fooling the planner with a WHERE condition that it will underestimate the selectivity of, causing the initial nbatch estimate to be too small. Tomas Vondra and Tom Lane. Thanks to David Johnston for initial investigation into the bug report.	2016-02-07 12:29:32 -05:00
Robert Haas	7c944bd903	Introduce a new GUC force_parallel_mode for testing purposes. When force_parallel_mode = true, we enable the parallel mode restrictions for all queries for which this is believed to be safe. For the subset of those queries believed to be safe to run entirely within a worker, we spin up a worker and run the query there instead of running it in the original process. When force_parallel_mode = regress, make additional changes to allow the regression tests to run cleanly even though parallel workers have been injected under the hood. Taken together, this facilitates both better user testing and better regression testing of the parallelism code. Robert Haas, with help from Amit Kapila and Rushabh Lathia.	2016-02-07 11:41:33 -05:00
Robert Haas	a1c1af2a1f	Introduce group locking to prevent parallel processes from deadlocking. For locking purposes, we now regard heavyweight locks as mutually non-conflicting between cooperating parallel processes. There are some possible pitfalls to this approach that are not to be taken lightly, but it works OK for now and can be changed later if we find a better approach. Without this, it's very easy for parallel queries to silently self-deadlock if the user backend holds strong relation locks. Robert Haas, with help from Amit Kapila. Thanks to Noah Misch and Andres Freund for extensive discussion of possible issues with this approach.	2016-02-07 10:16:13 -05:00
Tom Lane	aa2387e2fd	Improve speed of timestamp/time/date output functions. It seems that sprintf(), at least in glibc's version, is unreasonably slow compared to hand-rolled code for printing integers. Replacing most uses of sprintf() in the datetime.c output functions with special-purpose code turns out to give more than a 2X speedup in COPY of a table with a single timestamp column; which is pretty impressive considering all the other logic in that code path. David Rowley and Andres Freund, reviewed by Peter Geoghegan and myself	2016-02-06 23:11:28 -05:00
Tom Lane	b921aeb167	Fix comment block trashed by pgindent. Looks like I put the protective dashes in the wrong place in `f4e4b32743`.	2016-02-06 15:13:36 -05:00
Tom Lane	be11f8400d	Improve HJDEBUG code a bit. Commit `30d7ae3c76` introduced an HJDEBUG stanza that probably didn't compile at the time, and definitely doesn't compile now, because it refers to a nonexistent variable. It doesn't seem terribly useful anyway, so just get rid of it. While I'm fooling with it, use %z modifier instead of the obsolete hack of casting size_t to unsigned long, and include the HashJoinTable's address in each printout so that it's possible to distinguish the activities of multiple hashjoins occurring in one query. Noted while trying to use HJDEBUG to investigate bug #13908. Back-patch to 9.5, because code that doesn't compile is certainly not very helpful.	2016-02-06 15:05:23 -05:00
Noah Misch	41baee7a93	Comment on dead code in AtAbort_Portals() and AtSubAbort_Portals(). Reviewed by Tom Lane and Robert Haas.	2016-02-05 20:23:40 -05:00
Noah Misch	f4aa3a18a2	Force certain "pljava" custom GUCs to be PGC_SUSET. Future PL/Java versions will close CVE-2016-0766 by making these GUCs PGC_SUSET. This PostgreSQL change independently mitigates that PL/Java vulnerability, helping sites that update PostgreSQL more frequently than PL/Java. Back-patch to 9.1 (all supported versions).	2016-02-05 20:22:51 -05:00
Robert Haas	e98fd78607	Fix typo in comment. Michael Paquier	2016-02-05 08:11:00 -05:00
Robert Haas	e0e7b8fa22	Remove parallel-safety check from GetExistingLocalJoinPath. Commit `a104a017fc` has this check because I added it to the submitted patch before commit, but that was entirely wrongheaded, as explained to me by Ashutosh Bapat, who also wrote this patch.	2016-02-05 08:07:38 -05:00
Robert Haas	63f39b9148	Fix small goof in comment. Peter Geoghegan	2016-02-05 08:04:48 -05:00
Tom Lane	6819514fca	Add num_nulls() and num_nonnulls() to count NULL arguments. An example use-case is "CHECK(num_nonnulls(a,b,c) = 1)" to assert that exactly one of a,b,c isn't NULL. The functions are variadic, so they can also be pressed into service to count the number of null or nonnull elements in an array. Marko Tiikkaja, reviewed by Pavel Stehule	2016-02-04 23:03:37 -05:00
Robert Haas	9418d79a76	When modifying a foreign table, initialize tableoid field properly. Failure to do this can cause AFTER ROW triggers or RETURNING expressions that reference this field to misbehave. Etsuro Fujita, reviewed by Thom Brown	2016-02-04 21:17:53 -05:00
Peter Eisentraut	f8003e07f9	Improve error message	2016-02-04 20:41:32 -05:00
Robert Haas	a104a017fc	Add some additional core functions to support join pushdown for FDWs. GetExistingLocalJoinPath() is useful for handling EvalPlanQual rechecks properly, and GetUserMappingById() is needed to make sure you're using the right credentials. Shigeru Hanada, Etsuro Fujita, Ashutosh Bapat, Robert Haas	2016-02-04 17:05:09 -05:00
Robert Haas	c1772ad922	Change the way that LWLocks for extensions are allocated. The previous RequestAddinLWLocks() method had several disadvantages. First, the locks would be in the main tranche; we've recently decided that it's useful for LWLocks used for separate purposes to have separate tranche IDs. Second, there wasn't any correlation between what code called RequestAddinLWLocks() and what code called LWLockAssign(); when multiple modules are in use, it could become quite difficult to troubleshoot problems where LWLockAssign() ran out of locks. To fix, create a concept of named LWLock tranches which can be used either by extension or by core code. Amit Kapila and Robert Haas	2016-02-04 16:43:04 -05:00
Robert Haas	b47b4dbf68	Extend sortsupport for text to more opclasses. Have varlena.c expose an interface that allows the char(n), bytea, and bpchar types to piggyback on a now-generalized SortSupport for text. This pushes a little more knowledge of the bpchar/char(n) type into varlena.c than might be preferred, but that seems like the approach that creates least friction. Also speed things up for index builds that use text_pattern_ops or varchar_pattern_ops. This patch does quite a bit of renaming, but it seems likely to be worth it, so as to avoid future confusion about the fact that this code is now more generally used than the old names might have suggested. Peter Geoghegan, reviewed by Álvaro Herrera and Andreas Karlsson, with small tweaks by me.	2016-02-03 14:29:53 -05:00
Robert Haas	69d34408e5	Allow parallel custom and foreign scans. This patch doesn't put the new infrastructure to use anywhere, and indeed it's not clear how it could ever be used for something like postgres_fdw which has to send an SQL query and wait for a reply, but there might be FDWs or custom scan providers that are CPU-bound, so let's give them a way to join club parallel. KaiGai Kohei, reviewed by me.	2016-02-03 12:49:46 -05:00
Robert Haas	f2305d40ec	Remove CustomPath's TextOutCustomPath method. You can't really do anything useful with this in the form it currently exists; among other problems, there's no way to reread whatever information might be produced when the path is output. Work is underway to replace this with a more useful and more general system of extensible nodes, but let's start by getting rid of this bit. Extracted from a larger patch by KaiGai Kohei.	2016-02-03 10:38:50 -05:00
Tom Lane	e6ecc93a17	Fix IsValidJsonNumber() to notice trailing non-alphanumeric garbage. Commit `e09996ff8d` was one brick shy of a load: it didn't insist that the detected JSON number be the whole of the supplied string. This allowed inputs such as "2016-01-01" to be misdetected as valid JSON numbers. Per bug #13906 from Dmitry Ryabov. In passing, be more wary of zero-length input (I'm not sure this can happen given current callers, but better safe than sorry), and do some minor cosmetic cleanup.	2016-02-03 01:39:48 -05:00
Peter Eisentraut	7d17e683fc	Add support for systemd service notifications Insert sd_notify() calls at server start and stop for integration with systemd. This allows the use of systemd service units of type "notify", which greatly simplifies the systemd configuration. Reviewed-by: Pavel Stěhule <pavel.stehule@gmail.com>	2016-02-02 21:04:29 -05:00
Peter Eisentraut	ac7238dc0f	Improve error reporting when location specified by postgres -D does not exist Previously, the first error seen would be that postgresql.conf does not exist. But for the case where the whole directory does not exist, give an error message about that, together with a hint for how to create one.	2016-02-02 21:03:19 -05:00
Alvaro Herrera	3cb5867b7d	Don't test for system columns on join relations create_foreignscan_plan needs to know whether any system columns are requested from a relation (this flag is needed by ForeignNext during execution). However, for join relations this is a pointless test, because it's not possible to request system columns from them, so remove the check. Author: Etsuro Fujita Discussion: http://www.postgresql.org/message-id/56AA0FC5.9000207@lab.ntt.co.jp Reviewed-by: David Rowley, Robert Haas	2016-02-02 19:20:02 +01:00
Teodor Sigaev	f25d07d99f	Fix lossy KNN GiST when ordering operator returns non-float8 value. KNN GiST with recheck flag should return to executor the same type as ordering operator, GiST detects this type by looking to return type of function which implements ordering operator. But occasionally detecting code works after replacing ordering operator function to distance support function. Distance support function always returns float8, so, detecting code get float8 instead of actual return type of ordering operator. Built-in opclasses don't have ordering operator which doesn't return non-float8 value, so, tests are impossible here, at least now. Backpatch to 9.5 where lozzy KNN was introduced. Author: Alexander Korotkov Report by: Artur Zakirov	2016-02-02 15:20:33 +03:00
Robert Haas	7191ce8bea	Make all built-in lwlock tranche IDs fixed. This makes the values more stable, which seems like a good thing for anybody who needs to look at at them. Alexander Korotkov and Amit Kapila	2016-02-02 06:45:55 -05:00
Magnus Hagander	e51ab85cd9	Fix typos in comments Author: Michael Paquier	2016-02-01 11:43:48 +01:00
Heikki Linnakangas	61ce1e8f15	Fix misspelled function name in comment.	2016-02-01 10:10:24 +02:00
Peter Eisentraut	9217bf3961	Fix whitespace	2016-01-30 15:58:20 -05:00
Robert Haas	2251179e6a	Migrate replication slot I/O locks into a separate tranche. This is following in a long train of similar changes and for the same reasons - see `b319356f0e` and `fe702a7b3f` inter alia. Author: Amit Kapila Reviewed-by: Alexander Korotkov, Robert Haas	2016-01-29 09:45:38 -05:00
Robert Haas	b319356f0e	Migrate PGPROC's backendLock into PGPROC itself, using a new tranche. Previously, each PGPROC's backendLock was part of the main tranche, and the PGPROC just contained a pointer. Now, the actual LWLock is part of the PGPROC. As with previous, similar patches, this makes it significantly easier to identify these lwlocks in LWLOCK_STATS or Trace_lwlocks output and improves modularity. Author: Ildus Kurbangaliev Reviewed-by: Amit Kapila, Robert Haas	2016-01-29 08:14:28 -05:00
Robert Haas	fbe5a3fb73	Only try to push down foreign joins if the user mapping OIDs match. Previously, the foreign join pushdown infrastructure left the question of security entirely up to individual FDWs, but it would be easy for a foreign data wrapper to inadvertently open up subtle security holes that way. So, make it the core code's job to determine which user mapping OID is relevant, and don't attempt join pushdown unless it's the same for all relevant relations. Per a suggestion from Tom Lane. Shigeru Hanada and Ashutosh Bapat, reviewed by Etsuro Fujita and KaiGai Kohei, with some further changes by me.	2016-01-28 14:05:36 -05:00
Robert Haas	96198d94cb	Avoid multiple foreign server connections when all use same user mapping. Previously, postgres_fdw's connection cache was keyed by user OID and server OID, but this can lead to multiple connections when it's not really necessary. In particular, if all relevant users are mapped to the public user mapping, then their connection options are certainly the same, so one connection can be used for all of them. While we're cleaning things up here, drop the "server" argument to GetConnection(), which isn't really needed. This saves a few cycles because callers no longer have to look this up; the function itself does, but only when establishing a new connection, not when reusing an existing one. Ashutosh Bapat, with a few small changes by me.	2016-01-28 12:05:19 -05:00
Fujii Masao	62e2ddd4ca	Fix typos in comments and doc overriden -> overridden The misspelling in create_extension.sgml was introduced in `b67aaf2`, so no need to backpatch.	2016-01-28 16:47:36 +09:00
Fujii Masao	7f46eaf035	Add gin_clean_pending_list function to clean up GIN pending list This function cleans up the pending list of the GIN index by moving entries in it to the main GIN data structure in bulk. It returns the number of pages cleaned up from the pending list. This function is useful, for example, when the pending list needs to be cleaned up quickly to improve the performance of the search using GIN index. VACUUM can do the same thing, too, but it may take days to run on a large table. Jeff Janes, reviewed by Julien Rouhaud, Jaime Casanova, Alvaro Herrera and me. Discussion: CAMkU=1x8zFkpfnozXyt40zmR3Ub_kHu58LtRmwHUKRgQss7=iQ@mail.gmail.com	2016-01-28 12:57:52 +09:00
Robert Haas	eaf7b1f643	Assert that create_unique_path returns non-NULL. Per off-list discussion with Tom Lane and Michael Paquier, Coverity gets unhappy if this is not done.	2016-01-27 22:03:18 -05:00
Tom Lane	b8682a7155	Fix startup so that log prefix %h works for the log_connections message. We entirely randomly chose to initialize port->remote_host just after printing the log_connections message, when we could perfectly well do it just before, allowing %h and %r to work for that message. Per gripe from Artem Tomyuk.	2016-01-26 15:38:33 -05:00
Tom Lane	cc988fbb0b	Improve ResourceOwners' behavior for large numbers of owned objects. The original coding was quite fast so long as objects were always released in reverse order of addition; otherwise, it degenerated into O(N^2) behavior due to searching for the array element to delete. Improve matters by switching to hashed storage when the number of objects of a given type exceeds 64. (The cutover point is open to discussion, of course, but some simple performance testing suggests that hashing has enough overhead to be a loser below there.) Also, refactor resowner.c so that we don't need N copies of the array management code. Since all the resource IDs the code currently needs to deal with are either pointers or integers, it seems sufficient to create a one-size-fits-all infrastructure in which everything is converted to a Datum for storage. Aleksander Alekseev, reviewed by Stas Kelvich, further fixes by me	2016-01-26 15:20:30 -05:00
Simon Riggs	1129c2b0ad	Correct comment in GetConflictingVirtualXIDs() We use Share lock because it is safe to do so.	2016-01-24 10:22:11 -08:00
Tom Lane	00347575e2	Yet further adjust degree-based trig functions for more portability. Buildfarm member cockatiel is still saying that cosd(60) isn't 0.5. What seems likely is that the subexpression (1.0 - cos(x)) isn't being rounded to double width before more arithmetic is done on it, so force that by storing it into a variable.	2016-01-24 12:53:03 -05:00
Tom Lane	360f67d31a	Still further adjust degree-based trig functions for more portability. Indeed, the non-static declaration foreseen in my previous commit message is necessary. Per Noah Misch.	2016-01-23 18:12:54 -05:00
Tom Lane	65abaab547	Further adjust degree-based trig functions for more portability. The last round didn't do it. Per Noah Misch, the problem on at least some machines is that the compiler pre-evaluates trig functions having constant arguments using code slightly different from what will be used at runtime. Therefore, we must prevent the compiler from seeing constant arguments to any of the libm trig functions used in this code. The method used here might still fail if init_degree_constants() gets inlined into the call sites. That probably won't happen given the large number of call sites; but if it does, we could probably fix it by making init_degree_constants() non-static. I'll avoid that till proven necessary, though.	2016-01-23 16:17:31 -05:00
Tom Lane	73193d82d7	Adjust degree-based trig functions for more portability. The buildfarm isn't very happy with the results of commit `e1bd684a34`. To try to get the expected exact results everywhere: * Replace M_PI / 180 subexpressions with a precomputed constant, so that the compiler can't decide to rearrange that division with an adjacent operation. Hopefully this will fix failures to get exactly 0.5 from sind(30) and cosd(60). * Add scaling to ensure that tand(45) and cotd(45) give exactly 1; there was nothing particularly guaranteeing that before. * Replace minus zero by zero when tand() or cotd() would output that; many machines did so for tand(180) and cotd(270), but not all. We could alternatively deem both results valid, but that doesn't seem likely to be what users will want.	2016-01-23 11:26:07 -05:00
Tom Lane	e1bd684a34	Add trigonometric functions that work in degrees. The implementations go to some lengths to deliver exact results for values where an exact result can be expected, such as sind(30) = 0.5 exactly. Dean Rasheed, reviewed by Michael Paquier	2016-01-22 15:46:22 -05:00
Tom Lane	fd5200c3dc	Improve cross-platform consistency of Inf/NaN handling in trig functions. Ensure that the trig functions return NaN for NaN input regardless of what the underlying C library functions might do. Also ensure that an error is thrown for Inf (or otherwise out-of-range) input, except for atan/atan2 which should accept it. All these behaviors should now conform to the POSIX spec; previously, all our popular platforms deviated from that in one case or another. The main remaining platform dependency here is whether the C library might choose to throw a domain error for sin/cos/tan inputs that are large but less than infinity. (Doing so is not unreasonable, since once a single unit-in-the-last-place exceeds PI, there can be no significance at all in the result; however there doesn't seem to be any suggestion in POSIX that such an error is allowed.) We will report such errors if they are reported via "errno", but not if they are reported via "fetestexcept" which is the other mechanism sanctioned by POSIX. Some preliminary experiments with fetestexcept indicated that it might also report errors we could do without, such as complaining about underflow at an unreasonably large threshold. So let's skip that complexity for now. Dean Rasheed, reviewed by Michael Paquier	2016-01-22 14:50:51 -05:00
Tom Lane	a396144ac0	Remove new coupling between NAMEDATALEN and MAX_LEVENSHTEIN_STRLEN. Commit `e529cd4ffa` introduced an Assert requiring NAMEDATALEN to be less than MAX_LEVENSHTEIN_STRLEN, which has been 255 for a long time. Since up to that instant we had always allowed NAMEDATALEN to be substantially more than that, this was ill-advised. It's debatable whether we need MAX_LEVENSHTEIN_STRLEN at all (versus putting a CHECK_FOR_INTERRUPTS into the loop), or whether it has to be so tight; but this patch takes the narrower approach of just not applying the MAX_LEVENSHTEIN_STRLEN limit to calls from the parser. Trusting the parser for this seems reasonable, first because the strings are limited to NAMEDATALEN which is unlikely to be hugely more than 256, and second because the maximum distance is tightly constrained by MAX_FUZZY_DISTANCE (though we'd forgotten to make use of that limit in one place). That means the cost is not really O(mn) but more like O(max(m,n)). Relaxing the limit for user-supplied calls is left for future research; given the lack of complaints to date, it doesn't seem very high priority. In passing, fix confusion between lengths-in-bytes and lengths-in-chars in comments and error messages. Per gripe from Kevin Day; solution suggested by Robert Haas. Back-patch to 9.5 where the unwanted restriction was introduced.	2016-01-22 11:53:06 -05:00
Tom Lane	647d87c56a	Make extract() do something more reasonable with infinite datetimes. Historically, extract() just returned zero for any case involving an infinite timestamp[tz] input; even cases in which the unit name was invalid. This is not very sensible. Instead, return infinity or -infinity as appropriate when the requested field is one that is monotonically increasing (e.g, year, epoch), or NULL when it is not (e.g., day, hour). Also, throw the expected errors for bad unit names. BACKWARDS INCOMPATIBLE CHANGE Vitaly Burovoy, reviewed by Vik Fearing	2016-01-21 22:26:20 -05:00
Tom Lane	d9b9289c83	Suppress compiler warning. Given the limited range of i, these shifts should not cause any problem, but that apparently doesn't stop some compilers from whining about them. David Rowley	2016-01-21 21:14:07 -05:00
Tom Lane	be44ed27b8	Improve index AMs' opclass validation procedures. The amvalidate functions added in commit `65c5fcd353` were on the crude side. Improve them in a few ways: * Perform signature checking for operators and support functions. * Apply more thorough checks for missing operators and functions, where possible. * Instead of reporting problems as ERRORs, report most problems as INFO messages and make the amvalidate function return FALSE. This allows more than one problem to be discovered per run. * Report object names rather than OIDs, and work a bit harder on making the messages understandable. Also, remove a few more opr_sanity regression test queries that are now superseded by the amvalidate checks.	2016-01-21 19:47:15 -05:00
Tom Lane	b99551832e	Add defenses against putting expanded objects into Const nodes. Putting a reference to an expanded-format value into a Const node would be a bad idea for a couple of reasons. It'd be possible for the supposedly immutable Const to change value, if something modified the referenced variable ... in fact, if the Const's reference were R/W, any function that has the Const as argument might itself change it at runtime. Also, because datumIsEqual() is pretty simplistic, the Const might fail to compare equal to other Consts that it should compare equal to, notably including copies of itself. This could lead to unexpected planner behavior, such as "could not find pathkey item to sort" errors or inferior plans. I have not been able to find any way to get an expanded value into a Const within the existing core code; but Paul Ramsey was able to trigger the problem by writing a datatype input function that returns an expanded value. The best fix seems to be to establish a rule that varlena values being placed into Const nodes should be passed through pg_detoast_datum(). That will do nothing (and cost little) in normal cases, but it will flatten expanded values and thereby avoid the above problems. Also, it will convert short-header or compressed values into canonical format, which will avoid possible unexpected lack-of-equality issues for those cases too. And it provides a last-ditch defense against putting a toasted value into a Const, which we already knew was dangerous, cf commit `2b0c86b665`. (In the light of this discussion, I'm no longer sure that that commit provided 100% protection against such cases, but this fix should do it.) The test added in commit `65c3d05e18` to catch datatype input functions with unstable results would fail for functions that returned expanded values; but it seems a bit uncharitable to deem a result unstable just because it's expressed in expanded form, so revise the coding so that we check for bitwise equality only after applying pg_detoast_datum(). That's a sufficient condition anyway given the new rule about detoasting when forming a Const. Back-patch to 9.5 where the expanded-object facility was added. It's possible that this should go back further; but in the absence of clear evidence that there's any live bug in older branches, I'll refrain for now.	2016-01-21 12:56:08 -05:00
Fujii Masao	38710a374e	Remove unused argument from ginInsertCleanup() It's an oversight in commit `dc943ad`.	2016-01-22 01:22:56 +09:00
Simon Riggs	c80b31d557	Refactor headers to split out standby defs Jeff Janes	2016-01-20 18:51:34 -08:00
Simon Riggs	978b2f65aa	Speedup 2PC by skipping two phase state files in normal path 2PC state info is written only to WAL at PREPARE, then read back from WAL at COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr checkpoint cycle will be written to disk in the same form as previously. Crash recovery path is not altered. Measured performance gains of 50-100% for short 2PC transactions by completely avoiding writing files and fsyncing. Other optimizations still available, further patches in related areas expected. Stas Kelvich and heavily edited by Simon Riggs Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas, a concrete example of how Postgres-XC has fed back ideas into PostgreSQL. Reviewed by Michael Paquier, Jeff Janes and Andres Freund Performance testing by Jesper Pedersen	2016-01-20 18:40:44 -08:00
Simon Riggs	422a55a687	Refactor to create generic WAL page read callback Previously we didn’t have a generic WAL page read callback function, surprisingly. Logical decoding has logical_read_local_xlog_page(), which was actually generic, so move that to xlogfunc.c and rename to read_local_xlog_page(). Maintain logical_read_local_xlog_page() so existing callers still work. As requested by Michael Paquier, Alvaro Herrera and Andres Freund	2016-01-20 17:18:58 -08:00
Robert Haas	45be99f8cd	Support parallel joins, and make related improvements. The core innovation of this patch is the introduction of the concept of a partial path; that is, a path which if executed in parallel will generate a subset of the output rows in each process. Gathering a partial path produces an ordinary (complete) path. This allows us to generate paths for parallel joins by joining a partial path for one side (which at the baserel level is currently always a Partial Seq Scan) to an ordinary path on the other side. This is subject to various restrictions at present, especially that this strategy seems unlikely to be sensible for merge joins, so only nested loops and hash joins paths are generated. This also allows an Append node to be pushed below a Gather node in the case of a partitioned table. Testing revealed that early versions of this patch made poor decisions in some cases, which turned out to be caused by the fact that the original cost model for Parallel Seq Scan wasn't very good. So this patch tries to make some modest improvements in that area. There is much more to be done in the area of generating good parallel plans in all cases, but this seems like a useful step forward. Patch by me, reviewed by Dilip Kumar and Amit Kapila.	2016-01-20 14:40:26 -05:00
Robert Haas	a7de3dc5c3	Support multi-stage aggregation. Aggregate nodes now have two new modes: a "partial" mode where they output the unfinalized transition state, and a "finalize" mode where they accept unfinalized transition states rather than individual values as input. These new modes are not used anywhere yet, but they will be necessary for parallel aggregation. The infrastructure also figures to be useful for cases where we want to aggregate local data and remote data via the FDW interface, and want to bring back partial aggregates from the remote side that can then be combined with locally generated partial aggregates to produce the final value. It may also be useful even when neither FDWs nor parallelism are in play, as explained in the comments in nodeAgg.c. David Rowley and Simon Riggs, reviewed by KaiGai Kohei, Heikki Linnakangas, Haribabu Kommi, and me.	2016-01-20 13:46:50 -05:00
Bruce Momjian	216d568432	Properly install dynloader.h on MSVC builds This will enable PL/Java to be cleanly compiled, as dynloader.h is a requirement. Report by Chapman Flack Patch by Michael Paquier Backpatch through 9.1	2016-01-19 23:30:29 -05:00
Alvaro Herrera	948c97958b	Add two HyperLogLog functions New functions initHyperLogLogError() and freeHyperLogLog() simplify using this module from elsewhere. Author: Tomáš Vondra Review: Peter Geoghegan	2016-01-19 17:40:15 -03:00
Tom Lane	9ff60273e3	Fix assorted inconsistencies in GiST opclass support function declarations. The conventions specified by the GiST SGML documentation were widely ignored. For example, the strategy-number argument for "consistent" and "distance" functions is specified to be a smallint, but most of the built-in support functions declared it as an integer, and for that matter the core code passed it using Int32GetDatum not Int16GetDatum. None of that makes any real difference at runtime, but it's quite confusing for newcomers to the code, and it makes it very hard to write an amvalidate() function that checks support function signatures. So let's try to instill some consistency here. Another similar issue is that the "query" argument is not of a single well-defined type, but could have different types depending on the strategy (corresponding to search operators with different righthand-side argument types). Some of the functions threw up their hands and declared the query argument as being of "internal" type, which surely isn't right ("any" would have been more appropriate); but the majority position seemed to be to declare it as being of the indexed data type, corresponding to a search operator with both input types the same. So I've specified a convention that that's what to do always. Also, the result of the "union" support function actually must be of the index's storage type, but the documentation suggested declaring it to return "internal", and some of the functions followed that. Standardize on telling the truth, instead. Similarly, standardize on declaring the "same" function's inputs as being of the storage type, not "internal". Also, somebody had forgotten to add the "recheck" argument to both the documentation of the "distance" support function and all of their SQL declarations, even though the C code was happily using that argument. Clean that up too. Fix up some other omissions in the docs too, such as documenting that union's second input argument is vestigial. So far as the errors in core function declarations go, we can just fix pg_proc.h and bump catversion. Adjusting the erroneous declarations in contrib modules is more debatable: in principle any change in those scripts should involve an extension version bump, which is a pain. However, since these changes are purely cosmetic and make no functional difference, I think we can get away without doing that.	2016-01-19 12:04:36 -05:00
Tom Lane	49b4950650	Add explicit cast to amcostestimate call. My compiler doesn't complain here, but David Rowley's does ...	2016-01-17 22:56:16 -05:00
Tom Lane	65c5fcd353	Restructure index access method API to hide most of it at the C level. This patch reduces pg_am to just two columns, a name and a handler function. All the data formerly obtained from pg_am is now provided in a C struct returned by the handler function. This is similar to the designs we've adopted for FDWs and tablesample methods. There are multiple advantages. For one, the index AM's support functions are now simple C functions, making them faster to call and much less error-prone, since the C compiler can now check function signatures. For another, this will make it far more practical to define index access methods in installable extensions. A disadvantage is that SQL-level code can no longer see attributes of index AMs; in particular, some of the crosschecks in the opr_sanity regression test are no longer possible from SQL. We've addressed that by adding a facility for the index AM to perform such checks instead. (Much more could be done in that line, but for now we're content if the amvalidate functions more or less replace what opr_sanity used to do.) We might also want to expose some sort of reporting functionality, but this patch doesn't do that. Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily editorialized on by me.	2016-01-17 19:36:59 -05:00
Tom Lane	8d290c8ec6	Re-pgindent a few files. In preparation for landing index AM interface changes.	2016-01-17 19:13:18 -05:00
Robert Haas	23c2dd03d5	Fix spelling mistakes. Same patch submitted independently by David Rowley and Peter Geoghegan.	2016-01-14 23:16:40 -05:00
Tom Lane	a923af382c	Fix build_grouping_chain() to not clobber its input lists. There's no good reason for stomping on the input data; it makes the logic in this function no simpler, in fact probably the reverse. And it makes it impossible to separate path generation from plan generation, as I'm working towards doing; that will require more than one traversal of these lists.	2016-01-14 11:51:57 -05:00
Magnus Hagander	6a61d1ff9d	Properly close token in sspi authentication We can never leak more than one token, but we shouldn't do that. We don't bother closing it in the error paths since the process will exit shortly anyway. Christian Ullrich	2016-01-14 13:06:03 +01:00
Simon Riggs	e63bb4549a	Add new user fn pg_current_xlog_flush_location() Tomas Vondra, reviewed by Michael Paquier and Amit Kapila Minor edits by me	2016-01-12 07:54:52 +00:00
Simon Riggs	1e29e6324c	Maintain local LogwrtResult consistently Teach GetFlushRecPtr() to update LogwrtResult cache as performed by all other functions in xlog.c	2016-01-12 07:33:20 +00:00
Robert Haas	950ab82c3d	Remove obsolete comment. Noted while reviewing a question from Dickson S. Guedes.	2016-01-10 21:35:33 -05:00
Tom Lane	820bdccc1b	Remove a useless PG_GETARG_DATUM() call from jsonb_build_array. This loop uselessly fetched the argument after the one it's currently looking at. No real harm is done since we couldn't possibly fetch off the end of memory, but it's confusing to the reader. Also remove a duplicate (and therefore confusing) PG_ARGISNULL check in jsonb_build_object. I happened to notice these things while trolling for missed null-arg checks earlier today. Back-patch to 9.5, not because there is any real bug, but just because 9.5 and HEAD are still in sync in this file and we might as well keep them so. In passing, re-pgindent.	2016-01-09 17:39:45 -05:00
Tom Lane	26d538dc93	Clean up some lack-of-STRICT issues in the core code, too. A scan for missed proisstrict markings in the core code turned up these functions: brin_summarize_new_values pg_stat_reset_single_table_counters pg_stat_reset_single_function_counters pg_create_logical_replication_slot pg_create_physical_replication_slot pg_drop_replication_slot The first three of these take OID, so a null argument will normally look like a zero to them, resulting in "ERROR: could not open relation with OID 0" for brin_summarize_new_values, and no action for the pg_stat_reset_XXX functions. The other three will dump core on a null argument, though this is mitigated by the fact that they won't do so until after checking that the caller is superuser or has rolreplication privilege. In addition, the pg_logical_slot_get/peek[_binary]_changes family was intentionally marked nonstrict, but failed to make nullness checks on all the arguments; so again a null-pointer-dereference crash is possible but only for superusers and rolreplication users. Add the missing ARGISNULL checks to the latter functions, and mark the former functions as strict in pg_proc. Make that change in the back branches too, even though we can't force initdb there, just so that installations initdb'd in future won't have the issue. Since none of these bugs rise to the level of security issues (and indeed the pg_stat_reset_XXX functions hardly misbehave at all), it seems sufficient to do this. In addition, fix some order-of-operations oddities in the slot_get_changes family, mostly cosmetic, but not the part that moves the function's last few operations into the PG_TRY block. As it stood, there was significant risk for an error to exit without clearing historical information from the system caches. The slot_get_changes bugs go back to 9.4 where that code was introduced. Back-patch appropriate subsets of the pg_proc changes into all active branches, as well.	2016-01-09 16:58:32 -05:00
Simon Riggs	b602842613	Revoke change to rmgr desc of btree vacuum Per discussion with Andres Freund	2016-01-09 18:31:08 +00:00
Simon Riggs	687f2cd7a0	Avoid pin scan for replay of XLOG_BTREE_VACUUM Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require complex interlocking that matched the requirements on the master. This required an O(N) operation that became a significant problem with large indexes, causing replication delays of seconds or in some cases minutes while the XLOG_BTREE_VACUUM was replayed. This commit skips the “pin scan” that was previously required, by observing in detail when and how it is safe to do so, with full documentation. The pin scan is skipped only in replay; the VACUUM code path on master is not touched here. The current commit still performs the pin scan for toast indexes, though this can also be avoided if we recheck scans on toast indexes. Later patch will address this. No tests included. Manual tests using an additional patch to view WAL records and their timing have shown the change in WAL records and their handling has successfully reduced replication delay.	2016-01-09 10:10:08 +00:00
Tom Lane	a54676acad	Marginal cleanup of GROUPING SETS code in grouping_planner(). Improve comments and make it a shade less messy. I think we might want to move all of this somewhere else later, but it needs to be more readable first. In passing, re-pgindent the file, affecting some recently-added comments concerning parallel query planning.	2016-01-07 20:32:35 -05:00
Tom Lane	c44d013835	Delay creation of subplan tlist until after create_plan(). Once upon a time it was necessary for grouping_planner() to determine the tlist it wanted from the scan/join plan subtree before it called query_planner(), because query_planner() would actually make a Plan using that. But we refactored things a long time ago to delay construction of the Plan tree till later, so there's no need to build that tlist until (and indeed unless) we're ready to plaster it onto the Plan. The only thing query_planner() cares about is what Vars are going to be needed for the tlist, and it can perfectly well get that by looking at the real tlist rather than some masticated version. Well, actually, there is one minor glitch in that argument, which is that make_subplanTargetList also adds Vars appearing only in HAVING to the tlist it produces. So now we have to account for HAVING explicitly in build_base_rel_tlists. But that just adds a few lines of code, and I doubt it moves the needle much on processing time; we might be doing pull_var_clause() twice on the havingQual, but before we had it scanning dummy tlist entries instead. This is a very small down payment on rationalizing grouping_planner enough so it can be refactored.	2016-01-07 20:23:57 -05:00
Alvaro Herrera	b1a9bad9e7	pgstat: add WAL receiver status view & SRF This new view provides insight into the state of a running WAL receiver in a HOT standby node. The information returned includes the PID of the WAL receiver process, its status (stopped, starting, streaming, etc), start LSN and TLI, last received LSN and TLI, timestamp of last message send and receipt, latest end-of-WAL LSN and time, and the name of the slot (if any). Access to the detailed data is only granted to superusers; others only get the PID. Author: Michael Paquier Reviewer: Haribabu Kommi	2016-01-07 16:21:19 -03:00
Tom Lane	6b1a837f69	Remove vestigial CHECK_FOR_INTERRUPTS call. Commit `e710b65c` inserted code in md5_crypt_verify to disable and later re-enable interrupts, with a CHECK_FOR_INTERRUPTS call as part of the second step, to process any interrupts that had been held off. Commit `6647248e` removed the interrupt disable/re-enable code, but left behind the CHECK_FOR_INTERRUPTS, even though this is now an entirely random, pointless place for one. md5_crypt_verify doesn't run long enough to need such a check, and if it did, this would still be the wrong place to put one.	2016-01-07 11:26:54 -05:00
Tom Lane	5e0b5dcab6	Provide more detail in postmaster log for password authentication failures. We tell people to examine the postmaster log if they're unsure why they are getting auth failures, but actually only a few relatively-uncommon failure cases were given their own log detail messages in commit `64e43c59b8`. Expand on that so that every failure case detected within md5_crypt_verify gets a specific log detail message. This should cover pretty much every ordinary password auth failure cause. So far I've not noticed user demand for a similar level of auth detail for the other auth methods, but sooner or later somebody might want to work on them. This is not that patch, though.	2016-01-07 11:19:33 -05:00
Alvaro Herrera	a967613911	Windows: Make pg_ctl reliably detect service status pg_ctl is using isatty() to verify whether the process is running in a terminal, and if not it sends its output to Windows' Event Log ... which does the wrong thing when the output has been redirected to a pipe, as reported in bug #13592. To fix, make pg_ctl use the code we already have to detect service-ness: in the master branch, move src/backend/port/win32/security.c to src/port (with suitable tweaks so that it runs properly in backend and frontend environments); pg_ctl already has access to pgport so it Just Works. In older branches, that's likely to cause trouble, so instead duplicate the required code in pg_ctl.c. Author: Michael Paquier Bug report and diagnosis: Egon Kocjan Backpatch: all supported branches	2016-01-07 11:59:08 -03:00
Tom Lane	4bf87169cc	Comment typo fix. Per Amit Langote.	2016-01-06 11:06:51 -05:00
Alvaro Herrera	abb1733922	Add scale(numeric) Author: Marko Tiikkaja	2016-01-05 19:02:13 -03:00
Tom Lane	419400c5da	Remove some ancient and unmaintained encoding-conversion test cruft. In commit `921191912c` I claimed that we weren't testing encoding conversion functions, but further poking around reveals that we did have an equivalent though hard-wired set of tests in conversion.sql. AFAICS there is no advantage to doing it like that as compared to letting the catalog contents drive the test, so let the opr_sanity addition stand and remove the now-redundant tests in conversion.sql. Also, remove some infrastructure in src/backend/utils/mb/conversion_procs for building conversion.sql's list of tests. That was unmaintained, and had not corresponded to the actual contents of conversion.sql since 2007 or perhaps even further back.	2016-01-05 16:43:40 -05:00
Tom Lane	ea0d494dae	Make the to_reg*() functions accept text not cstring. Using cstring as the input type was a poor decision, because that's not really a full-fledged type. In particular, it lacks implicit coercions from text or varchar, meaning that usages like to_regproc('foo'\|\|'bar') wouldn't work; basically the only case that did work without explicit casting was a simple literal constant argument. The lack of field complaints about this suggests that hardly anyone is using these functions, so hopefully fixing it won't cause much of a compatibility problem. They've only been there since 9.4, anyway. Petr Korobeinikov	2016-01-05 13:02:43 -05:00
Alvaro Herrera	efa318bcfa	Make pg_shseclabel available in early backend startup While the in-core authentication mechanism doesn't need to access pg_shseclabel at all, it's reasonable to think that an authentication hook will want to look at the label for the role logging in, or for rows in other catalogs used during the authentication phase of startup. Catalog version bumped, because this changes the "is nailed" status for pg_shseclabel. Author: Adam Brightwell	2016-01-05 14:50:53 -03:00
Tom Lane	5d35438273	Adjust behavior of row_security GUC to match the docs. Some time back we agreed that row_security=off should not be a way to bypass RLS entirely, but only a way to get an error if it was being applied. However, the code failed to act that way for table owners. Per discussion, this is a must-fix bug for 9.5.0. Adjust the logic in rls.c to behave as expected; also, modify the error message to be more consistent with the new interpretation. The regression tests need minor corrections as well. Also update the comments about row_security in ddl.sgml to be correct. (The official description of the GUC in config.sgml is already correct.) I failed to resist the temptation to do some other very minor cleanup as well, such as getting rid of a duplicate extern declaration.	2016-01-04 12:21:41 -05:00
Tom Lane	b0cadc08fe	Fix regrole and regnamespace output functions to do quoting, too. We discussed this but somehow failed to implement it...	2016-01-04 01:53:24 -05:00
Tom Lane	fb1227af67	Fix regrole and regnamespace types to honor quoting like other reg* types. Aside from any consistency arguments, this is logically necessary because the I/O functions for these types also handle numeric OID values. Without a quoting rule it is impossible to distinguish numeric OIDs from role or namespace names that happen to contain only digits. Also change the to_regrole and to_regnamespace functions to dequote their arguments. While not logically essential, this seems like a good idea since the other to_reg* functions do it. Anyone who really wants raw lookup of an uninterpreted name can fall back on the time-honored solution of (SELECT oid FROM pg_namespace WHERE nspname = whatever). Report and patch by Jim Nasby, reviewed by Michael Paquier	2016-01-04 01:03:53 -05:00
Tom Lane	f47b602df8	Fix bogus lock release in RemovePolicyById and RemoveRoleFromObjectPolicy. Can't release the AccessExclusiveLock on the target table until commit. Otherwise there is a race condition whereby other backends might service our cache invalidation signals before they can actually see the updated catalog rows. Just to add insult to injury, RemovePolicyById was closing the rel (with incorrect lock drop) and then passing the now-dangling rel pointer to CacheInvalidateRelcache. Probably the reason this doesn't fall over on CLOBBER_CACHE buildfarm members is that some outer level of the DROP logic is still holding the rel open ... but it'd have bit us on the arse eventually, no doubt.	2016-01-03 20:53:35 -05:00
Tom Lane	939d10cd87	Guard against null arguments in binary_upgrade_create_empty_extension(). The CHECK_IS_BINARY_UPGRADE macro is not sufficient security protection if we're going to dereference pass-by-reference arguments before it. But in any case we really need to explicitly check PG_ARGISNULL for all the arguments of a non-strict function, not only the ones we expect null values for. Oversight in commits `30982be4e5` and `f92fc4c95d`. Found by Andreas Seltenreich. (The other usages in pg_upgrade_support.c seem safe.)	2016-01-03 16:26:38 -05:00
Tom Lane	90e61df813	Fix treatment of lpNumberOfBytesRecvd == 0: that's a completion condition. pgwin32_recv() has treated a non-error return of zero bytes from WSARecv() as being a reason to block ever since the current implementation was introduced in commit `a4c40f140d`. However, so far as one can tell from Microsoft's documentation, that is just wrong: what it means is graceful connection closure (in stream protocols) or receipt of a zero-length message (in message protocols), and neither case should result in blocking here. The only reason the code worked at all was that control then fell into the retry loop, which did not* treat zero bytes specially, so we'd get out after only wasting some cycles. But as of 9.5 we do not normally reach the retry loop and so the bug is exposed, as reported by Shay Rojansky and diagnosed by Andres Freund. Remove the unnecessary test on the byte count, and rearrange the code in the retry loop so that it looks identical to the initial sequence. Back-patch to 9.5. The code is wrong all the way back, AFAICS, but since it's relatively harmless in earlier branches we'll leave it alone.	2016-01-03 13:56:29 -05:00
Tom Lane	7157fe80f4	Fix overly-strict assertions in spgtextproc.c. spg_text_inner_consistent is capable of reconstructing an empty string to pass down to the next index level; this happens if we have an empty string coming in, no prefix, and a dummy node label. (In practice, what is needed to trigger that is insertion of a whole bunch of empty-string values.) Then, we will arrive at the next level with in->level == 0 and a non-NULL (but zero length) in->reconstructedValue, which is valid but the Assert tests weren't expecting it. Per report from Andreas Seltenreich. This has no impact in non-Assert builds, so should not be a problem in production, but back-patch to all affected branches anyway. In passing, remove a couple of useless variable initializations and shorten the code by not duplicating DatumGetPointer() calls.	2016-01-02 16:24:50 -05:00
Bruce Momjian	ee94300446	Update copyright for 2016 Backpatch certain files through 9.1	2016-01-02 13:33:40 -05:00
Noah Misch	dfcd9cb302	Cover heap_page_prune_opt()'s cleanup lock tactic in README. Jeff Janes, reviewed by Jim Nasby.	2016-01-01 21:52:22 -05:00
Tom Lane	c7e27becd2	Teach flatten_reloptions() to quote option values safely. flatten_reloptions() supposed that it didn't really need to do anything beyond inserting commas between reloption array elements. However, in principle the value of a reloption could be nearly anything, since the grammar allows a quoted string there. Any restrictions on it would come from validity checking appropriate to the particular option, if any. A reloption value that isn't a simple identifier or number could thus lead to dump/reload failures due to syntax errors in CREATE statements issued by pg_dump. We've gotten away with not worrying about this so far with the core-supported reloptions, but extensions might allow reloption values that cause trouble, as in bug #13840 from Kouhei Sutou. To fix, split the reloption array elements explicitly, and then convert any value that doesn't look like a safe identifier to a string literal. (The details of the quoting rule could be debated, but this way is safe and requires little code.) While we're at it, also quote reloption names if they're not safe identifiers; that may not be a likely problem in the field, but we might as well try to be bulletproof here. It's been like this for a long time, so back-patch to all supported branches. Kouhei Sutou, adjusted some by me	2016-01-01 15:27:53 -05:00
Tom Lane	3c93a60f60	Add some more defenses against silly estimates to gincostestimate(). A report from Andy Colson showed that gincostestimate() was not being nearly paranoid enough about whether to believe the statistics it finds in the index metapage. The problem is that the metapage stats (other than the pending-pages count) are only updated by VACUUM, and in the worst case could still reflect the index's original empty state even when it has grown to many entries. We attempted to deal with that by scaling up the stats to match the current index size, but if nEntries is zero then scaling it up still gives zero. Moreover, the proportion of pages that are entry pages vs. data pages vs. pending pages is unlikely to be estimated very well by scaling if the index is now orders of magnitude larger than before. We can improve matters by expanding the use of the rule-of-thumb estimates I introduced in commit 7fb008c5ee59b040: if the index has grown by more than a cutoff amount (here set at 4X growth) since VACUUM, then use the rule-of-thumb numbers instead of scaling. This might not be exactly right but it seems much less likely to produce insane estimates. I also improved both the scaling estimate and the rule-of-thumb estimate to account for numPendingPages, since it's reasonable to expect that that is accurate in any case, and certainly pages that are in the pending list are not either entry or data pages. As a somewhat separate issue, adjust the estimation equations that are concerned with extra fetches for partial-match searches. These equations suppose that a fraction partialEntries / numEntries of the entry and data pages will be visited as a consequence of a partial-match search. Now, it's physically impossible for that fraction to exceed one, but our estimate of partialEntries is mostly bunk, and our estimate of numEntries isn't exactly gospel either, so we could arrive at a silly value. In the example presented by Andy we were coming out with a value of 100, leading to insane cost estimates. Clamp the fraction to one to avoid that. Like the previous patch, back-patch to all supported branches; this problem can be demonstrated in one form or another in all of them.	2016-01-01 13:42:21 -05:00
Noah Misch	3cd1ba147e	Fix comments about WAL rule "write xlog before data" versus pg_multixact. Recovery does not achieve its goal of zeroing all pg_multixact entries whose accompanying WAL records never reached disk. Remove that claim and justify its expendability. Detail the need for TrimMultiXact(), which has little in common with the TrimCLOG() rationale. Merge two tightly-related comments. Stop presenting pg_multixact as specific to heap_lock_tuple(); PostgreSQL 9.3 extended its use to heap_update(). Noticed while investigating a report from Andres Freund.	2016-01-01 01:46:46 -05:00
Tom Lane	0dab5ef39b	Fix ALTER OPERATOR to update dependencies properly. Fix an oversight in commit 321eed5f0f7563a0: replacing an operator's selectivity functions needs to result in a corresponding update in pg_depend. We have a function that can handle that, but it was not called by AlterOperator(). To fix this without enlarging pg_operator.h's #include list beyond what clients can safely include, split off the function definitions into a new file pg_operator_fn.h, similarly to what we've done for some other catalog header files. It's not entirely clear whether any client-side code needs to include pg_operator.h, but it seems prudent to assume that there is some such code somewhere.	2015-12-31 17:37:31 -05:00
Tom Lane	e5d06f2b12	Dept of second thoughts: the !scan_all exit mustn't increase scanned_pages. In the extreme edge case where contended pages are the only ones that escape being scanned, the previous commit would have allowed us to think that relfrozenxid could be advanced, which is exactly wrong.	2015-12-30 17:32:23 -05:00
Tom Lane	e842908233	Avoid useless truncation attempts during VACUUM. VACUUM can skip heap pages altogether when there's a run of consecutive pages that are all-visible according to the visibility map. This causes it to not update its nonempty_pages count, just as if those pages were empty, which means that at the end we will think they are candidates for deletion. Thus, we may take the table's AccessExclusive lock only to find that no pages are really truncatable. This usually causes no real problems on a master server, thanks to the lock being acquired only conditionally; but on hot-standby servers, the same lock must be acquired unconditionally which can result in unnecessary query cancellations. To improve matters, force examination of the table's last page whenever we reach there with a nonempty_pages count that would allow a truncation attempt. If it's not empty, we'll advance nonempty_pages and thereby prevent the truncation attempt. If we are unable to acquire cleanup lock on that page, there's no need to force it, unless we're doing an anti-wraparound vacuum. We can just check for tuples with a shared buffer lock and then give up. (When we are doing an anti-wraparound vacuum, and decide it's okay to skip the page because it contains no freezable tuples, this patch still improves matters because nonempty_pages is properly updated, which it was not before.) Since only the last page is special-cased in this way, we might attempt a truncation that will release many fewer pages than the normal heuristic would suggest; at worst, only one page would be truncated. But that seems all right, because the situation won't repeat during the next vacuum. The real problem with the old logic is that the useless truncation attempt happens every time we vacuum, so long as the state of the last few dozen pages doesn't change. This is a longstanding deficiency, but since the consequences aren't very severe in most scenarios, I'm not going to risk a back-patch. Jeff Janes and Tom Lane	2015-12-30 17:13:15 -05:00
Tom Lane	efe4c9d704	Add some comments about division of labor between rewriter and planner. The rationale for the way targetlist processing is done wasn't clearly stated anywhere, and I for one had forgotten some of the details. Having just painfully re-learned them, add some breadcrumbs for the next person.	2015-12-29 18:50:35 -05:00
Tom Lane	fd19525756	Put back one copyObject() in rewriteTargetView(). Commit `6f8cb1e234` tried to centralize rewriteTargetView's copying of a target view's Query struct. However, it ignored the fact that the jointree->quals field was used twice. This only accidentally failed to fail immediately because the same ChangeVarNodes mutation is applied in both cases, so that we end up with logically identical expression trees for both uses (and, as the code stands, the second ChangeVarNodes call actually does nothing). However, we end up linking physically identical expression trees into both an RTE's securityQuals list and the WithCheckOption list. That's pretty dangerous, mainly because prepsecurity.c is utterly cavalier about further munging such structures without copying them first. There may be no live bug in HEAD as a consequence of the fact that we apply preprocess_expression in between here and prepsecurity.c, and that will make a copy of the tree anyway. Or it may just be that the regression tests happen to not trip over it. (I noticed this only because things fell over pretty badly when I tried to relocate the planner's call of expand_security_quals to before expression preprocessing.) In any case it's very fragile because if anyone tried to make the securityQuals and WithCheckOption trees diverge before we reach preprocess_expression, it would not work. The fact that the current code will preprocess securityQuals and WithCheckOptions lists at completely different times in different query levels does nothing to increase my trust that that can't happen. In view of the fact that 9.5.0 is almost upon us and the aforesaid commit has seen exactly zero field testing, the prudent course is to make an extra copy of the quals so that the behavior is not different from what has been in the field during beta.	2015-12-29 16:45:47 -05:00
Joe Conway	241448b23a	Rename (new\|old)estCommitTs to (new\|old)estCommitTsXid The variables newestCommitTs and oldestCommitTs sound as if they are timestamps, but in fact they are the transaction Ids that correspond to the newest and oldest timestamps rather than the actual timestamps. Rename these variables to reflect that they are actually xids: to wit newestCommitTsXid and oldestCommitTsXid respectively. Also modify related code in a similar fashion, particularly the user facing output emitted by pg_controldata and pg_resetxlog. Complaint and patch by me, review by Tom Lane and Alvaro Herrera. Backpatch to 9.5 where these variables were first introduced.	2015-12-28 12:34:11 -08:00
Tom Lane	fec1ad94df	Include typmod when complaining about inherited column type mismatches. MergeAttributes() rejects cases where columns to be merged have the same type but different typmod, which is correct; but the error message it printed didn't show either typmod, which is unhelpful. Changing this requires using format_type_with_typemod() in place of TypeNameToString(), which will have some minor side effects on the way some type names are printed, but on balance this is an improvement: the old code sometimes printed one type according to one set of rules and the other type according to the other set, which could be confusing in its own way. Oddly, there were no regression test cases covering any of this behavior, so add some. Complaint and fix by Amit Langote	2015-12-26 13:41:29 -05:00
Tom Lane	3d2b31e30e	Fix brin_summarize_new_values() to check index type and ownership. brin_summarize_new_values() did not check that the passed OID was for an index at all, much less that it was a BRIN index, and would fail in obscure ways if it wasn't (possibly damaging data first?). It also lacked any permissions test; by analogy to VACUUM, we should only allow the table's owner to summarize. Noted by Jeff Janes, fix by Michael Paquier and me	2015-12-26 12:56:09 -05:00
Robert Haas	bc7fcab5e3	Read from the same worker repeatedly until it returns no tuple. The original coding read tuples from workers in round-robin fashion, but performance testing shows that it works much better to read enough to empty one queue before moving on to the next. I believe the reason for this is that, with the old approach, we could easily wake up a worker repeatedly to write only one new tuple into the shm_mq each time. With this approach, by the time the process gets scheduled, it has a decent chance of being able to fill the entire buffer in one go. Patch by me. Dilip Kumar helped with performance testing.	2015-12-23 14:06:52 -05:00
Robert Haas	51d152f18e	Change Gather not to use a physical tlist. This should have been part of the original commit, but was missed. Pushing data between processes is expensive, so we definitely want to project away unneeded columns here, just as we do for other nodes like Sort and Hash that care about the volume of data.	2015-12-23 13:41:06 -05:00
Peter Eisentraut	30c0c4bf12	Remove unnecessary escaping in C character literals '\"' is more commonly written simply as '"'.	2015-12-22 22:43:46 -05:00
Tom Lane	6efbded6e4	Allow omitting one or both boundaries in an array slice specifier. Omitted boundaries represent the upper or lower limit of the corresponding array subscript. This allows simpler specification of many common use-cases. (Revised version of commit `9246af6799`) YUriy Zhuravlev	2015-12-22 21:05:29 -05:00
Robert Haas	0ba3f3bc65	Comment improvements for abbreviated keys. Peter Geoghegan and Robert Haas	2015-12-22 13:57:18 -05:00
Robert Haas	ccd8f97922	postgres_fdw: Consider requesting sorted data so we can do a merge join. When use_remote_estimate is enabled, consider adding ORDER BY to the query we sending to the remote server so that we can use that ordered data for a merge join. Commit `f18c944b61` arranges to push down the query pathkeys, which seems like the case mostly likely to be a win, but testing shows this can sometimes win, too. For a regular table, we know which indexes are present and therefore test whether the ordering provided by each such index is useful. Here, we take the opposite approach: guess what orderings would be useful if they could be generated cheaply, and then ask the remote side what those will cost. Ashutosh Bapat, with very substantial cosmetic revisions by me. Also reviewed by Rushabh Lathia.	2015-12-22 13:46:40 -05:00
Stephen Frost	6f8cb1e234	Make viewquery a copy in rewriteTargetView() Rather than expect the Query returned by get_view_query() to be read-only and then copy bits and pieces of it out, simply copy the entire structure when we get it. This addresses an issue where AcquireRewriteLocks, which is called by acquireLocksOnSubLinks(), scribbles on the parsetree passed in, which was actually an entry in relcache, leading to segfaults with certain view definitions. This also future-proofs us a bit for anyone adding more code to this path. The acquireLocksOnSubLinks() was added in commit `c3e0ddd40`. Back-patch to 9.3 as that commit was.	2015-12-21 10:34:14 -05:00
Teodor Sigaev	bbbd807097	Revert `9246af6799` because I miss too much. Patch is returned to commitfest process.	2015-12-18 21:35:22 +03:00
Robert Haas	6e7b335930	Remove duplicate word. Kyotaro Horiguchi	2015-12-18 12:43:52 -05:00
Robert Haas	2bdfcb52c5	Fix TupleQueueReaderNext not to ignore its nowait argument. This was a silly goof on my (rhaas's) part. Report and fix by Rushabh Lathia.	2015-12-18 12:37:43 -05:00
Robert Haas	4496226782	Fix copy-and-paste error in logical decoding callback. This could result in the error context misidentifying where the error actually occurred. Craig Ringer	2015-12-18 12:17:35 -05:00
Robert Haas	9a51698bae	Fix typo in comment. Amit Langote	2015-12-18 12:03:15 -05:00
Teodor Sigaev	9246af6799	Allow to omit boundaries in array subscript Allow to omiy lower or upper or both boundaries in array subscript for selecting slice of array. Author: YUriy Zhuravlev	2015-12-18 15:18:58 +03:00
Tom Lane	c4a8812cf6	Use just one standalone-backend session for initdb's post-bootstrap steps. Previously, each subroutine in initdb fired up its own standalone backend session. Over time we'd grown as many as fifteen of these sessions, and the cumulative startup and shutdown work for them was getting pretty noticeable. Combining things so that all these steps share a single backend session cuts a good 10% off the total runtime of initdb, more if you're not fsync'ing. The main stumbling block to doing this before was that some of the sessions were run with -j and some not. The improved definition of -j mode implemented by my previous commit makes it possible to fix that by running all the post-bootstrap steps with -j; we just have to use double instead of single newlines to end command strings. (This is only absolutely necessary around the VACUUM and CREATE DATABASE steps, since those can't be run in a transaction block. But it seems best to make them all use double newlines so that the commands remain separate for error-reporting purposes.) A minor disadvantage is that since initdb can't tell how much of its output the backend has executed, we can no longer have the per-step progress reporting initdb used to print. But things are fast enough nowadays that that's not really all that useful anyway. In passing, add more const decoration to some of the static arrays in initdb.c.	2015-12-17 19:38:21 -05:00
Tom Lane	66d947b9d3	Adjust behavior of single-user -j mode for better initdb error reporting. Previously, -j caused the entire input file to be read in and executed as a single command string. That's undesirable, not least because any error causes the entire file to be regurgitated as the "failing query". Some experimentation suggests a better rule: end the command string when we see a semicolon immediately followed by two newlines, ie, an empty line after a query. This serves nicely to break up the existing examples such as information_schema.sql and system_views.sql. A limitation is that it's no longer possible to write such a sequence within a string literal or multiline comment in a file meant to be read with -j; but there are no instances of such a problem within the data currently used by initdb. (If someone does make such a mistake in future, it'll be obvious because they'll get an unterminated-literal or unterminated-comment syntax error.) Other than that, there shouldn't be any negative consequences; you're not forced to end statements that way, it's just a better idea in most cases. In passing, remove src/include/tcop/tcopdebug.h, which is dead code because it's not included anywhere, and hasn't been for more than ten years. One of the debug-support symbols it purported to describe has been unreferenced for at least the same amount of time, and the other is removed by this commit on the grounds that it was useless: forcing -j mode all the time would have broken initdb. The lack of complaints about that, or about the missing inclusion, shows that no one has tried to use TCOP_DONTUSENEWLINE in many years.	2015-12-17 19:34:15 -05:00
Alvaro Herrera	756e7b4c9d	Rework internals of changing a type's ownership This is necessary so that REASSIGN OWNED does the right thing with composite types, to wit, that it also alters ownership of the type's pg_class entry -- previously, the pg_class entry remained owned by the original user, which caused later other failures such as the new owner's inability to use ALTER TYPE to rename an attribute of the affected composite. Also, if the original owner is later dropped, the pg_class entry becomes owned by a non-existant user which is bogus. To fix, create a new routine AlterTypeOwner_oid which knows whether to pass the request to ATExecChangeOwner or deal with it directly, and use that in shdepReassignOwner rather than calling AlterTypeOwnerInternal directly. AlterTypeOwnerInternal is now simpler in that it only modifies the pg_type entry and recurses to handle a possible array type; higher-level tasks are handled by either AlterTypeOwner directly or AlterTypeOwner_oid. I took the opportunity to add a few more objects to the test rig for REASSIGN OWNED, so that more cases are exercised. Additional ones could be added for superuser-only-ownable objects (such as FDWs and event triggers) but I didn't want to push my luck by adding a new superuser to the tests on a backpatchable bug fix. Per bug #13666 reported by Chris Pacejo. Backpatch to 9.5. (I would back-patch this all the way back, except that it doesn't apply cleanly in 9.4 and earlier because `59367fdf9` wasn't backpatched. If we decide that we need this in earlier branches too, we should backpatch both.)	2015-12-17 14:25:41 -03:00
Robert Haas	b648b70342	Speed up CREATE INDEX CONCURRENTLY's TID sort. Encode TIDs as 64-bit integers to speed up comparisons. This seems to speed things up on all platforms, but is even more beneficial when 8-byte integers are passed by value. Peter Geoghegan. Design suggestions and review by Tom Lane. Review also by Simon Riggs and by me.	2015-12-16 15:23:45 -05:00
Robert Haas	f27a6b15e6	Mark CHECK constraints declared NOT VALID valid if created with table. FOREIGN KEY constraints have behaved this way for a long time, but for some reason the behavior of CHECK constraints has been inconsistent up until now. Amit Langote and Amul Sul, with assorted tweaks by me.	2015-12-16 07:43:56 -05:00
Robert Haas	049469e7e7	Teach mdnblocks() not to create zero-length files. It's entirely surprising that mdnblocks() has the side effect of creating new files on disk, so let's make it not do that. One consequence of the old behavior is that, if running on a damaged cluster that is missing a file, mdnblocks() can recreate the file and allow a subsequent _mdfd_getseg() for a higher segment to succeed. This happens because, while mdnblocks() stops when it finds a segment that is shorter than 1GB, _mdfd_getseg() has no such check, and thus the empty file created by mdnblocks() can allow it to continue its traversal and find higher-numbered segments which remain. It might be a good idea for _mdfd_getseg() to actually verify that each segment it finds is exactly 1GB before proceeding to the next one, but that would involve some additional system calls, so for now I'm just doing this much. Patch by me, per off-list analysis by Kevin Grittner and Rahila Syed. Review by Andres Freund.	2015-12-15 13:57:45 -05:00
Robert Haas	6150a1b08a	Move buffer I/O and content LWLocks out of the main tranche. Move the content lock directly into the BufferDesc, so that locking and pinning a buffer touches only one cache line rather than two. Adjust the definition of BufferDesc slightly so that this doesn't make the BufferDesc any larger than one cache line (at least on platforms where a spinlock is only 1 or 2 bytes). We can't fit the I/O locks into the BufferDesc and stay within one cache line, so move those to a completely separate tranche. This leaves a relatively limited number of LWLocks in the main tranche, so increase the padding of those remaining locks to a full cache line, rather than allowing adjacent locks to share a cache line, hopefully reducing false sharing. Performance testing shows that these changes make little difference on laptop-class machines, but help significantly on larger servers, especially those with more than 2 sockets. Andres Freund, originally based on an earlier patch by Simon Riggs. Review and cosmetic adjustments (including heavy rewriting of the comments) by me.	2015-12-15 13:32:54 -05:00
Robert Haas	3fed417452	Provide a way to predefine LWLock tranche IDs. It's a bit cumbersome to use LWLockNewTrancheId(), because the returned value needs to be shared between backends so that each backend can call LWLockRegisterTranche() with the correct ID. So, for built-in tranches, use a hard-coded value instead. This is motivated by an upcoming patch adding further built-in tranches. Andres Freund and Robert Haas	2015-12-15 11:48:19 -05:00
Stephen Frost	e5e11c8cca	Collect the global OR of hasRowSecurity flags for plancache We carry around information about if a given query has row security or not to allow the plancache to use that information to invalidate a planned query in the event that the environment changes. Previously, the flag of one of the subqueries was simply being copied into place to indicate if the query overall included RLS components. That's wrong as we need the global OR of all subqueries. Fix by changing the code to match how fireRIRules works, which is results in OR'ing all of the flags. Noted by Tom. Back-patch to 9.5 where RLS was introduced.	2015-12-14 20:05:43 -05:00
Alvaro Herrera	0d8f3d5d11	Add missing CHECK_FOR_INTERRUPTS in lseg_inside_poly Apparently, there are bugs in this code that cause it to loop endlessly. That bug still needs more research, but in the meantime it's clear that the loop is missing a check for interrupts so that it can be cancelled timely. Backpatch to 9.1 -- this has been missing since `49475aab8d`.	2015-12-14 16:44:40 -03:00
Andres Freund	cca705a5d9	Fix bug in SetOffsetVacuumLimit() triggered by find_multixact_start() failure. Previously, if find_multixact_start() failed, SetOffsetVacuumLimit() would install 0 into MultiXactState->offsetStopLimit if it previously succeeded. Luckily, there are no known cases where find_multixact_start() will return an error in 9.5 and above. But if it were to happen, for example due to filesystem permission issues, it'd be somewhat bad: GetNewMultiXactId() could continue allocating mxids even if close to a wraparound, or it could erroneously stop allocating mxids, even if no wraparound is looming. The wrong value would be corrected the next time SetOffsetVacuumLimit() is called, or by a restart. Reported-By: Noah Misch, although this is not his preferred fix Discussion: 20151210140450.GA22278@alap3.anarazel.de Backpatch: 9.5, where the bug was introduced as part of 4f627f	2015-12-14 11:34:16 +01:00
Andres Freund	2a3544960e	Correct statement to actually be the intended assert statement. `e3f4cfc7` introduced a LWLockHeldByMe() call, without the corresponding Assert() surrounding it. Spotted by Coverity. Backpatch: 9.1+, like the previous commit	2015-12-14 11:23:24 +01:00
Magnus Hagander	a91bdf67c4	Consistently set all fields in pg_stat_replication to null instead of 0 Previously the "sent" field would be set to 0 and all other xlog pointers be set to NULL if there were no valid values (such as when in a backup sending walsender).	2015-12-13 16:53:38 +01:00
Magnus Hagander	263c19572b	Properly initialize write, flush and replay locations in walsender slots These would leak random xlog positions if a walsender used for backup would a walsender slot previously used by a replication walsender. In passing also fix a couple of cases where the xlog pointer is directly compared to zero instead of using XLogRecPtrIsInvalid, noted by Michael Paquier.	2015-12-13 16:46:56 +01:00
Andres Freund	f54d0629ec	Fix ALTER TABLE ... SET TABLESPACE for unlogged relations. Changing the tablespace of an unlogged relation did not WAL log the creation and content of the init fork. Thus, after a standby is promoted, unlogged relation cannot be accessed anymore, with errors like: ERROR: 58P01: could not open file "pg_tblspc/...": No such file or directory Additionally the init fork was not synced to disk, independent of the configured wal_level, a relatively small durability risk. Investigation of that problem also brought to light that, even for permanent relations, the creation of !main forks was not WAL logged, i.e. no XLOG_SMGR_CREATE record were emitted. That mostly turns out not to be a problem, because these files were created when the actual relation data is copied; nonexistent files are not treated as an error condition during replay. But that doesn't work for empty files, and generally feels a bit haphazard. Luckily, outside init and main forks, empty forks don't occur often or are not a problem. Add the required WAL logging and syncing to disk. Reported-By: Michael Paquier Author: Michael Paquier and Andres Freund Discussion: 20151210163230.GA11331@alap3.anarazel.de Backpatch: 9.1, where unlogged relations were introduced	2015-12-12 14:17:39 +01:00
Alvaro Herrera	8c1615531f	For REASSIGN OWNED for foreign user mappings As reported in bug #13809 by Alexander Ashurkov, the code for REASSIGN OWNED hadn't gotten word about user mappings. Deal with them in the same way default ACLs do, which is to ignore them altogether; they are handled just fine by DROP OWNED. The other foreign object cases are already handled correctly by both commands. Also add a REASSIGN OWNED statement to foreign_data test to exercise the foreign data objects. (The changes are just before the "cleanup" phase, so it shouldn't remove any existing live test.) Reported by Alexander Ashurkov, then independently by Jaime Casanova.	2015-12-11 18:39:09 -03:00
Stephen Frost	833728d4c8	Handle policies during DROP OWNED BY DROP OWNED BY handled GRANT-based ACLs but was not removing roles from policies. Fix that by having DROP OWNED BY remove the role specified from the list of roles the policy (or policies) apply to, or the entire policy (or policies) if it only applied to the role specified. As with ACLs, the DROP OWNED BY caller must have permission to modify the policy or a WARNING is thrown and no change is made to the policy.	2015-12-11 16:12:25 -05:00
Tom Lane	4fcf48450d	Get rid of the planner's LateralJoinInfo data structure. I originally modeled this data structure on SpecialJoinInfo, but after commit `acfcd45cac` that looks like a pretty poor decision. All we really need is relid sets identifying laterally-referenced rels; and most of the time, what we want to know about includes indirect lateral references, a case the LateralJoinInfo data was unsuited to compute with any efficiency. The previous commit redefined RelOptInfo.lateral_relids as the transitive closure of lateral references, so that it easily supports checking indirect references. For the places where we really do want just direct references, add a new RelOptInfo field direct_lateral_relids, which is easily set up as a copy of lateral_relids before we perform the transitive closure calculation. Then we can just drop lateral_info_list and LateralJoinInfo and the supporting code. This makes the planner's handling of lateral references noticeably more efficient, and shorter too. Such a change can't be back-patched into stable branches for fear of breaking extensions that might be looking at the planner's data structures; but it seems not too late to push it into 9.5, so I've done so.	2015-12-11 15:52:38 -05:00
Stephen Frost	ed8bec915e	Handle dependencies properly in ALTER POLICY ALTER POLICY hadn't fully considered partial policy alternation (eg: change just the roles on the policy, or just change one of the expressions) when rebuilding the dependencies. Instead, it would happily remove all dependencies which existed for the policy and then only recreate the dependencies for the objects referred to in the specific ALTER POLICY command. Correct that by extracting and building the dependencies for all objects referenced by the policy, regardless of if they were provided as part of the ALTER POLICY command or were already in place as part of the pre-existing policy.	2015-12-11 15:43:03 -05:00
Tom Lane	acfcd45cac	Still more fixes for planner's handling of LATERAL references. More fuzz testing by Andreas Seltenreich exposed that the planner did not cope well with chains of lateral references. If relation X references Y laterally, and Y references Z laterally, then we will have to scan X on the inside of a nestloop with Z, so for all intents and purposes X is laterally dependent on Z too. The planner did not understand this and would generate intermediate joins that could not be used. While that was usually harmless except for wasting some planning cycles, under the right circumstances it would lead to "failed to build any N-way joins" or "could not devise a query plan" planner failures. To fix that, convert the existing per-relation lateral_relids and lateral_referencers relid sets into their transitive closures; that is, they now show all relations on which a rel is directly or indirectly laterally dependent. This not only fixes the chained-reference problem but allows some of the relevant tests to be made substantially simpler and faster, since they can be reduced to simple bitmap manipulations instead of searches of the LateralJoinInfo list. Also, when a PlaceHolderVar that is due to be evaluated at a join contains lateral references, we should treat those references as indirect lateral dependencies of each of the join's base relations. This prevents us from trying to join any individual base relations to the lateral reference source before the join is formed, which again cannot work. Andreas' testing also exposed another oversight in the "dangerous PlaceHolderVar" test added in commit `85e5e222b1`. Simply rejecting unsafe join paths in joinpath.c is insufficient, because in some cases we will end up rejecting all possible paths for a particular join, again leading to "could not devise a query plan" failures. The restriction has to be known also to join_is_legal and its cohort functions, so that they will not select a join for which that will happen. I chose to move the supporting logic into joinrels.c where the latter functions are. Back-patch to 9.3 where LATERAL support was introduced.	2015-12-11 14:22:20 -05:00
Alvaro Herrera	69e7235c93	Fix commit timestamp initialization This module needs explicit initialization in order to replay WAL records in recovery, but we had broken this recently following changes to make other (stranger) scenarios work correctly. To fix, rework the initialization sequence so that it always takes place before WAL replay commences for both master and standby. I could have gone for a more localized fix that just added a "startup" call for the master server, but it seemed better to restructure the existing callers as well so that the whole thing made more sense. As a drawback, there is more control logic in xlog.c now than previously, but doing otherwise meant passing down the ControlFile flag, which seemed uglier as a whole. This also meant adding a check to not re-execute ActivateCommitTs if it had already been called. Reported by Fujii Masao. Backpatch to 9.5.	2015-12-11 14:30:43 -03:00
Peter Eisentraut	a351705d8a	Improve some messages	2015-12-10 22:05:27 -05:00
Andres Freund	84ac126ee7	Fix ON CONFLICT UPDATE bug breaking AFTER UPDATE triggers. ExecOnConflictUpdate() passed t_ctid of the to-be-updated tuple to ExecUpdate(). That's problematic primarily because of two reason: First and foremost t_ctid could point to a different tuple. Secondly, and that's what triggered the complaint by Stanislav, t_ctid is changed by heap_update() to point to the new tuple version. The behavior of AFTER UPDATE triggers was therefore broken, with NEW.* and OLD.* tuples spuriously identical within AFTER UPDATE triggers. To fix both issues, pass a pointer to t_self of a on-stack HeapTuple instead. Fixing this bug lead to one change in regression tests, which previously failed due to the first issue mentioned above. There's a reasonable expectation that test fails, as it updates one row repeatedly within one INSERT ... ON CONFLICT statement. That is only possible if the second update is triggered via ON CONFLICT ... SET, ON CONFLICT ... WHERE, or by a WITH CHECK expression, as those are executed after ExecOnConflictUpdate() does a visibility check. That could easily be prohibited, but given it's allowed for plain UPDATEs and a rare corner case, it doesn't seem worthwhile. Reported-By: Stanislav Grozev Author: Andres Freund and Peter Geoghegan Discussion: CAA78GVqy1+LisN-8DygekD_Ldfy=BJLarSpjGhytOsgkpMavfQ@mail.gmail.com Backpatch: 9.5, where ON CONFLICT was introduced	2015-12-10 16:29:26 +01:00
Andres Freund	e3f4cfc7aa	Fix bug leading to restoring unlogged relations from empty files. At the end of crash recovery, unlogged relations are reset to the empty state, using their init fork as the template. The init fork is copied to the main fork without going through shared buffers. Unfortunately WAL replay so far has not necessarily flushed writes from shared buffers to disk at that point. In normal crash recovery, and before the introduction of 'fast promotions' in `fd4ced523` / 9.3, the END_OF_RECOVERY checkpoint flushes the buffers out in time. But with fast promotions that's not the case anymore. To fix, force WAL writes targeting the init fork to be flushed immediately (using the new FlushOneBuffer() function). In 9.5+ that flush can centrally be triggered from the code dealing with restoring full page writes (XLogReadBufferForRedoExtended), in earlier releases that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay function. Backpatch to 9.1, even if this currently is only known to trigger in 9.3+. Flushing earlier is more robust, and it is advantageous to keep the branches similar. Typical symptoms of this bug are errors like 'ERROR: index "..." contains unexpected zero page at block 0' shortly after promoting a node. Reported-By: Thom Brown Author: Andres Freund and Michael Paquier Discussion: 20150326175024.GJ451@alap3.anarazel.de Backpatch: 9.1-	2015-12-10 16:29:26 +01:00
Robert Haas	b287df70e4	Allow EXPLAIN (ANALYZE, VERBOSE) to display per-worker statistics. The original parallel sequential scan commit included only very limited changes to the EXPLAIN output. Aggregated totals from all workers were displayed, but there was no way to see what each individual worker did or to distinguish the effort made by the workers from the effort made by the leader. Per a gripe by Thom Brown (and maybe others). Patch by me, reviewed by Amit Kapila.	2015-12-09 13:21:19 -05:00
Kevin Grittner	25c5392330	Improve performance in freeing memory contexts The single linked list of memory contexts could result in O(N^2) performance to free a set of contexts if they were not freed in reverse order of creation. In many cases the reverse order was used, but there were some significant exceptions that caused real- world performance problems. Rather than requiring all callers to care about the order in which contexts were freed, and hunting down and changing all existing cases where the wrong order was used, we add one pointer per memory context so that the implementation details are not so visible. Jan Wieck	2015-12-08 17:32:49 -06:00
Robert Haas	385f337c9f	Allow foreign and custom joins to handle EvalPlanQual rechecks. Commit `e7cb7ee145` provided basic infrastructure for allowing a foreign data wrapper or custom scan provider to replace a join of one or more tables with a scan. However, this infrastructure failed to take into account the need for possible EvalPlanQual rechecks, and ExecScanFetch would fail an assertion (or just overwrite memory) if such a check was attempted for a plan containing a pushed-down join. To fix, adjust the EPQ machinery to skip some processing steps when scanrelid == 0, making those the responsibility of scan's recheck method, which also has the responsibility in this case of correctly populating the relevant slot. To allow foreign scans to gain control in the right place to make use of this new facility, add a new, optional RecheckForeignScan method. Also, allow a foreign scan to have a child plan, which can be used to correctly populate the slot (or perhaps for something else, but this is the only use currently envisioned). KaiGai Kohei, reviewed by Robert Haas, Etsuro Fujita, and Kyotaro Horiguchi.	2015-12-08 12:31:03 -05:00
Tom Lane	edca44b152	Simplify LATERAL-related calculations within add_paths_to_joinrel(). While convincing myself that commit `7e19db0c09` would solve both of the problems recently reported by Andreas Seltenreich, I realized that add_paths_to_joinrel's handling of LATERAL restrictions could be made noticeably simpler and faster if we were to retain the minimum possible parameterization for each joinrel (that is, the set of relids supplying unsatisfied lateral references in it). We already retain that for baserels, in RelOptInfo.lateral_relids, so we can use that field for joinrels too. I re-pgindent'd the files touched here, which affects some unrelated comments. This is, I believe, just a minor optimization not a bug fix, so no back-patch.	2015-12-07 18:56:17 -05:00
Tom Lane	7e19db0c09	Fix another oversight in checking if a join with LATERAL refs is legal. It was possible for the planner to decide to join a LATERAL subquery to the outer side of an outer join before the outer join itself is completed. Normally that's fine because of the associativity rules, but it doesn't work if the subquery contains a lateral reference to the inner side of the outer join. In such a situation the outer join must be done first. join_is_legal() missed this consideration and would allow the join to be attempted, but the actual path-building code correctly decided that no valid join path could be made, sometimes leading to planner errors such as "failed to build any N-way joins". Per report from Andreas Seltenreich. Back-patch to 9.3 where LATERAL support was added.	2015-12-07 17:42:11 -05:00
Alvaro Herrera	820ddb2c2f	Further tweak commit_timestamp behavior As pointed out by Fujii Masao, we weren't quite there on a standby behaving sanely: first because we were failing to acquire the correct state in the case where no XLOG_PARAMETER_CHANGE message was sent (because a checkpoint had already happened after the setting was changed in the master, and then the standby was restarted); and second because promoting the standby with the feature enabled failed to activate it if the master had the feature disabled. This patch fixes both those misbehaviors hopefully without re-introducing any old problems. Also change the hint emitted in a standby together with the error message about the feature being disabled, to make it point out that the place to chance the setting is the master. Otherwise, if the setting is already enabled in the standby, it is very confusing to have it say that the setting must be enabled ... Authors: Álvaro Herrera, Petr Jelínek. Backpatch to 9.5.	2015-12-03 19:22:31 -03:00
Robert Haas	c7485a82c3	Add handling for GatherPath to print_path. Peter Geoghegan	2015-12-02 08:19:50 -05:00
Tom Lane	7fb008c5ee	Make gincostestimate() cope with hypothetical GIN indexes. We tried to fetch statistics data from the index metapage, which does not work if the index isn't actually present. If the index is hypothetical, instead extrapolate some plausible internal statistics based on the index page count provided by the index-advisor plugin. There was already some code in gincostestimate() to invent internal stats in this way, but since it was only meant as a stopgap for pre-9.1 GIN indexes that hadn't been vacuumed since upgrading, it was pretty crude. If we want it to support index advisors, we should try a little harder. A small amount of testing says that it's better to estimate the entry pages as 90% of the index, not 100%. Also, estimating the number of entries (keys) as equal to the heap tuple count could be wildly wrong in either direction. Instead, let's estimate 100 entries per entry page. Perhaps someday somebody will want the index advisor to be able to provide these numbers more directly, but for the moment this should serve. Problem report and initial patch by Julien Rouhaud; modified by me to invent less-bogus internal statistics. Back-patch to all supported branches, since we've supported index advisors since 9.0.	2015-12-01 16:24:34 -05:00
Robert Haas	3690dc6b03	Fix obsolete comment. It's amazing how fast things become obsolete these days. Amit Langote	2015-11-30 12:54:46 -05:00
Tom Lane	ec7eef6b11	Avoid caching expression state trees for domain constraints across queries. In commit `8abb3cda0d` I attempted to cache the expression state trees constructed for domain CHECK constraints for the life of the backend (assuming the domain's constraints don't get redefined). However, this turns out not to work very well, because execQual.c will run those state trees with ecxt_per_query_memory pointing to a query-lifespan context, and in some situations we'll end up with pointers into that context getting stored into the state trees. This happens in particular with SQL-language functions, as reported by Emre Hasegeli, but there are many other cases. To fix, keep only the expression plan trees for domain CHECK constraints in the typcache's data structure, and revert to performing ExecInitExpr (at least) once per query to set up expression state trees in the query's context. Eventually it'd be nice to undo this, but that will require some careful thought about memory management for expression state trees, and it seems far too late for any such redesign in 9.5. This way is still much more efficient than what happened before `8abb3cda0`.	2015-11-29 18:18:42 -05:00
Tom Lane	8d32717b6b	Avoid doing encoding conversions by double-conversion via MULE_INTERNAL. Previously, we did many conversions for Cyrillic and Central European single-byte encodings by converting to a related MULE_INTERNAL coding scheme before converting to the destination. This seems unnecessarily inefficient. Moreover, if the conversion encounters an untranslatable character, the error message will confusingly complain about failure to convert to or from MULE_INTERNAL, rather than the user-visible encodings. Worse still, this approach results in some completely unnecessary conversion failures; there are cases where the chosen MULE subset lacks characters that exist in both of the user-visible encodings, causing a conversion failure that need not occur. This patch fixes the first two of those deficiencies by introducing a new local2local() conversion support subroutine for direct conversion between any two single-byte character sets, and adding new conversion tables where needed. However, I generated the new conversion tables by testing PG 9.5's behavior, so that the actual conversion behavior is bug-compatible with previous releases; the only user-visible behavior change is that the error messages for conversion failures are saner. Changes in the conversion behavior will probably ensue after discussion. Interestingly, although this approach requires more tables, the .so files actually end up smaller (at least on my x86_64 machine); the tables are smaller than the management code needed for double conversion. Per a complaint from Albe Laurenz.	2015-11-28 13:42:27 -05:00
Tom Lane	5afdfc9cbb	Update UCS_to_GB18030.pl with info about origin of the reference file.	2015-11-27 17:31:26 -05:00
Tom Lane	e17dab53ea	Auto-generate file header comments in Unicode mapping files. Some of the Unicode/.map files had identification comments added to them, evidently by hand. Others did not. Modify the generating scripts to produce these comments automatically, and update the generated files that lacked them. This is just minor cleanup as a by-product of trying to verify that the .map files can indeed be reproduced from authoritative data. There are a depressingly large number that fail to reproduce from the claimed sources. I have not touched those in this commit, except for the JIS 2004-related files which required only a single comment update to match. Since this only affects comments, no need to consider a back-patch.	2015-11-27 16:50:47 -05:00
Teodor Sigaev	92e38182d7	COPY (INSERT/UPDATE/DELETE .. RETURNING ..) Attached is a patch for being able to do COPY (query) without a CTE. Author: Marko Tiikkaja Review: Michael Paquier	2015-11-27 19:11:22 +03:00
Tom Lane	0da3a9bef7	Fix failure to consider failure cases in GetComboCommandId(). Failure to initially palloc the comboCids array, or to realloc it bigger when needed, left combocid's data structures in an inconsistent state that would cause trouble if the top transaction continues to execute. Noted while examining a user complaint about the amount of memory used for this. (There's not much we can do about that, but it does point up that repalloc failure has a non-negligible chance of occurring here.) In HEAD/9.5, also avoid possible invocation of memcpy() with a null pointer in SerializeComboCIDState; cf commit `13bba0227`.	2015-11-26 13:23:02 -05:00
Tom Lane	46166197c3	Improve div_var_fast(), mostly by making comments better. The integer overflow situation in div_var_fast() is a great deal more complicated than the pre-existing comments would suggest. Moreover, the comments were also flat out incorrect as to the precise statement of the maxdiv loop invariant. Upon clarifying that, it becomes apparent that the way in which we updated maxdiv after a carry propagation pass was overly slow, complex, and conservative: we can just reset it to one, which is much easier and also reduces the number of times carry propagation occurs. Fix that and improve the relevant comments. Since this is mostly a comment fix, with only a rather marginal performance boost, no need for back-patch. Tom Lane and Dean Rasheed	2015-11-25 16:05:57 -05:00
Tom Lane	00cdd83521	Adopt the GNU convention for handling tar-archive members exceeding 8GB. The POSIX standard for tar headers requires archive member sizes to be printed in octal with at most 11 digits, limiting the representable file size to 8GB. However, GNU tar and apparently most other modern tars support a convention in which oversized values can be stored in base-256, allowing any practical file to be a tar member. Adopt this convention to remove two limitations: * pg_dump with -Ft output format failed if the contents of any one table exceeded 8GB. * pg_basebackup failed if the data directory contained any file exceeding 8GB. (This would be a fatal problem for installations configured with a table segment size of 8GB or more, and it has also been seen to fail when large core dump files exist in the data directory.) File sizes under 8GB are still printed in octal, so that no compatibility issues are created except in cases that would have failed entirely before. In addition, this patch fixes several bugs in the same area: * In 9.3 and later, we'd defined tarCreateHeader's file-size argument as size_t, which meant that on 32-bit machines it would write a corrupt tar header for file sizes between 4GB and 8GB, even though no error was raised. This broke both "pg_dump -Ft" and pg_basebackup for such cases. * pg_restore from a tar archive would fail on tables of size between 4GB and 8GB, on machines where either "size_t" or "unsigned long" is 32 bits. This happened even with an archive file not affected by the previous bug. * pg_basebackup would fail if there were files of size between 4GB and 8GB, even on 64-bit machines. * In 9.3 and later, "pg_basebackup -Ft" failed entirely, for any file size, on 64-bit big-endian machines. In view of these potential data-loss bugs, back-patch to all supported branches, even though removal of the documented 8GB limit might otherwise be considered a new feature rather than a bug fix.	2015-11-21 20:21:31 -05:00
Tom Lane	074c5cfbfb	Fix handling of inherited check constraints in ALTER COLUMN TYPE (again). The previous way of reconstructing check constraints was to do a separate "ALTER TABLE ONLY tab ADD CONSTRAINT" for each table in an inheritance hierarchy. However, that way has no hope of reconstructing the check constraints' own inheritance properties correctly, as pointed out in bug #13779 from Jan Dirk Zijlstra. What we should do instead is to do a regular "ALTER TABLE", allowing recursion, at the topmost table that has a particular constraint, and then suppress the work queue entries for inherited instances of the constraint. Annoyingly, we'd tried to fix this behavior before, in commit `5ed6546cf`, but we failed to notice that it wasn't reconstructing the pg_constraint field values correctly. As long as I'm touching pg_get_constraintdef_worker anyway, tweak it to always schema-qualify the target table name; this seems like useful backup to the protections installed by commit `5f173040`. In HEAD/9.5, get rid of get_constraint_relation_oids, which is now unused. (I could alternatively have modified it to also return conislocal, but that seemed like a pretty single-purpose API, so let's not pretend it has some other use.) It's unused in the back branches as well, but I left it in place just in case some third-party code has decided to use it. In HEAD/9.5, also rename pg_get_constraintdef_string to pg_get_constraintdef_command, as the previous name did nothing to explain what that entry point did differently from others (and its comment was equally useless). Again, that change doesn't seem like material for back-patching. I did a bit of re-pgindenting in tablecmds.c in HEAD/9.5, as well. Otherwise, back-patch to all supported branches.	2015-11-20 14:55:47 -05:00
Robert Haas	6c878a7553	Avoid server crash when worker registration fails at execution time. The previous coding attempts to destroy the DSM in this case, but child nodes might have stored data there and still be holding onto pointers in this case. So don't do that. Also, free the reader array instead of leaking it. Extracted from two different patch versions both by Amit Kapila.	2015-11-20 13:03:39 -05:00
Robert Haas	74d0d5f3eb	Fix typo in comment. Amit Langote	2015-11-19 16:45:39 -05:00
Robert Haas	fea2b642fd	Remove numbers from incorrectly-numbered list. Reported by Andres Freund.	2015-11-19 16:45:13 -05:00
Robert Haas	bc4996e61b	Make ALTER .. SET SCHEMA do nothing, instead of throwing an ERROR. This was already true for CREATE EXTENSION, but historically has not been true for other object types. Therefore, this is a backward incompatibility. Per discussion on pgsql-hackers, everyone seems to agree that the new behavior is better. Marti Raudsepp, reviewed by Haribabu Kommi and myself	2015-11-19 10:49:25 -05:00
Robert Haas	7907a949ab	Fix incomplete set_foreignscan_references handling for fdw_recheck_quals KaiGai Kohei	2015-11-18 22:12:21 -05:00
Andres Freund	d3c8ac114f	Remove function names from some elog() calls in heapam.c. At least one of the names was, due to a function renaming late in the development of ON CONFLICT, wrong. Since including function names in error messages is against the message style guide anyway, remove them from the messages. Discussion: CAM3SWZT8paz=usgMVHm0XOETkQvzjRtAUthATnmaHQQY0obnGw@mail.gmail.com Backpatch: 9.5, where ON CONFLICT was introduced	2015-11-19 01:37:58 +01:00
Robert Haas	166b61a88e	Avoid aggregating worker instrumentation multiple times. Amit Kapila, per design ideas from me.	2015-11-18 12:35:25 -05:00
Robert Haas	adeee97486	Fix dumb bug in tqueue.c When I wrote this code originally, the intention was to recompute the remapinfo only when the tupledesc changes. This presumably only happens once per query, but I copied the design pattern from other DestReceivers. However, due to a silly oversight on my part, tqueue->tupledesc never got set, leading to recomputation for every tuple. This should improve the performance of parallel scans that return a significant number of tuples. Report by Amit Kapila; patch by me, reviewed by him.	2015-11-18 08:25:33 -05:00
Tom Lane	5f10b7a604	Fix possible internal overflow in numeric division. div_var_fast() postpones propagating carries in the same way as mul_var(), so it has the same corner-case overflow risk we fixed in `246693e5ae`, namely that the size of the carries has to be accounted for when setting the threshold for executing a carry propagation step. We've not devised a test case illustrating the brokenness, but the required fix seems clear enough. Like the previous fix, back-patch to all active branches. Dean Rasheed	2015-11-17 15:46:47 -05:00
Peter Eisentraut	c5ec406412	Message style fix from Euler Taveira	2015-11-17 06:53:07 -05:00
Peter Eisentraut	5db837d3f2	Message improvements	2015-11-16 21:39:23 -05:00
Robert Haas	e93b62985f	Remove volatile qualifiers from bufmgr.c and freelist.c Prior to commit `0709b7ee72`, access to variables within a spinlock-protected critical section had to be done through a volatile pointer, but that should no longer be necessary. Review by Andres Freund	2015-11-16 18:50:06 -05:00
Tom Lane	8004953b5a	Speed up ruleutils' name de-duplication code, and fix overlength-name case. Since commit `11e131854f`, ruleutils.c has attempted to ensure that each RTE in a query or plan tree has a unique alias name. However, the code that was added for this could be quite slow, even as bad as O(N^3) if N identical RTE names must be replaced, as noted by Jeff Janes. Improve matters by building a transient hash table within set_rtable_names. The hash table in itself reduces the cost of detecting a duplicate from O(N) to O(1), and we can save another factor of N by storing the number of de-duplicated names already created for each entry, so that we don't have to re-try names already created. This way is probably a bit slower overall for small range tables, but almost by definition, such cases should not be a performance problem. In principle the same problem applies to the column-name-de-duplication code; but in practice that seems to be less of a problem, first because N is limited since we don't support extremely wide tables, and second because duplicate column names within an RTE are fairly rare, so that in practice the cost is more like O(N^2) not O(N^3). It would be very much messier to fix the column-name code, so for now I've left that alone. An independent problem in the same area was that the de-duplication code paid no attention to the identifier length limit, and would happily produce identifiers that were longer than NAMEDATALEN and wouldn't be unique after truncation to NAMEDATALEN. This could result in dump/reload failures, or perhaps even views that silently behaved differently than before. We can fix that by shortening the base name as needed. Fix it for both the relation and column name cases. In passing, check for interrupts in set_rtable_names, just in case it's still slow enough to be an issue. Back-patch to 9.3 where this code was introduced.	2015-11-16 13:45:17 -05:00
Robert Haas	179c97bf58	Remove accidentally-committed debugging code. Amit Kapila	2015-11-15 18:07:57 -05:00
Tom Lane	7745bc352a	Fix ruleutils.c's dumping of whole-row Vars in ROW() and VALUES() contexts. Normally ruleutils prints a whole-row Var as "foo.". We already knew that that doesn't work at top level of a SELECT list, because the parser would treat the "" as a directive to expand the reference into separate columns, not a whole-row Var. However, Joshua Yanovski points out in bug #13776 that the same thing happens at top level of a ROW() construct; and some nosing around in the parser shows that the same is true in VALUES(). Hence, apply the same workaround already devised for the SELECT-list case, namely to add a forced cast to the appropriate rowtype in these cases. (The alternative of just printing "foo" was rejected because it is difficult to avoid ambiguity against plain columns named "foo".) Back-patch to all supported branches.	2015-11-15 14:41:09 -05:00
Tom Lane	7d9a4737c2	Improve type numeric's calculations for ln(), log(), exp(), pow(). Set the "rscales" for intermediate-result calculations to ensure that suitable numbers of significant digits are maintained throughout. The previous coding hadn't thought this through in any detail, and as a result could deliver results with many inaccurate digits, or in the worst cases even fail with divide-by-zero errors as a result of losing all nonzero digits of intermediate results. In exp_var(), get rid entirely of the logic that separated the calculation into integer and fractional parts: that was neither accurate nor particularly fast. The existing range-reduction method of dividing by 2^n can be applied across the full input range instead of only 0..1, as long as we are careful to set an appropriate rscale for each step. Also fix the logic in mul_var() for shortening the calculation when the caller asks for fewer output digits than an exact calculation would require. This bug doesn't affect simple multiplications since that code path asks for an exact result, but it does contribute to accuracy issues in the transcendental math functions. In passing, improve performance of mul_var() a bit by forcing the shorter input to be on the left, thus reducing the number of iterations of the outer loop and probably also reducing the number of carry-propagation steps needed. This is arguably a bug fix, but in view of the lack of field complaints, it does not seem worth the risk of back-patching. Dean Rasheed	2015-11-14 14:55:46 -05:00

... 8 9 10 11 12 ...

16346 Commits