Commit Graph

7397 Commits

Author SHA1 Message Date
Tom Lane 75c24d0f74 Further reduce the number of semaphores used under --disable-spinlocks.
Per discussion, there doesn't seem to be much value in having
NUM_SPINLOCK_SEMAPHORES set to 1024: under any scenario where you are
running more than a few backends concurrently, you really had better have a
real spinlock implementation if you want tolerable performance.  And 1024
semaphores is a sizable fraction of the system-wide SysV semaphore limit
on many platforms.  Therefore, reduce this setting's default value to 128
to make it less likely to cause out-of-semaphores problems.
2016-04-18 13:33:06 -04:00
Robert Haas 5702277ca9 Tweak EXPLAIN for parallel query to show workers launched.
The previous display was sort of confusing, because it didn't
distinguish between the number of workers that we planned to launch
and the number that actually got launched.  This has already confused
several people, so display both numbers and label them clearly.

Julien Rouhaud, reviewed by me.
2016-04-15 11:52:18 -04:00
Tom Lane 6b85d4ba9b Fix portability problem induced by commit a6f6b7819.
pg_xlogdump includes bufmgr.h.  With a compiler that emits code for
static inline functions even when they're unreferenced, that leads
to unresolved external references in the new static-inline version
of BufferGetPage().  So hide it with #ifndef FRONTEND, as we've done
for similar issues elsewhere.  Per buildfarm member pademelon.
2016-04-15 10:44:28 -04:00
Andres Freund 4b74c6a40e Make init_spin_delay() C89 compliant #2.
My previous attempt at doing so, in 80abbeba23, was not sufficient. While that
fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant
expressions in the struct initializer, because the file/line/function
information comes from the caller of s_lock().

Give up on using a macro, and use a static inline instead.

Discussion: 4369.1460435533@sss.pgh.pa.us
2016-04-14 19:26:13 -07:00
Andres Freund 533cd2303a Remove trailing commas in enums.
These aren't valid C89. Found thanks to gcc's -Wc90-c99-compat. These
exist in differing places in most supported branches.
2016-04-14 19:25:16 -07:00
Tom Lane c2dc194bdb Adjust signature of walrcv_receive hook.
Commit 314cbfc5da redefined the signature of this hook as
typedef int (*walrcv_receive_type) (char **buffer, int *wait_fd);

But in fact the type of the "wait_fd" variable ought to be pgsocket,
which is what WaitLatchOrSocket expects, and which is necessary if
we want to be able to assign PGINVALID_SOCKET to it on Windows.
So fix that.
2016-04-14 13:49:37 -04:00
Tom Lane 22989a8e34 Fix prototype of pgwin32_bind().
I (tgl) had copied-and-pasted this from pgwin32_accept(), failing to
notice that the third parameter should be "int" not "int *".

David Rowley
2016-04-14 09:44:21 -04:00
Andres Freund be65eddd80 Add required database and origin filtering for logical messages.
Logical messages, added in 3fe3511d05, during decoding failed to filter
messages emitted in other databases and messages emitted "under" a
replication origin the output plugin isn't interested in.

Add tests to verify that both types of filtering actually work. While
touching message.sql remove hunk obsoleted by d25379e.

Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because
3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump
catversion, but 7a542700d has done so since.

Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de
2016-04-13 17:38:54 -07:00
Andres Freund 80abbeba23 Make init_spin_delay() C89 compliant and change stuck spinlock reporting.
The current definition of init_spin_delay (introduced recently in
48354581a) wasn't C89 compliant. It's not legal to refer to refer to
non-constant expressions, and the ptr argument was one.  This, as
reported by Tom, lead to a failure on buildfarm animal pademelon.

The pointer, especially on system systems with ASLR, isn't super helpful
anyway, though. So instead of making init_spin_delay into an inline
function, make s_lock_stuck() report the function name in addition to
file:line and change init_spin_delay() accordingly. While not a direct
replacement, the function name is likely more useful anyway (line
numbers are often hard to interpret in third party reports).

This also fixes what file/line number is reported for waits via
s_lock().

As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h.

Reported-By: Tom Lane
Discussion: 4369.1460435533@sss.pgh.pa.us
2016-04-13 17:00:53 -07:00
Andres Freund 6b93fcd149 Avoid atomic operation in MarkLocalBufferDirty().
The recent patch to make Pin/UnpinBuffer lockfree in the hot
path (48354581a), accidentally used pg_atomic_fetch_or_u32() in
MarkLocalBufferDirty(). Other code operating on local buffers was
careful to only use pg_atomic_read/write_u32 which just read/write from
memory; to avoid unnecessary overhead.

On its own that'd just make MarkLocalBufferDirty() slightly less
efficient, but in addition InitLocalBuffers() doesn't call
pg_atomic_init_u32() - thus the spinlock fallback for the atomic
operations isn't initialized. That in turn caused, as reported by Tom,
buildfarm animal gaur to fail.  As those errors are actually useful
against this type of error, continue to omit - intentionally this time -
initialization of the atomic variable.

In addition, add an explicit note about only using pg_atomic_read/write
on local buffers's state to BufferDesc's description.

Reported-By: Tom Lane
Discussion: 1881.1460431476@sss.pgh.pa.us
2016-04-13 15:28:29 -07:00
Tom Lane 95ef43c430 Widen amount-to-flush arguments of FileWriteback and callers.
It's silly to define these counts as narrower than they might someday
need to be.  Also, I believe that the BLCKSZ * nflush calculation in
mdwriteback was capable of overflowing an int.
2016-04-13 18:12:06 -04:00
Tom Lane d1b7d4877b Provide errno-translation wrappers around bind() and listen() on Windows.
I've seen one too many "could not bind IPv4 socket: No error" log entries
from the Windows buildfarm members.  Per previous discussion, this is
likely caused by the fact that we're doing nothing to translate
WSAGetLastError() to errno.  Put in a wrapper layer to do that.

If this works as expected, it should get back-patched, but let's see what
happens in the buildfarm first.

Discussion: <4065.1452450340@sss.pgh.pa.us>
2016-04-12 19:52:21 -04:00
Robert Haas deb71fa971 Fix costing for parallel aggregation.
The original patch kind of ignored the fact that we were doing something
different from a costing point of view, but nobody noticed.  This patch
fixes that oversight.

David Rowley
2016-04-12 16:25:55 -04:00
Tom Lane f1f01de145 Redefine create_upper_paths_hook as being invoked once per upper relation.
Per discussion, this gives potential users of the hook more flexibility,
because they can build custom Paths that implement only one stage of
upper processing atop core-provided Paths for earlier stages.
2016-04-12 15:23:14 -04:00
Tom Lane 5713f03973 Improve API of GenericXLogRegister().
Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future.  Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.

Alexander Korotkov, adjusted slightly by me
2016-04-12 11:42:06 -04:00
Kevin Grittner a6f6b78196 Use static inline function for BufferGetPage()
I was initially concerned that the some of the hundreds of
references to BufferGetPage() where the literal
BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a
macro, leading to some hard-to-find performance regressions in
corner cases.  Inspection of disassembled code has shown identical
code at all inspected locations, and the size difference doesn't
amount to even one byte per such call.  So make it readable.

Per gripes from Álvaro Herrera and Tom Lane
2016-04-11 16:47:50 -05:00
Andres Freund 008608b9d5 Avoid the use of a separate spinlock to protect a LWLock's wait queue.
Previously we used a spinlock, in adition to the atomically manipulated
->state field, to protect the wait queue. But it's pretty simple to
instead perform the locking using a flag in state.

Due to 6150a1b0 BufferDescs, on platforms (like PPC) with > 1 byte
spinlocks, increased their size above 64byte. As 64 bytes are the size
we pad allocated BufferDescs to, this can increase false sharing;
causing performance problems in turn. Together with the previous commit
this reduces the size to <= 64 bytes on all common platforms.

Author: Andres Freund
Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com
    20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de
2016-04-10 20:12:32 -07:00
Andres Freund 48354581a4 Allow Pin/UnpinBuffer to operate in a lockfree manner.
Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).

To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable;
that allows to manipulate them together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could
be lifted by using a 64bit field, but it's not a realistic configuration
atm).

As not all operations can easily implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot-paths.  There's some additional operations which can be done without
the lock, but aren't in this patch; but the most important places are
covered.

As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It now has already two
users, and more are coming up; there's a follupw patch for lwlock.c at
least.

This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me.  Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.

On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.

Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell
2016-04-10 20:12:32 -07:00
Tom Lane 08e785436f Get rid of GenericXLogUnregister().
This routine is unsafe as implemented, because it invalidates the page
image pointers returned by previous GenericXLogRegister() calls.

Rather than complicate the API or the implementation to avoid that,
let's just get rid of it; the use-case for having it seems much
too thin to justify a lot of work here.

While at it, do some wordsmithing on the SGML docs for generic WAL.
2016-04-09 16:39:30 -04:00
Kevin Grittner 381200be4b Fix typo in C comment. 2016-04-09 09:07:42 -05:00
Kevin Grittner 56dffb5a73 Turn special page pointer validation to static inline function
Inclusion of multiple macros inside another macro was pushing MSVC
past its size liimit.  Reported by buildfarm.
2016-04-09 08:17:22 -05:00
Alvaro Herrera c09b18f21c Support \crosstabview in psql
\crosstabview is a completely different way to display results from a
query: instead of a vertical display of rows, the data values are placed
in a grid where the column and row headers come from the data itself,
similar to a spreadsheet.

The sort order of the horizontal header can be specified by using
another column in the query, and the vertical header determines its
ordering from the order in which they appear in the query.

This only allows displaying a single value in each cell.  If more than
one value correspond to the same cell, an error is thrown.  Merging of
values can be done in the query itself, if necessary.  This may be
revisited in the future.

Author: Daniel Verité
Reviewed-by: Pavel Stehule, Dean Rasheed
2016-04-08 20:23:18 -03:00
Andres Freund c1ddd2361f Expose more out/readfuncs support functions.
Previously bcac23d exposed a subset of support functions, namely the
ones Kaigai found useful. In
20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de I mentioned that
there's some functions missing to use the facility in an external
project.

To avoid having to add functions piecemeal, add all the functions which
are used to define READ_* and WRITE_* macros; users of the extensible
node functionality are likely to need these. Additionally expose
outDatum(), which doesn't have it's own WRITE_ macro, as it needs
information from the embedding struct.

Discussion: 20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de
2016-04-08 14:26:36 -07:00
Stephen Frost 7a542700df Create default roles
This creates an initial set of default roles which administrators may
use to grant access to, historically, superuser-only functions.  Using
these roles instead of granting superuser access reduces the number of
superuser roles required for a system.  Documention for each of the
default roles has been added to user-manag.sgml.

Bump catversion to 201604082, as we had a commit that bumped it to
201604081 and another that set it back to 201604071...

Reviews by José Luis Tallón and Robert Haas
2016-04-08 16:56:27 -04:00
Stephen Frost 293007898d Reserve the "pg_" namespace for roles
This will prevent users from creating roles which begin with "pg_" and
will check for those roles before allowing an upgrade using pg_upgrade.

This will allow for default roles to be provided at initdb time.

Reviews by José Luis Tallón and Robert Haas
2016-04-08 16:56:27 -04:00
Kevin Grittner 848ef42bb8 Add the "snapshot too old" feature
This feature is controlled by a new old_snapshot_threshold GUC.  A
value of -1 disables the feature, and that is the default.  The
value of 0 is just intended for testing.  Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect.  The xmin associated with a transaction ID does
still protect dead tuples.  A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.

This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
2016-04-08 14:36:30 -05:00
Kevin Grittner 8b65cf4c5e Modify BufferGetPage() to prepare for "snapshot too old" feature
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in.  It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point.  This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third.  The follow-on patch will change the places where the test
needs to be made.
2016-04-08 14:30:10 -05:00
Teodor Sigaev 8b99edefca Revert CREATE INDEX ... INCLUDING ...
It's not ready yet, revert two commits
690c543550 - unstable test output
386e3d7609 - patch itself
2016-04-08 21:52:13 +03:00
Magnus Hagander 35e2e357cb Add authentication parameters compat_realm and upn_usename for SSPI
These parameters are available for SSPI authentication only, to make
it possible to make it behave more like "normal gssapi", while
making it possible to maintain compatibility.

compat_realm is on by default, but can be turned off to make the
authentication use the full Kerberos realm instead of the NetBIOS name.

upn_username is off by default, and can be turned on to return the users
Kerberos UPN rather than the SAM-compatible name (a user in Active
Directory can have both a legacy SAM-compatible username and a new
Kerberos one. Normally they are the same, but not always)

Author: Christian Ullrich
Reviewed by: Robbie Harwood, Alvaro Herrera, me
2016-04-08 20:28:38 +02:00
Tom Lane 34c33a1f00 Add BSD authentication method.
Create a "bsd" auth method that works the same as "password" so far as
clients are concerned, but calls the BSD Authentication service to
check the password.  This is currently only available on OpenBSD.

Marisa Emerson, reviewed by Thomas Munro
2016-04-08 13:52:06 -04:00
Robert Haas af025eed53 Add combine functions for various floating-point aggregates.
This allows parallel aggregation to use them.  It may seem surprising
that we use float8_combine for both float4_accum and float8_accum
transition functions, but that's because those functions differ only
in the type of the non-transition-state argument.

Haribabu Kommi, reviewed by David Rowley and Tomas Vondra
2016-04-08 13:47:06 -04:00
Teodor Sigaev 1ec4c7c055 Restore original tsquery operation numbering.
As noticed by Tom Lane changing operation's number in commit
bb140506df causes on-disk format incompatibility.
Revert to previous numbering, that is reason to add special array to store
priorities of operation. Also it reverts order of tsquery to previous.

Author: Dmitry Ivanov
2016-04-08 20:11:30 +03:00
Teodor Sigaev 386e3d7609 CREATE INDEX ... INCLUDING (column[, ...])
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.

Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
2016-04-08 19:45:59 +03:00
Robert Haas 25fe8b5f1a Add a 'parallel_degree' reloption.
The code that estimates what parallel degree should be uesd for the
scan of a relation is currently rather stupid, so add a parallel_degree
reloption that can be used to override the planner's rather limited
judgement.

Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila,
and me.  Some further hacking by me.
2016-04-08 11:14:56 -04:00
Peter Eisentraut 2f1d2b7a75 Set PAM_RHOST item for PAM authentication
The PAM_RHOST item is set to the remote IP address or host name and can
be used by PAM modules.  A pg_hba.conf option is provided to choose
between IP address and resolved host name.

From: Grzegorz Sampolski <grzsmp@gmail.com>
Reviewed-by: Haribabu Kommi <kommi.haribabu@gmail.com>
2016-04-08 10:48:44 -04:00
Teodor Sigaev 4e55b3f033 Rename comparePos() to compareWordEntryPos()
Rename comparePos() to compareWordEntryPos() to prevent export of too
generic name.

Per gripe from Tom Lane.
2016-04-08 12:04:15 +03:00
Robert Haas 0711803775 Use quicksort, not replacement selection, for external sorting.
We still use replacement selection for the first run of the sort only
and only when the number of tuples is relatively small.  Otherwise,
the first run, and subsequent runs in all cases, are produced using
quicksort.  This tends to be faster except perhaps for very small
amounts of working memory.

Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy,
Greg Stark, and me.
2016-04-08 02:36:26 -04:00
Robert Haas 719c84c1be Extend relations multiple blocks at a time to improve scalability.
Contention on the relation extension lock can become quite fierce when
multiple processes are inserting data into the same relation at the same
time at a high rate.  Experimentation shows the extending the relation
multiple blocks at a time improves scalability.

Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
2016-04-08 02:04:46 -04:00
Simon Riggs 137805f89a Use Foreign Key relationships to infer multi-column join selectivity
In cases where joins use multiple columns we currently assess each join
separately causing gross mis-estimates for join cardinality.

This patch adds use of FK information for the first time into the
planner. When FKs are present and we have multi-column join information,
plan estimates will be drastically improved. Cases with multiple FKs
are handled, though partial matches are ignored currently.

Net effect is substantial performance improvements for joins in many
common cases. Additional planning time is isolated to cases that are
currently performing poorly, measured at 0.08 - 0.15 ms.

Please watch for planner performance regressions; circumstances seem
unlikely but the law of unintended consequences may apply somewhen.
Additional complex tests welcome to prove this before release.

Tests can be performed using SET enable_fkey_estimates = on | off
using scripts provided during Hackers discussions, message id:
552335D9.3090707@2ndquadrant.com

Authors: Tomas Vondra and David Rowley
Reviewed and tested by Simon Riggs, adding comments only
2016-04-08 02:51:09 +01:00
Teodor Sigaev bb140506df Phrase full text search.
Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery.
On-disk and binary in/out format of tsquery are backward compatible.
It has two side effect:
- change order for tsquery, so, users, who has a btree index over tsquery,
  should reindex it
- less number of parenthesis in tsquery output, and tsquery becomes more
  readable

Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov
2016-04-07 18:44:18 +03:00
Simon Riggs 015e88942a Load FK defs into relcache for use by planner
Fastpath ignores this if no triggers defined.

Author: Tomas Vondra, with fastpath and comments added by me
Reviewers: David Rowley, Simon Riggs
2016-04-07 12:08:33 +01:00
Stephen Frost 29dd1504a1 Bump catversion for pg_dump dump catalog ACL patches
Pointed out by Tom.
2016-04-06 23:04:48 -04:00
Stephen Frost 23f34fa4ba In pg_dump, include pg_catalog and extension ACLs, if changed
Now that all of the infrastructure exists, add in the ability to
dump out the ACLs of the objects inside of pg_catalog or the ACLs
for objects which are members of extensions, but only if they have
been changed from their original values.

The original values are tracked in pg_init_privs.  When pg_dump'ing
9.6-and-above databases, we will dump out the ACLs for all objects
in pg_catalog and the ACLs for all extension members, where the ACL
has been changed from the original value which was set during either
initdb or CREATE EXTENSION.

This should not change dumps against pre-9.6 databases.

Reviews by Alexander Korotkov, Jose Luis Tallon
2016-04-06 21:45:32 -04:00
Stephen Frost 6c268df127 Add new catalog called pg_init_privs
This new catalog holds the privileges which the system was
initialized with at initdb time, along with any permissions set
by extensions at CREATE EXTENSION time.  This allows pg_dump
(and any other similar use-cases) to detect when the privileges
set on initdb-created or extension-created objects have been
changed from what they were set to at initdb/extension-creation
time and handle those changes appropriately.

Reviews by Alexander Korotkov, Jose Luis Tallon
2016-04-06 21:45:32 -04:00
Teodor Sigaev 0b62fd036e Add jsonb_insert
It inserts a new value into an jsonb array at arbitrary position or
a new key to jsonb object.

Author: Dmitry Dolgov
Reviewers: Petr Jelinek, Vitaly Burovoy, Andrew Dunstan
2016-04-06 19:25:00 +03:00
Tom Lane de94e2af18 Run pgindent on a batch of (mostly-planner-related) source files.
Getting annoyed at the amount of unrelated chatter I get from pgindent'ing
Rowley's unique-joins patch.  Re-indent all the files it touches.
2016-04-06 11:34:02 -04:00
Simon Riggs 3fe3511d05 Generic Messages for Logical Decoding
API and mechanism to allow generic messages to be inserted into WAL that are
intended to be read by logical decoding plugins. This commit adds an optional
new callback to the logical decoding API.

Messages are either text or bytea. Messages can be transactional, or not, and
are identified by a prefix to allow multiple concurrent decoding plugins.

(Not to be confused with Generic WAL records, which are intended to allow crash
recovery of extensible objects.)

Author: Petr Jelinek and Andres Freund
Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs
Discussion: 5685F999.6010202@2ndquadrant.com
2016-04-06 10:05:41 +01:00
Fujii Masao 989be0810d Support multiple synchronous standby servers.
Previously synchronous replication offered only the ability to confirm
that all changes made by a transaction had been transferred to at most
one synchronous standby server.

This commit extends synchronous replication so that it supports multiple
synchronous standby servers. It enables users to consider one or more
standby servers as synchronous, and increase the level of transaction
durability by ensuring that transaction commits wait for replies from
all of those synchronous standbys.

Multiple synchronous standby servers are configured in
synchronous_standby_names which is extended to support new syntax of
'num_sync ( standby_name [ , ... ] )', where num_sync specifies
the number of synchronous standbys that transaction commits need to
wait for replies from and standby_name is the name of a standby
server.

The syntax of 'standby_name [ , ... ]' which was used in 9.5 or before
is also still supported. It's the same as new syntax with num_sync=1.

This commit doesn't include "quorum commit" feature which was discussed
in pgsql-hackers. Synchronous standbys are chosen based on their priorities.
synchronous_standby_names determines the priority of each standby for
being chosen as a synchronous standby. The standbys whose names appear
earlier in the list are given higher priority and will be considered as
synchronous. Other standby servers appearing later in this list
represent potential synchronous standbys.

The regression test for multiple synchronous standbys is not included
in this commit. It should come later.

Authors: Sawada Masahiko, Beena Emerson, Michael Paquier, Fujii Masao
Reviewed-By: Kyotaro Horiguchi, Amit Kapila, Robert Haas, Simon Riggs,
Amit Langote, Thomas Munro, Sameer Thakur, Suraj Kharage, Abhijit Menon-Sen,
Rajeev Rastogi

Many thanks to the various individuals who were involved in
discussing and developing this feature.
2016-04-06 17:18:25 +09:00
Alvaro Herrera f2fcad27d5 Support ALTER THING .. DEPENDS ON EXTENSION
This introduces a new dependency type which marks an object as depending
on an extension, such that if the extension is dropped, the object
automatically goes away; and also, if the database is dumped, the object
is included in the dump output.  Currently the grammar supports this for
indexes, triggers, materialized views and functions only, although the
utility code is generic so adding support for more object types is a
matter of touching the parser rules only.

Author: Abhijit Menon-Sen
Reviewed-by: Alexander Korotkov, Álvaro Herrera
Discussion: http://www.postgresql.org/message-id/20160115062649.GA5068@toroid.org
2016-04-05 18:38:54 -03:00
Robert Haas 41ea0c2376 Fix parallel-safety code for parallel aggregation.
has_parallel_hazard() was ignoring the proparallel markings for
aggregates, which is no good.  Fix that.  There was no way to mark
an aggregate as actually being parallel-safe, either, so add a
PARALLEL option to CREATE AGGREGATE.

Patch by me, reviewed by David Rowley.
2016-04-05 16:06:15 -04:00
Robert Haas 11c8669c0c Add parallel query support functions for assorted aggregates.
This lets us use parallel aggregate for a variety of useful cases
that didn't work before, like sum(int8), sum(numeric), several
versions of avg(), and various other functions.

Add some regression tests, as well, testing the general sanity of
these and future catalog entries.

David Rowley, reviewed by Tomas Vondra, with a few further changes
by me.
2016-04-05 14:32:53 -04:00
Magnus Hagander 7117685461 Implement backup API functions for non-exclusive backups
Previously non-exclusive backups had to be done using the replication protocol
and pg_basebackup. With this commit it's now possible to make them using
pg_start_backup/pg_stop_backup as well, as long as the backup program can
maintain a persistent connection to the database.

Doing this, backup_label and tablespace_map are returned as results from
pg_stop_backup() instead of being written to the data directory. This makes
the server safe from a crash during an ongoing backup, which can be a problem
with exclusive backups.

The old syntax of the functions remain and work exactly as before, but since the
new syntax is safer this should eventually be deprecated and removed.

Only reference documentation is included. The main section on backup still needs
to be rewritten to cover this, but since that is already scheduled for a separate
large rewrite, it's not included in this patch.

Reviewed by David Steele and Amit Kapila
2016-04-05 20:03:49 +02:00
Tom Lane 66229ac004 Introduce a LOG_SERVER_ONLY ereport level, which is never sent to client.
This elevel is useful for logging audit messages and similar information
that should not be passed to the client.  It's equivalent to LOG in terms
of decisions about logging priority in the postmaster log, but messages
with this elevel will never be sent to the client.

In the current implementation, it's just an alias for the longstanding
COMMERROR elevel (or more accurately, we've made COMMERROR an alias for
this).  At some point it might be interesting to allow a LOG_ONLY flag to
be attached to any elevel, but that would be considerably more complicated,
and it's not clear there's enough use-cases to justify the extra work.
For now, let's just take the easy 90% solution.

David Steele, reviewed by Fabien Coelho, Petr Jelínek, and myself
2016-04-04 12:32:42 -04:00
Teodor Sigaev 65578341af Add Generic WAL interface
This interface is designed to give an access to WAL for extensions which
could implement new access method, for example. Previously it was
impossible because restoring from custom WAL would need to access system
catalog to find a redo custom function. This patch suggests generic way
to describe changes on page with standart layout.

Bump XLOG_PAGE_MAGIC because of new record type.

Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and
	minor editorization by my
Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby,
	Michael Paquier
2016-04-01 12:21:48 +03:00
Tom Lane f9aefcb91f Support using index-only scans with partial indexes in more cases.
Previously, the planner would reject an index-only scan if any restriction
clause for its table used a column not available from the index, even
if that restriction clause would later be dropped from the plan entirely
because it's implied by the index's predicate.  This is a fairly common
situation for partial indexes because predicates using columns not included
in the index are often the most useful kind of predicate, and we have to
duplicate (or at least imply) the predicate in the WHERE clause in order
to get the index to be considered at all.  So index-only scans were
essentially unavailable with such partial indexes.

To fix, we have to do detection of implied-by-predicate clauses much
earlier in the planner.  This patch puts it in check_index_predicates
(nee check_partial_indexes), meaning it gets done for every partial index,
whereas we previously only considered this issue at createplan time,
so that the work was only done for an index actually selected for use.
That could result in a noticeable planning slowdown for queries against
tables with many partial indexes.  However, testing suggested that there
isn't really a significant cost, especially not with reasonable numbers
of partial indexes.  We do get a small additional benefit, which is that
cost_index is more accurate since it correctly discounts the evaluation
cost of clauses that will be removed.  We can also avoid considering such
clauses as potential indexquals, which saves useless matching cycles in
the case where the predicate columns aren't in the index, and prevents
generating bogus plans that double-count the clause's selectivity when
the columns are in the index.

Tomas Vondra and Kyotaro Horiguchi, reviewed by Kevin Grittner and
Konstantin Knizhnik, and whacked around a little by me
2016-03-31 14:49:10 -04:00
Alvaro Herrera 3dd0792ae0 Blind attempt at fixing Win32 issue on 24c5f1a103
As best as I can tell, MyReplicationSlot needs to be PGDLLIMPORT in
order for the new test_slot_timelines test module to compile.

Per buildfarm
2016-03-30 23:12:20 -03:00
Alvaro Herrera 24c5f1a103 Enable logical slots to follow timeline switches
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline.  This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.

This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.

Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
2016-03-30 20:07:05 -03:00
Alvaro Herrera 3b02ea4f07 XLogReader general code cleanup
Some minor tweaks and comment additions, for cleanliness sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.

Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
2016-03-30 18:56:13 -03:00
Tom Lane 50861cd683 Improve portability of I/O behavior for the geometric types.
Formerly, the geometric I/O routines such as box_in and point_out relied
directly on strtod() and sprintf() for conversion of the float8 component
values of their data types.  However, the behavior of those functions is
pretty platform-dependent, especially for edge-case values such as
infinities and NaNs.  This was exposed by commit acdf2a8b37, which
added test cases involving boxes with infinity endpoints, and immediately
failed on Windows and AIX buildfarm members.  We solved these problems
years ago in the main float8in and float8out functions, so let's fix it
by making the geometric types use that code instead of depending directly
on the platform-supplied functions.

To do this, refactor the float8in code so that it can be used to parse
just part of a string, and as a convenience make the guts of float8out
usable without going through DirectFunctionCall.

While at it, get rid of geo_ops.c's fairly shaky assumptions about the
maximum output string length for a double, by having it build results in
StringInfo buffers instead of fixed-length strings.

In passing, convert all the "invalid input syntax for type foo" messages
in this area of the code into "invalid input syntax for type %s" to reduce
the number of distinct translatable strings, per recent discussion.
We would have needed a fair number of the latter anyway for code-sharing
reasons, so we might as well just go whole hog.

Note: this patch is by no means intended to guarantee that the geometric
types uniformly behave sanely for infinity or NaN component values.
But any bugs we have in that line were there all along, they were just
harder to reach in a platform-independent way.
2016-03-30 17:25:03 -04:00
Teodor Sigaev 2d02a856e8 Bump catalog version, forget in acdf2a8b37 2016-03-30 18:56:21 +03:00
Teodor Sigaev acdf2a8b37 Introduce SP-GiST operator class over box.
Patch implements quad-tree over boxes, naive approach of 2D quad tree will not
work for any non-point objects because splitting space on node is not
efficient. The idea of pathc is treating 2D boxes as 4D points, so,
object will not overlap (in 4D space).

The performance tests reveal that this technique especially beneficial
with too much overlapping objects, so called "spaghetti data".

Author: Alexander Lebedev with editorization by Emre Hasegeli and me
2016-03-30 18:42:36 +03:00
Teodor Sigaev ccd6eb49a4 Introduce traversalValue for SP-GiST scan
During scan sometimes it would be very helpful to know some information about
parent node or all 	ancestor nodes. Right now reconstructedValue could be used
but it's not a right usage of it (range opclass uses that).

traversalValue is arbitrary piece of memory in separate MemoryContext while
reconstructedVale should have the same type as indexed column.

Subsequent patches for range opclass and quad4d tree will use it.

Author: Alexander Lebedev, Teodor Sigaev
2016-03-30 18:29:28 +03:00
Robert Haas 314cbfc5da Add new replication mode synchronous_commit = 'remote_apply'.
In this mode, the master waits for the transaction to be applied on
the remote side, not just written to disk.  That means that you can
count on a transaction started on the standby to see all commits
previously acknowledged by the master.

To make this work, the standby sends a reply after replaying each
commit record generated with synchronous_commit >= 'remote_apply'.
This introduces a small inefficiency: the extra replies will be sent
even by standbys that aren't the current synchronous standby.  But
previously-existing synchronous_commit levels make no attempt at all
to optimize which replies are sent based on what the primary cares
about, so this is no worse, and at least avoids any extra replies for
people not using the feature at all.

Thomas Munro, reviewed by Michael Paquier and by me.  Some additional
tweaks by me.
2016-03-29 21:29:49 -04:00
Tom Lane e511d878f3 Allow to_timestamp(float8) to convert float infinity to timestamp infinity.
With the original SQL-function implementation, such cases failed because
we don't support infinite intervals.  Converting the function to C lets
us bypass the interval representation, which should be a bit faster as
well as more flexible.

Vitaly Burovoy, reviewed by Anastasia Lubennikova
2016-03-29 17:09:29 -04:00
Robert Haas 5fe5a2cee9 Allow aggregate transition states to be serialized and deserialized.
This is necessary infrastructure for supporting parallel aggregation
for aggregates whose transition type is "internal".  Such values
can't be passed between cooperating processes, because they are
just pointers.

David Rowley, reviewed by Tomas Vondra and by me.
2016-03-29 15:04:05 -04:00
Robert Haas f9143d102f Rework custom scans to work more like the new extensible node stuff.
Per discussion, the new extensible node framework is thought to be
better designed than the custom path/scan/scanstate stuff we added
in PostgreSQL 9.5.  Rework the latter to be more like the former.

This is not backward-compatible, but we generally don't promise that
for C APIs, and there probably aren't many people using this yet
anyway.

KaiGai Kohei, reviewed by Petr Jelinek and me.  Some further
cosmetic changes by me.
2016-03-29 11:28:04 -04:00
Robert Haas 5d4171d1c7 Don't require a user mapping for FDWs to work.
Commit fbe5a3fb73 accidentally changed
this behavior; put things back the way they were, and add some
regression tests.

Report by Andres Freund; patch by Ashutosh Bapat, with a bit of
kibitzing by me.
2016-03-28 21:50:28 -04:00
Robert Haas 868628e4fd On all Windows platforms, not just Cygwin, use _timezone and _tzname.
Up until now, we've been using timezone and tzname, but Visual Studio
2015 (for which we wish to add support) no longer declares those
symbols.  All versions since Visual Studio 2003 apparently support the
underscore-equipped names, and we don't support anything older than
Visual Studio 2005, so this should work OK everywhere.  But let's see
what the buildfarm thinks.

Michael Paquier, reviewed by Petr Jelinek
2016-03-28 20:59:25 -04:00
Andres Freund 1a7a43672b Don't use !! but != 0/NULL to force boolean evaluation.
I introduced several uses of !! to force bit arithmetic to be boolean,
but per discussion the project prefers != 0/NULL.

Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
2016-03-27 18:10:19 +02:00
Andres Freund af4472bcb8 Change various Gin*Is* macros to return 0/1.
Returning the direct result of bit arithmetic, in a macro intended to be
used in a boolean manner, can be problematic if the return value is
stored in a variable of type 'bool'. If bool is implemented using C99's
_Bool, that can lead to comparison failures if the variable is then
compared again with the expression (see ginStepRight() for an example
that fails), as _Bool forces the result to be 0/1. That happens in some
configurations of newer MSVC compilers.  It's also problematic when
storing the result of such an expression in a narrower type.

Several gin macros have been declared in that style since gin's initial
commit in 8a3631f8d8.

There's a lot more macros like this, but this is the only one causing
regression test failures; and I don't want to commit and backpatch a
larger patch with lots of conflicts just before the next set of minor
releases.

Discussion: 20150811154237.GD17575@awork2.anarazel.de
Backpatch: All supported branches
2016-03-27 17:46:48 +02:00
Tom Lane c94959d411 Fix DROP OPERATOR to reset oprcom/oprnegate links to the dropped operator.
This avoids leaving dangling links in pg_operator; which while fairly
harmless are also unsightly.

While we're at it, simplify OperatorUpd, which went through
heap_modify_tuple for no very good reason considering it had already made
a tuple copy it could just scribble on.

Roma Sokolov, reviewed by Tomas Vondra, additional hacking by Robert Haas
and myself.
2016-03-25 12:33:16 -04:00
Tom Lane c1156411ad Move psql's psqlscan.l into src/fe_utils.
This completes (at least for now) the project of getting rid of ad-hoc
linkages among the src/bin/ subdirectories.  Everything they share is now
in src/fe_utils/ and is included from a static library at link time.

A side benefit is that we can restore the FLEX_NO_BACKUP check for
psqlscanslash.l.  We might need to think of another way to do that check
if we ever need to build two lexers with that property in the same source
directory, but there's no foreseeable reason to need that.
2016-03-24 20:28:47 -04:00
Tom Lane d65bea26a8 Move psql's print.c and mbprint.c into src/fe_utils.
Just turning the crank ...
2016-03-24 18:27:28 -04:00
Tom Lane 588d963b00 Create src/fe_utils/, and move stuff into there from pg_dump's dumputils.
Per discussion, we want to create a static library and put the stuff into
it that until now has been shared across src/bin/ directories by ad-hoc
methods like symlinking a source file.  This commit creates the library and
populates it with a couple of files that contain the widely-useful portions
of pg_dump's dumputils.c file.  dumputils.c survives, because it has some
stuff that didn't seem appropriate for fe_utils, but it's significantly
smaller and is no longer referenced from any other directory.

Follow-on patches will move more stuff into fe_utils.

The Mkvcbuild.pm hacking here is just a best guess; we'll see how the
buildfarm likes it.
2016-03-24 15:55:57 -04:00
Tom Lane c2d1eea9e7 Avoid PGDLLIMPORT for simple local references in frontend programs.
I was wondering if this would be an issue, and buildfarm member frogmouth
says it is.
2016-03-23 23:26:44 -04:00
Alvaro Herrera 473b932870 Support CREATE ACCESS METHOD
This enables external code to create access methods.  This is useful so
that extensions can add their own access methods which can be formally
tracked for dependencies, so that DROP operates correctly.  Also, having
explicit support makes pg_dump work correctly.

Currently only index AMs are supported, but we expect different types to
be added in the future.

Authors: Alexander Korotkov, Petr Jelínek
Reviewed-By: Teodor Sigaev, Petr Jelínek, Jim Nasby
Commitfest-URL: https://commitfest.postgresql.org/9/353/
Discussion: https://www.postgresql.org/message-id/CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com
2016-03-23 23:01:35 -03:00
Tom Lane 2c6af4f442 Move keywords.c/kwlookup.c into src/common/.
Now that we have src/common/ for code shared between frontend and backend,
we can get rid of (most of) the klugy ways that the keyword table and
keyword lookup code were formerly shared between different uses.
This is a first step towards a more general plan of getting rid of
special-purpose kluges for sharing code in src/bin/.

I chose to merge kwlookup.c back into keywords.c, as it once was, and
always has been so far as keywords.h is concerned.  We could have
kept them separate, but there is noplace that uses ScanKeywordLookup
without also wanting access to the backend's keyword list, so there
seems little point.

ecpg is still a bit weird, but at least now the trickiness is documented.

I think that the MSVC build script should require no adjustments beyond
what's done here ... but we'll soon find out.
2016-03-23 20:22:08 -04:00
Robert Haas e06a38965b Support parallel aggregation.
Parallel workers can now partially aggregate the data and pass the
transition values back to the leader, which can combine the partial
results to produce the final answer.

David Rowley, based on earlier work by Haribabu Kommi.  Reviewed by
Álvaro Herrera, Tomas Vondra, Amit Kapila, James Sewell, and me.
2016-03-21 09:30:18 -04:00
Andres Freund 7fa0064092 Properly declare FeBeWaitSet.
Surprising that this worked on a number of systems. Reported by
buildfarm member longfin.
2016-03-21 12:58:18 +01:00
Andres Freund 98a64d0bd7 Introduce WaitEventSet API.
Commit ac1d794 ("Make idle backends exit if the postmaster dies.")
introduced a regression on, at least, large linux systems. Constantly
adding the same postmaster_alive_fds to the OSs internal datastructures
for implementing poll/select can cause significant contention; leading
to a performance regression of nearly 3x in one example.

This can be avoided by using e.g. linux' epoll, which avoids having to
add/remove file descriptors to the wait datastructures at a high rate.
Unfortunately the current latch interface makes it hard to allocate any
persistent per-backend resources.

Replace, with a backward compatibility layer, WaitLatchOrSocket with a
new WaitEventSet API. Users can allocate such a Set across multiple
calls, and add more than one file-descriptor to wait on. The latter has
been added because there's upcoming postgres features where that will be
helpful.

In addition to the previously existing poll(2), select(2),
WaitForMultipleObjects() implementations also provide an epoll_wait(2)
based implementation to address the aforementioned performance
problem. Epoll is only available on linux, but that is the most likely
OS for machines large enough (four sockets) to reproduce the problem.

To actually address the aforementioned regression, create and use a
long-lived WaitEventSet for FE/BE communication.  There are additional
places that would benefit from a long-lived set, but that's a task for
another day.

Thanks to Amit Kapila, who helped make the windows code I blindly wrote
actually work.

Reported-By: Dmitry Vasilyev Discussion:
CAB-SwXZh44_2ybvS5Z67p_CDz=XFn4hNAD=CnMEF+QqkXwFrGg@mail.gmail.com
20160114143931.GG10941@awork2.anarazel.de
2016-03-21 12:22:54 +01:00
Andres Freund 72e2d21c12 Combine win32 and unix latch implementations.
Previously latches for windows and unix had been implemented in
different files. A later patch introduce an expanded wait
infrastructure, keeping the implementation separate would introduce too
much duplication.

This basically just moves the functions, without too much change. The
reason to keep this separate is that it allows blame to continue working
a little less badly; and to make review a tiny bit easier.

Discussion: 20160114143931.GG10941@awork2.anarazel.de
2016-03-21 11:03:26 +01:00
Peter Eisentraut b555ed8102 Merge wal_level "archive" and "hot_standby" into new name "replica"
The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability.  This is now a long time ago.  We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.

Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses.  The old values
are still accepted but are converted internally.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: David Steele <david@pgmasters.net>
2016-03-18 23:56:03 +01:00
Andres Freund fad0f9d8c9 Remove unused, and dangerous, TestLatch() macro.
The macro has not seen any in-tree use since latches had been introduced
in 2746e5f, in 2010.
2016-03-18 11:46:42 -07:00
Robert Haas 0bf3ae88af Directly modify foreign tables.
postgres_fdw can now sent an UPDATE or DELETE statement directly to
the foreign server in simple cases, rather than sending a SELECT FOR
UPDATE statement and then updating or deleting rows one-by-one.

Etsuro Fujita, reviewed by Rushabh Lathia, Shigeru Hanada, Kyotaro
Horiguchi, Albe Laurenz, Thom Brown, and me.
2016-03-18 13:55:52 -04:00
Teodor Sigaev 3187d6de0e Introduce parse_ident()
SQL-layer function to split qualified identifier into array parts.

Author: Pavel Stehule with minor editorization by me and Jim Nasby
2016-03-18 18:16:14 +03:00
Teodor Sigaev f4ceed6ceb Improve support of Hunspell
- allow to use non-ascii characters as affix flag. Non-numeric affix flags now
  are stored as string instead of numeric value of character.
- allow to use 0 as affix flag in numeric encoded affixes

That adds support for arabian, hungarian, turkish and
brazilian portuguese languages.

Author: Artur Zakirov with heavy editorization by me
2016-03-17 17:23:38 +03:00
Peter Eisentraut fc201dfd95 Add syslog_split_messages parameter
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
2016-03-16 23:21:44 -04:00
Peter Eisentraut f4c454e9ba Add syslog_sequence_numbers parameter
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
2016-03-16 23:21:44 -04:00
Tom Lane a70e13a39e Be more careful about out-of-range dates and timestamps.
Tighten the semantics of boundary-case timestamptz so that we allow
timestamps >= '4714-11-24 00:00+00 BC' and < 'ENDYEAR-01-01 00:00+00 AD'
exactly, no more and no less, but it is allowed to enter timestamps
within that range using non-GMT timezone offsets (which could make the
nominal date 4714-11-23 BC or ENDYEAR-01-01 AD).  This eliminates
dump/reload failure conditions for timestamps near the endpoints.
To do this, separate checking of the inputs for date2j() from the
final range check, and allow the Julian date code to handle a range
slightly wider than the nominal range of the datatypes.

Also add a bunch of checks to detect out-of-range dates and timestamps
that formerly could be returned by operations such as date-plus-integer.
All C-level functions that return date, timestamp, or timestamptz should
now be proof against returning a value that doesn't pass IS_VALID_DATE()
or IS_VALID_TIMESTAMP().

Vitaly Burovoy, reviewed by Anastasia Lubennikova, and substantially
whacked around by me
2016-03-16 19:09:28 -04:00
Robert Haas c6dda1f48e Add idle_in_transaction_session_timeout.
Vik Fearing, reviewed by Stéphane Schildknecht and me, and revised
slightly by me.
2016-03-16 11:30:45 -04:00
Robert Haas 3aff33aa68 Fix typos.
Oskari Saarenmaa
2016-03-15 18:06:11 -04:00
Robert Haas c16dc1aca5 Add simple VACUUM progress reporting.
There's a lot more that could be done here yet - in particular, this
reports only very coarse-grained information about the index vacuuming
phase - but even as it stands, the new pg_stat_progress_vacuum can
tell you quite a bit about what a long-running vacuum is actually
doing.

Amit Langote and Robert Haas, based on earlier work by Vinayak Pokale
and Rahila Syed.
2016-03-15 13:32:56 -04:00
Tom Lane 0e9b89986b Cope if platform declares mbstowcs_l(), but not locale_t, in <xlocale.h>.
Previously, we included <xlocale.h> only if necessary to get the definition
of type locale_t.  According to notes in PGAC_TYPE_LOCALE_T, this is
important because on some versions of glibc that file supplies an
incompatible declaration of locale_t.  (This info may be obsolete, because
on my RHEL6 box that seems to be the *only* definition of locale_t; but
there may still be glibc's in the wild for which it's a live concern.)

It turns out though that on FreeBSD and maybe other BSDen, you can get
locale_t from stdlib.h or locale.h but mbstowcs_l() and friends only from
<xlocale.h>.  This was leaving us compiling calls to mbstowcs_l() and
friends with no visible prototype, which causes a warning and could
possibly cause actual trouble, since it's not declared to return int.

Hence, adjust the configure checks so that we'll include <xlocale.h>
either if it's necessary to get type locale_t or if it's necessary to
get a declaration of mbstowcs_l().

Report and patch by Aleksander Alekseev, somewhat whacked around by me.
Back-patch to all supported branches, since we have been using
mbstowcs_l() since 9.1.
2016-03-15 13:19:57 -04:00
Tom Lane 101fd9349e Add a GetForeignUpperPaths callback function for FDWs.
This is basically like the just-added create_upper_paths_hook, but
control is funneled only to the FDW responsible for all the baserels
of the current query; so providing such a callback is much less likely
to add useless overhead than using the hook function is.

The documentation is a bit sketchy.  We'll likely want to improve it,
and/or adjust the call conventions, when we get some experience with
actually using this callback.  Hopefully somebody will find time to
experiment with it before 9.6 feature freeze.
2016-03-14 20:04:48 -04:00
Peter Eisentraut be6de4c121 Add missing include for self-containment 2016-03-14 19:56:33 -04:00
Tom Lane 5864d6a4b6 Provide a planner hook at a suitable place for creating upper-rel Paths.
In the initial revision of the upper-planner pathification work, the only
available way for an FDW or custom-scan provider to inject Paths
representing post-scan-join processing was to insert them during scan-level
GetForeignPaths or similar processing.  While that's not impossible, it'd
require quite a lot of duplicative processing to look forward and see if
the extension would be capable of implementing the whole query.  To improve
matters for custom-scan providers, provide a hook function at the point
where the core code is about to start filling in upperrel Paths.  At this
point Paths are available for the whole scan/join tree, which should reduce
the amount of redundant effort considerably.

(An alternative design that was suggested was to provide a separate hook
for each post-scan-join processing step, but that seems messy and not
clearly more useful.)

Following our time-honored tradition, there's no documentation for this
hook outside the source code.

As-is, this hook is only meant for custom scan providers, which we can't
assume very much about.  A followon patch will implement an FDW callback
to let FDWs do the same thing in a somewhat more structured fashion.
2016-03-14 19:23:29 -04:00
Tom Lane 28048cbaa2 Allow callers of create_foreignscan_path to specify nondefault PathTarget.
Although the default choice of rel->reltarget should typically be
sufficient for scan or join paths, it's not at all sufficient for the
purposes PathTargets were invented for; in particular not for
upper-relation Paths.  So break API compatibility by adding a PathTarget
argument to create_foreignscan_path().  To ease updating of existing
code, accept a NULL value of the argument as selecting rel->reltarget.
2016-03-14 17:31:28 -04:00
Tom Lane 307c78852f Rethink representation of PathTargets.
In commit 19a541143a I did not make PathTarget a subtype of Node,
and embedded a RelOptInfo's reltarget directly into it rather than having
a separately-allocated Node.  In hindsight that was misguided
micro-optimization, enabled by the fact that at that point we didn't have
any Paths with custom PathTargets.  Now that PathTarget processing has
been fleshed out some more, it's easier to see that it's better to have
PathTarget as an indepedent Node type, even if it does cost us one more
palloc to create a RelOptInfo.  So change it while we still can.

This commit just changes the representation, without doing anything more
interesting than that.
2016-03-14 16:59:59 -04:00
Robert Haas 6be84eeb8d Update more comments for 96198d94cb.
Etsuro Fujita, reviewed (though not completely endorsed) by Ashutosh
Bapat, and slightly expanded by me.
2016-03-14 14:29:12 -04:00
Tom Lane 2da7549987 pg_stat_get_progress_info() should be marked STRICT.
I didn't bother with a catversion bump.

Report and patch by Thomas Munro
2016-03-14 12:51:55 -04:00
Tom Lane 23a27b039d Widen query numbers-of-tuples-processed counters to uint64.
This patch widens SPI_processed, EState's es_processed field, PortalData's
portalPos field, FuncCallContext's call_cntr and max_calls fields,
ExecutorRun's count argument, PortalRunFetch's result, and the max number
of rows in a SPITupleTable to uint64, and deals with (I hope) all the
ensuing fallout.  Some of these values were declared uint32 before, and
others "long".

I also removed PortalData's posOverflow field, since that logic seems
pretty useless given that portalPos is now always 64 bits.

The user-visible results are that command tags for SELECT etc will
correctly report tuple counts larger than 4G, as will plpgsql's GET
GET DIAGNOSTICS ... ROW_COUNT command.  Queries processing more tuples
than that are still not exactly the norm, but they're becoming more
common.

Most values associated with FETCH/MOVE distances, such as PortalRun's count
argument and the count argument of most SPI functions that have one, remain
declared as "long".  It's not clear whether it would be worth promoting
those to int64; but it would definitely be a large dollop of additional
API churn on top of this, and it would only help 32-bit platforms which
seem relatively less likely to see any benefit.

Andreas Scherbaum, reviewed by Christian Ullrich, additional hacking by me
2016-03-12 16:05:29 -05:00
Tom Lane 570be1f73f Re-export a few of createplan.c's make_xxx() functions.
CitusDB is using these and don't wish to redesign their code right now.
I am not on board with this being a good idea, or a good precedent,
but I lack the energy to fight about it.
2016-03-12 12:12:59 -05:00
Teodor Sigaev a9eb6c83ef Bump catalog version missed in 6943a946c7 2016-03-11 19:31:04 +03:00
Teodor Sigaev 6943a946c7 Tsvector editing functions
Adds several tsvector editting function: convert tsvector to/from text array,
set weight for given lexemes, delete lexeme(s), unnest, filter lexemes
with given weights

Author: Stas Kelvich with some editorization by me
Reviewers: Tomas Vondram, Teodor Sigaev
2016-03-11 19:22:36 +03:00
Tom Lane 49635d7b3e Minor additional refactoring of planner.c's PathTarget handling.
Teach make_group_input_target() and make_window_input_target() to work
entirely with the PathTarget representation of tlists, rather than
constructing a tlist and immediately deconstructing it into PathTarget
format.  In itself this only saves a few palloc's; the bigger picture is
that it opens the door for sharing cost_qual_eval work across all of
planner.c's constructions of PathTargets.  I'll come back to that later.

In support of this, flesh out tlist.c's infrastructure for PathTargets
a bit more.
2016-03-11 10:24:55 -05:00
Simon Riggs 73e7e49da3 Allow emit_log_hook to see original message text
emit_log_hook could only see the translated text, making it harder to identify
which message was being sent. Pass original text to allow the exact message to
be identified, whichever language is used for logging.

Discussion: 20160216.184755.59721141.horiguchi.kyotaro@lab.ntt.co.jp
Author: Kyotaro Horiguchi
2016-03-11 09:53:06 +00:00
Andres Freund 9cd00c457e Checkpoint sorting and balancing.
Up to now checkpoints were written in the order they're in the
BufferDescriptors. That's nearly random in a lot of cases, which
performs badly on rotating media, but even on SSDs it causes slowdowns.

To avoid that, sort checkpoints before writing them out. We currently
sort by tablespace, relfilenode, fork and block number.

One of the major reasons that previously wasn't done, was fear of
imbalance between tablespaces. To address that balance writes between
tablespaces.

The other prime concern was that the relatively large allocation to sort
the buffers in might fail, preventing checkpoints from happening. Thus
pre-allocate the required memory in shared memory, at server startup.

This particularly makes it more efficient to have checkpoint flushing
enabled, because that'll often result in a lot of writes that can be
coalesced into one flush.

Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
2016-03-10 17:05:09 -08:00
Andres Freund 428b1d6b29 Allow to trigger kernel writeback after a configurable number of writes.
Currently writes to the main data files of postgres all go through the
OS page cache. This means that some operating systems can end up
collecting a large number of dirty buffers in their respective page
caches.  When these dirty buffers are flushed to storage rapidly, be it
because of fsync(), timeouts, or dirty ratios, latency for other reads
and writes can increase massively.  This is the primary reason for
regular massive stalls observed in real world scenarios and artificial
benchmarks; on rotating disks stalls on the order of hundreds of seconds
have been observed.

On linux it is possible to control this by reducing the global dirty
limits significantly, reducing the above problem. But global
configuration is rather problematic because it'll affect other
applications; also PostgreSQL itself doesn't always generally want this
behavior, e.g. for temporary files it's undesirable.

Several operating systems allow some control over the kernel page
cache. Linux has sync_file_range(2), several posix systems have msync(2)
and posix_fadvise(2). sync_file_range(2) is preferable because it
requires no special setup, whereas msync() requires the to-be-flushed
range to be mmap'ed. For the purpose of flushing dirty data
posix_fadvise(2) is the worst alternative, as flushing dirty data is
just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
from the page cache.  Thus the feature is enabled by default only on
linux, but can be enabled on all systems that have any of the above
APIs.

While desirable and likely possible this patch does not contain an
implementation for windows.

With the infrastructure added, writes made via checkpointer, bgwriter
and normal user backends can be flushed after a configurable number of
writes. Each of these sources of writes controlled by a separate GUC,
checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
respectively; they're separate because the number of flushes that are
good are separate, and because the performance considerations of
controlled flushing for each of these are different.

A later patch will add checkpoint sorting - after that flushes from the
ckeckpoint will almost always be desirable. Bgwriter flushes are most of
the time going to be random, which are slow on lots of storage hardware.
Flushing in backends works well if the storage and bgwriter can keep up,
but if not it can have negative consequences.  This patch is likely to
have negative performance consequences without checkpoint sorting, but
unfortunately so has sorting without flush control.

Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
2016-03-10 17:04:34 -08:00
Tom Lane c82c92b111 Give pull_var_clause() reject/recurse/return behavior for WindowFuncs too.
All along, this function should have treated WindowFuncs in a manner
similar to Aggrefs, ie with an option whether or not to recurse into them.
By not considering the case, it was always recursing, which is OK for most
callers (although I suspect that the case in prepare_sort_from_pathkeys
might represent a bug).  But now we need return-without-recursing behavior
as well.  There are also more than a few callers that should never see a
WindowFunc, and now we'll get some error checking on that.
2016-03-10 16:23:52 -05:00
Tom Lane 364a9f47ab Refactor pull_var_clause's API to make it less tedious to extend.
In commit 1d97c19a0f and later c1d9579dd8, we extended
pull_var_clause's API by adding enum-type arguments.  That's sort of a pain
to maintain, though, because it means every time we add a new behavior we
must touch every last one of the call sites, even if there's a reasonable
default behavior that most of them could use.  Let's switch over to using a
bitmask of flags, instead; that seems more maintainable and might save a
nanosecond or two as well.  This commit changes no behavior in itself,
though I'm going to follow it up with one that does add a new behavior.

In passing, remove flatten_tlist(), which has not been used since 9.1
and would otherwise need the same API changes.

Removing these enums means that optimizer/tlist.h no longer needs to
depend on optimizer/var.h.  Changing that caused a number of C files to
need addition of #include "optimizer/var.h" (probably we can thank old
runs of pgrminclude for that); but on balance it seems like a good change
anyway.
2016-03-10 15:53:07 -05:00
Simon Riggs 37c54863cf Rework wait for AccessExclusiveLocks on Hot Standby
Earlier version committed in 9.0 caused spurious waits in some cases.
New infrastructure for lock waits in 9.3 used to correct and improve this.

Jeff Janes based upon a proposal by Simon Riggs, who also reviewed
Additional review comments from Amit Kapila
2016-03-10 19:26:24 +00:00
Robert Haas 53be0b1add Provide much better wait information in pg_stat_activity.
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting.  Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.

Amit Kapila, Ildus Kurbangaliev, reviewed by me.  Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
2016-03-10 12:44:09 -05:00
Andres Freund 606e0f9841 Introduce durable_rename() and durable_link_or_rename().
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes; especially on filesystems like xfs and ext4 when mounted
with data=writeback. To be certain that a rename() atomically replaces
the previous file contents in the face of crashes and different
filesystems, one has to fsync the old filename, rename the file, fsync
the new filename, fsync the containing directory.  This sequence is not
generally adhered to currently; which exposes us to data loss risks. To
avoid having to repeat this arduous sequence, introduce
durable_rename(), which wraps all that.

Also add durable_link_or_rename(). Several places use link() (with a
fallback to rename()) to rename a file, trying to avoid replacing the
target file out of paranoia. Some of those rename sequences need to be
durable as well. There seems little reason extend several copies of the
same logic, so centralize the link() callers.

This commit does not yet make use of the new functions; they're used in
a followup commit.

Author: Michael Paquier, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
2016-03-09 18:53:53 -08:00
Robert Haas b6fb6471f6 Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.

As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else.  A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting.  But this gets the basic framework in place.

Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 12:08:58 -05:00
Tom Lane 51c0f63e4d Improve handling of pathtargets in planner.c.
Refactor so that the internal APIs in planner.c deal in PathTargets not
targetlists, and establish a more regular structure for deriving the
targets needed for successive steps.

There is more that could be done here; calculating the eval costs of each
successive target independently is both inefficient and wrong in detail,
since we won't actually recompute values available from the input node's
tlist.  But it's no worse than what happened before the pathification
rewrite.  In any case this seems like a good starting point for considering
how to handle Konstantin Knizhnik's function-evaluation-postponement patch.
2016-03-09 01:12:16 -05:00
Tom Lane 9e8b99420f Improve handling of group-column indexes in GroupingSetsPath.
Instead of having planner.c compute a groupColIdx array and store it in
GroupingSetsPaths, make create_groupingsets_plan() find the grouping
columns by searching in the child plan node's tlist.  Although that's
probably a bit slower for create_groupingsets_plan(), it's more like
the way every other plan node type does this, and it provides positive
confirmation that we know which child output columns we're supposed to be
grouping on.  (Indeed, looking at this now, I'm not at all sure that it
wasn't broken before, because create_groupingsets_plan() isn't demanding
an exact tlist match from its child node.)  Also, this allows substantial
simplification in planner.c, because it no longer needs to compute the
groupColIdx array at all; no other cases were using it.

I'd intended to put off this refactoring until later (like 9.7), but
in view of the likely bug fix and the need to rationalize planner.c's
tlist handling so we can do something sane with Konstantin Knizhnik's
function-evaluation-postponement patch, I think it can't wait.
2016-03-08 22:32:11 -05:00
Tom Lane 8c314b9853 Finish refactoring make_foo() functions in createplan.c.
This patch removes some redundant cost calculations that I left for later
cleanup in commit 3fc6e2d7f5.  There's now a uniform policy that the
make_foo() convenience functions don't do any cost calculations.  Most of
their callers copy costs from the source Path node, and for those that
don't, the calculation in the make_foo() function wasn't necessarily right
anyhow.  (make_result() was particularly a mess, as it was serving multiple
callers using cost calcs designed for only the first one or two that had
ever existed.)  Aside from saving a few cycles, this ensures that what
EXPLAIN prints matches the costs we used for planning purposes.  It does
not change any planner decisions, since the decisions are already made.
2016-03-08 16:28:34 -05:00
Robert Haas 070140ee48 Add some functions to fd.c for the convenience of extensions.
For example, if you want to perform an ioctl() on a file descriptor
opened through the fd.c routines, there's no way to do that without
being able to get at the underlying fd.

KaiGai Kohei
2016-03-08 10:09:50 -05:00
Robert Haas 77a1d1e798 Department of second thoughts: remove PD_ALL_FROZEN.
Commit a892234f83 added a second bit per
page to the visibility map, which still seems like a good idea, but it
also added a second page-level bit alongside PD_ALL_VISIBLE to track
whether the visibility map bit was set.  That no longer seems like a
clever plan, because we don't really need that bit for anything.  We
always clear both bits when the page is modified anyway.

Patch by me, reviewed by Kyotaro Horiguchi and Masahiko Sawada.
2016-03-08 08:46:48 -05:00
Tom Lane 3fc6e2d7f5 Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is.  This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps.  Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step.  We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.

In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan.  It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation.  (A couple of regression test outputs change in consequence of
that.  However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)

There is a great deal left to do here.  This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations.  (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.)  I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.

Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 15:58:22 -05:00
Andres Freund c8f621c43a logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds.  Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.

Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.

This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size.  As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.

Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-05 18:02:20 -08:00
Andres Freund d9e903f3cb logical decoding: Tell reorderbuffer about all xids.
Logical decoding's reorderbuffer keeps transactions in an LSN ordered
list for efficiency. To make that's efficiently possible upper-level
xids are forced to be logged before nested subtransaction xids.  That
only works though if these records are all looked at: Unfortunately we
didn't do so for e.g. row level locks, which are otherwise uninteresting
for logical decoding.

This could lead to errors like:
"ERROR: subxact logged without previous toplevel record".

It's not sufficient to just look at row locking records, the xid could
appear first due to a lot of other types of records (which will trigger
the transaction to be marked logged with MarkCurrentTransactionIdLoggedIfAny).
So invent infrastructure to tell reorderbuffer about xids seen, when
they'd otherwise not pass through reorderbuffer.c.

Reported-By: Jarred Ward
Bug: #13844
Discussion: 20160105033249.1087.66040@wrigleys.postgresql.org
Backpatch: 9.4, where logical decoding was added
2016-03-05 18:02:20 -08:00
Joe Conway dc7d70ea05 Expose control file data via SQL accessible functions.
Add four new SQL accessible functions: pg_control_system(),
pg_control_checkpoint(), pg_control_recovery(), and pg_control_init()
which expose a subset of the control file data.

Along the way move the code to read and validate the control file to
src/common, where it can be shared by the new backend functions
and the original pg_controldata frontend program.

Patch by me, significant input, testing, and review by Michael Paquier.
2016-03-05 11:10:19 -08:00
Teodor Sigaev d78a7d9c7f Improve support of Hunspell in ispell dictionary.
Now it's possible to load recent version of Hunspell for several languages.
To handle these dictionaries Hunspell patch adds support for:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65535)
* AF parameter - alias for flag's set

Also it moves test dictionaries into separate directory.

Author: Artur Zakirov with editorization by me
2016-03-04 20:08:47 +03:00
Simon Riggs c7111d11b1 Revert buggy optimization of index scans
606c0123d6 attempted to reduce cost of index scans using > and <
strategies, though got that completely wrong in a few complex cases.

Revert whole patch until we find a safe optimization.
2016-03-03 09:53:43 +00:00
Tom Lane eb43e851d6 Create stub functions to support pg_upgrade of old contrib/tsearch2.
Commits 9ff60273e3 and dbe2328959 adjusted the declarations
of some core functions referenced by contrib/tsearch2's install script,
forgetting that in a pg_upgrade situation, we'll be trying to restore
operator class definitions that reference the old signatures.  We've
hit this problem before; solve it in the same way as before, namely by
installing stub functions that have the expected signature and just
invoke the correct function.  Per report from Jeff Janes.

(Someday we ought to stop supporting contrib/tsearch2, but I'm not
sure today is that day.)
2016-03-02 17:37:54 -05:00
Robert Haas a892234f83 Change the format of the VM fork to add a second bit per page.
The new bit indicates whether every tuple on the page is already frozen.
It is cleared only when the all-visible bit is cleared, and it can be
set only when we vacuum a page and find that every tuple on that page is
both visible to every transaction and in no need of any future
vacuuming.

A future commit will use this new bit to optimize away full-table scans
that would otherwise be triggered by XID wraparound considerations.  A
page which is merely all-visible must still be scanned in that case, but
a page which is all-frozen need not be.  This commit does not attempt
that optimization, although that optimization is the goal here.  It
seems better to get the basic infrastructure in place first.

Per discussion, it's very desirable for pg_upgrade to automatically
migrate existing VM forks from the old format to the new format.  That,
too, will be handled in a follow-on patch.

Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit
Kapila, Simon Riggs, Andres Freund, and others, and substantially
revised by me.
2016-03-01 21:49:41 -05:00
Robert Haas 35746bc348 Add new FDW API to test for parallel-safety.
This is basically a bug fix; the old code assumes that a ForeignScan
is always parallel-safe, but for postgres_fdw, for example, this is
definitely false.  It should be true for file_fdw, though, since a
worker can read a file from the filesystem just as well as any other
backend process.

Original patch by Thomas Munro.  Documentation, and changes to the
comments, by me.
2016-02-26 16:14:46 +05:30
Alvaro Herrera 343f709c06 Fix typos
Backpatch to: 9.4
2016-02-25 20:50:20 -03:00
Tom Lane 52f5d578d6 Create a function to reliably identify which sessions block which others.
This patch introduces "pg_blocking_pids(int) returns int[]", which returns
the PIDs of any sessions that are blocking the session with the given PID.
Historically people have obtained such information using a self-join on
the pg_locks view, but it's unreasonably tedious to do it that way with any
modicum of correctness, and the addition of parallel queries has pretty
much broken that approach altogether.  (Given some more columns in the view
than there are today, you could imagine handling parallel-query cases with
a 4-way join; but ugh.)

The new function has the following behaviors that are painful or impossible
to get right via pg_locks:

1. Correctly understands which lock modes block which other ones.

2. In soft-block situations (two processes both waiting for conflicting lock
modes), only the one that's in front in the wait queue is reported to
block the other.

3. In parallel-query cases, reports all sessions blocking any member of
the given PID's lock group, and reports a session by naming its leader
process's PID, which will be the pg_backend_pid() value visible to
clients.

The motivation for doing this right now is mostly to fix the isolation
tests.  Commit 38f8bdcac4 lobotomized
isolationtester's is-it-waiting query by removing its ability to recognize
nonconflicting lock modes, as a crude workaround for the inability to
handle soft-block situations properly.  But even without the lock mode
tests, the old query was excessively slow, particularly in
CLOBBER_CACHE_ALWAYS builds; some of our buildfarm animals fail the new
deadlock-hard test because the deadlock timeout elapses before they can
probe the waiting status of all eight sessions.  Replacing the pg_locks
self-join with use of pg_blocking_pids() is not only much more correct, but
a lot faster: I measure it at about 9X faster in a typical dev build with
Asserts, and 3X faster in CLOBBER_CACHE_ALWAYS builds.  That should provide
enough headroom for the slower CLOBBER_CACHE_ALWAYS animals to pass the
test, without having to lengthen deadlock_timeout yet more and thus slow
down the test for everyone else.
2016-02-22 14:31:43 -05:00
Tom Lane 73bf8715aa Remove redundant PGPROC.lockGroupLeaderIdentifier field.
We don't really need this field, because it's either zero or redundant with
PGPROC.pid.  The use of zero to mark "not a group leader" is not necessary
since we can just as well test whether lockGroupLeader is NULL.  This does
not save very much, either as to code or data, but the simplification seems
worthwhile anyway.
2016-02-22 11:20:35 -05:00
Tom Lane c7a1c5a6b6 Cosmetic improvements in new config_info code.
Coverity griped about use of unchecked strcpy() into a local variable.
There's unlikely to be any actual bug there, since no caller would be
passing a path longer than MAXPGPATH, but nonetheless use of strlcpy()
seems preferable.

While at it, get rid of unmaintainable separation between list of
field names and list of field values in favor of initializing them
in parallel.  And we might as well declare get_configdata()'s path
argument as const char *, even though no current caller needs that.
2016-02-21 11:38:24 -05:00
Robert Haas d91a4a6c85 Cosmetic improvements to group locking.
Reflow text in lock manager README so that it fits within 80 columns.
Correct some mistakes.  Expand the README to explain not only why group
locking exists but also the data structures that support it.  Improve
comments related to group locking several files.  Change the name of a
macro argument for improved clarity.

Most of these problems were reported by Tom Lane, but I found a few
of them myself.

Robert Haas and Tom Lane
2016-02-21 15:42:02 +05:30
Dean Rasheed 53874c5228 Add pg_size_bytes() to parse human-readable size strings.
This will parse strings in the format produced by pg_size_pretty() and
return sizes in bytes. This allows queries to be written with clauses
like "pg_total_relation_size(oid) > pg_size_bytes('10 GB')".

Author: Pavel Stehule with various improvements by Vitaly Burovoy
Discussion: http://www.postgresql.org/message-id/CAFj8pRD-tGoDKnxdYgECzA4On01_uRqPrwF-8LdkSE-6bDHp0w@mail.gmail.com
Reviewed-by: Vitaly Burovoy, Oleksandr Shulgin, Kyotaro Horiguchi,
    Michael Paquier and Robert Haas
2016-02-20 09:57:27 +00:00
Noah Misch 5882ca6686 Call xlc __isync() after, not before, associated compare-and-swap.
Architecture reference material specifies this order, and s_lock.h
inline assembly agrees.  The former order failed to provide mutual
exclusion to lwlock.c and perhaps to other clients.  The two xlc
buildfarm members, hornet and mandrill, have failed sixteen times with
duplicate key errors involving pg_class_oid_index or pg_type_oid_index.
Back-patch to 9.5, where commit b64d92f1a5
introduced atomics.

Reviewed by Andres Freund and Tom Lane.
2016-02-19 22:47:50 -05:00
Tom Lane 19a541143a Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist).  However, there are good
reasons to remove that assumption.  For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free".  While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node).  Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next.  Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit.  It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.

While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist.  To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)

I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation.  Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't.  Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths.  This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more.  This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-18 20:02:03 -05:00
Joe Conway a5c43b8869 Add new system view, pg_config
Move and refactor the underlying code for the pg_config client
application to src/common in support of sharing it with a new
system information SRF called pg_config() which makes the same
information available via SQL. Additionally wrap the SRF with a
new system view, as called pg_config.

Patch by me with extensive input and review by Michael Paquier
and additional review by Alvaro Herrera.
2016-02-17 09:12:06 -08:00
Robert Haas f1f5ec1efa Reuse abbreviated keys in ordered [set] aggregates.
When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.

Peter Geoghegan, reviewd by Andreas Karlsson and by me.
2016-02-17 15:40:00 +05:30
Joe Conway 851636bfda Move DATA entry to correct position
In commit 7b4bfc87 the DATA and DESCR entries for the new
row_security_active() function were inadvertantly put after
the PROVOLATILE defines, rather than before as they should
have been placed. Move them up where they belong.

Backpatch to 9.5 where the new entries were introduced.
2016-02-15 16:38:47 -08:00
Andres Freund 7975c5e0a9 Allow the WAL writer to flush WAL at a reduced rate.
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al.  But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4.  The reason for this slowdown is
that if there are independent writes to the underlying devices, for
example because shared buffers is a lot smaller than the hot data set,
or because a checkpoint is ongoing, the fdatasync() calls force cache
flushes to be emitted to the storage.

This is achieved by flushing WAL only if the last flush was longer than
wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC)
unflushed blocks are pending. Based on some tests the default for
wal_writer_delay is 1MB, which seems to work well both on SSD and
rotational media.

To avoid negative performance impact due to 4de82f7d7 an earlier
commit (db76b1e) made SetHintBits() more likely to succeed; preventing
performance regressions in the pgbench tests I performed.

Discussion: 20160118163908.GW10941@awork2.anarazel.de
2016-02-16 00:56:34 +01:00
Tom Lane 8c95ae81fa Suppress compiler warnings about useless comparison of unsigned to zero.
Reportedly, some compilers warn about tests like "c < 0" if c is unsigned,
and hence complain about the character range checks I added in commit
3bb3f42f37.  This is a bit of a pain since
the regex library doesn't really want to assume that chr is unsigned.
However, since any such reconfiguration would involve manual edits of
regcustom.h anyway, we can put it on the shoulders of whoever wants to
do that to adjust this new range-checking macro correctly.

Per gripes from Coverity and Andres.
2016-02-15 17:12:16 -05:00
Joe Conway cfafd8bead Correct Copyright year from 2015 to 2016
Looks like this patch went in after Copyright messages
were updated for 2016 and it missed the boat. Fixed.
2016-02-15 13:19:35 -08:00
Noah Misch 9449c4b1ec Replace broken link in comment. 2016-02-15 02:35:52 -05:00
Robert Haas bcac23de73 Introduce extensible node types.
An extensible node is always tagged T_Extensible, but the extnodename
field identifies it more specifically; it may also include arbitrary
private data.  Extensible nodes can be copied, tested for equality,
serialized, and deserialized, but the core system doesn't know
anything about them otherwise.  Some extensions may find it useful to
include these nodes in fdw_private or custom_private lists in lieu of
arm-wrestling their data into a format that the core code can
understand.

Along the way, so as not to burden the authors of such extensible
node types too much, expose the functions for writing serialized
tokens, and for serializing and deserializing bitmapsets.

KaiGai Kohei, per a design suggested by me.  Reviewed by Andres Freund
and by me, and further edited by me.
2016-02-12 09:38:11 -05:00
Tom Lane d4c3a156cb Remove GROUP BY columns that are functionally dependent on other columns.
If a GROUP BY clause includes all columns of a non-deferred primary key,
as well as other columns of the same relation, those other columns are
redundant and can be dropped from the grouping; the pkey is enough to
ensure that each row of the table corresponds to a separate group.
Getting rid of the excess columns will reduce the cost of the sorting or
hashing needed to implement GROUP BY, and can indeed remove the need for
a sort step altogether.

This seems worth testing for since many query authors are not aware of
the GROUP-BY-primary-key exception to the rule about queries not being
allowed to reference non-grouped-by columns in their targetlists or
HAVING clauses.  Thus, redundant GROUP BY items are not uncommon.  Also,
we can make the test pretty cheap in most queries where it won't help
by not looking up a rel's primary key until we've found that at least
two of its columns are in GROUP BY.

David Rowley, reviewed by Julien Rouhaud
2016-02-11 17:34:59 -05:00
Tom Lane 72eee410d4 Move pg_constraint.h function declarations to new file pg_constraint_fn.h.
A pending patch requires exporting a function returning Bitmapset from
catalog/pg_constraint.c.  As things stand, that would mean including
nodes/bitmapset.h in pg_constraint.h, which might be hazardous for the
client-side includability of that header.  It's not entirely clear whether
any client-side code needs to include pg_constraint.h, but it seems prudent
to assume that there is some such code somewhere.  Therefore, split off the
function definitions into a new file pg_constraint_fn.h, similarly to what
we've done for some other catalog header files.
2016-02-11 15:51:28 -05:00
Robert Haas c319991bca Use separate lwlock tranches for buffer, lock, and predicate lock managers.
This finishes the work - spread across many commits over the last
several months - of putting each type of lock other than the named
individual locks into a separate tranche.

Amit Kapila
2016-02-11 14:07:33 -05:00
Robert Haas a455878d99 Rename PGPROC fields related to group XID clearing again.
Commit 0e141c0fbb introduced a new
facility to reduce ProcArrayLock contention by clearing several XIDs
from the ProcArray under a single lock acquisition.  The names
initially chosen were deemed not to be very good choices, so commit
4aec49899e renamed them.  But now it
seems like we still didn't get it right.  A pending patch wants to
add similar infrastructure for batching CLOG updates, so the names
need to be clear enough to allow a new set of structure members with
a related purpose.

Amit Kapila
2016-02-11 08:55:24 -05:00
Tom Lane c5e9b77127 Revert "Temporarily make pg_ctl and server shutdown a whole lot chattier."
This reverts commit 3971f64843 and a
couple of followon debugging commits; I think we've learned what we can
from them.
2016-02-10 16:01:04 -05:00
Robert Haas 79a7ff0fe5 Code cleanup in the wake of recent LWLock refactoring.
As of commit c1772ad922, there's no
longer any way of requesting additional LWLocks in the main tranche,
so we don't need NumLWLocks() or LWLockAssign() any more.  Also,
some of the allocation counters that we had previously aren't needed
any more either.

Amit Kapila
2016-02-10 09:58:09 -05:00
Tom Lane 3971f64843 Temporarily make pg_ctl and server shutdown a whole lot chattier.
This is a quick hack, due to be reverted when its purpose has been served,
to try to gather information about why some of the buildfarm critters
regularly fail with "postmaster does not shut down" complaints.  Maybe they
are just really overloaded, but maybe something else is going on.  Hence,
instrument pg_ctl to print the current time when it starts waiting for
postmaster shutdown and when it gives up, and add a lot of logging of the
current time in the server's checkpoint and shutdown code paths.

No attempt has been made to make this pretty.  I'm not even totally sure
if it will build on Windows, but we'll soon find out.
2016-02-08 18:43:11 -05:00
Tom Lane 3bb3f42f37 Fix some regex issues with out-of-range characters and large char ranges.
Previously, our regex code defined CHR_MAX as 0xfffffffe, which is a
bad choice because it is outside the range of type "celt" (int32).
Characters approaching that limit could lead to infinite loops in logic
such as "for (c = a; c <= b; c++)" where c is of type celt but the
range bounds are chr.  Such loops will work safely only if CHR_MAX+1
is representable in celt, since c must advance to beyond b before the
loop will exit.

Fortunately, there seems no reason not to restrict CHR_MAX to 0x7ffffffe.
It's highly unlikely that Unicode will ever assign codes that high, and
none of our other backend encodings need characters beyond that either.

In addition to modifying the macro, we have to explicitly enforce character
range restrictions on the values of \u, \U, and \x escape sequences, else
the limit is trivially bypassed.

Also, the code for expanding case-independent character ranges in bracket
expressions had a potential integer overflow in its calculation of the
number of characters it could generate, which could lead to allocating too
small a character vector and then overwriting memory.  An attacker with the
ability to supply arbitrary regex patterns could easily cause transient DOS
via server crashes, and the possibility for privilege escalation has not
been ruled out.

Quite aside from the integer-overflow problem, the range expansion code was
unnecessarily inefficient in that it always produced a result consisting of
individual characters, abandoning the knowledge that we had a range to
start with.  If the input range is large, this requires excessive memory.
Change it so that the original range is reported as-is, and then we add on
any case-equivalent characters that are outside that range.  With this
approach, we can bound the number of individual characters allowed without
sacrificing much.  This patch allows at most 100000 individual characters,
which I believe to be more than the number of case pairs existing in
Unicode, so that the restriction will never be hit in practice.

It's still possible for range() to take awhile given a large character code
range, so also add statement-cancel detection to its loop.  The downstream
function dovec() also lacked cancel detection, and could take a long time
given a large output from range().

Per fuzz testing by Greg Stark.  Back-patch to all supported branches.

Security: CVE-2016-0773
2016-02-08 10:25:40 -05:00
Robert Haas d89f06f048 Fix parallel-safety markings for pg_upgrade functions.
These establish backend-local state which will not be copied to
parallel workers, so they must be marked parallel-restricted, not
parallel-safe.
2016-02-07 11:45:21 -05:00
Robert Haas 7c944bd903 Introduce a new GUC force_parallel_mode for testing purposes.
When force_parallel_mode = true, we enable the parallel mode restrictions
for all queries for which this is believed to be safe.  For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process.  When force_parallel_mode = regress, make additional
changes to allow the regression tests to run cleanly even though parallel
workers have been injected under the hood.

Taken together, this facilitates both better user testing and better
regression testing of the parallelism code.

Robert Haas, with help from Amit Kapila and Rushabh Lathia.
2016-02-07 11:41:33 -05:00
Robert Haas a1c1af2a1f Introduce group locking to prevent parallel processes from deadlocking.
For locking purposes, we now regard heavyweight locks as mutually
non-conflicting between cooperating parallel processes.  There are some
possible pitfalls to this approach that are not to be taken lightly,
but it works OK for now and can be changed later if we find a better
approach.  Without this, it's very easy for parallel queries to
silently self-deadlock if the user backend holds strong relation locks.

Robert Haas, with help from Amit Kapila.  Thanks to Noah Misch and
Andres Freund for extensive discussion of possible issues with this
approach.
2016-02-07 10:16:13 -05:00
Tom Lane aa2387e2fd Improve speed of timestamp/time/date output functions.
It seems that sprintf(), at least in glibc's version, is unreasonably slow
compared to hand-rolled code for printing integers.  Replacing most uses of
sprintf() in the datetime.c output functions with special-purpose code
turns out to give more than a 2X speedup in COPY of a table with a single
timestamp column; which is pretty impressive considering all the other
logic in that code path.

David Rowley and Andres Freund, reviewed by Peter Geoghegan and myself
2016-02-06 23:11:28 -05:00
Robert Haas 78bea62ab0 Fix typo.
Amit Kapila
2016-02-05 07:56:59 -05:00
Tom Lane 6819514fca Add num_nulls() and num_nonnulls() to count NULL arguments.
An example use-case is "CHECK(num_nonnulls(a,b,c) = 1)" to assert that
exactly one of a,b,c isn't NULL.  The functions are variadic, so they
can also be pressed into service to count the number of null or nonnull
elements in an array.

Marko Tiikkaja, reviewed by Pavel Stehule
2016-02-04 23:03:37 -05:00
Robert Haas a104a017fc Add some additional core functions to support join pushdown for FDWs.
GetExistingLocalJoinPath() is useful for handling EvalPlanQual rechecks
properly, and GetUserMappingById() is needed to make sure you're using
the right credentials.

Shigeru Hanada, Etsuro Fujita, Ashutosh Bapat, Robert Haas
2016-02-04 17:05:09 -05:00
Robert Haas c1772ad922 Change the way that LWLocks for extensions are allocated.
The previous RequestAddinLWLocks() method had several disadvantages.
First, the locks would be in the main tranche; we've recently decided
that it's useful for LWLocks used for separate purposes to have
separate tranche IDs.  Second, there wasn't any correlation between
what code called RequestAddinLWLocks() and what code called
LWLockAssign(); when multiple modules are in use, it could become
quite difficult to troubleshoot problems where LWLockAssign() ran out
of locks.  To fix, create a concept of named LWLock tranches which
can be used either by extension or by core code.

Amit Kapila and Robert Haas
2016-02-04 16:43:04 -05:00
Robert Haas b47b4dbf68 Extend sortsupport for text to more opclasses.
Have varlena.c expose an interface that allows the char(n), bytea, and
bpchar types to piggyback on a now-generalized SortSupport for text.
This pushes a little more knowledge of the bpchar/char(n) type into
varlena.c than might be preferred, but that seems like the approach
that creates least friction.  Also speed things up for index builds
that use text_pattern_ops or varchar_pattern_ops.

This patch does quite a bit of renaming, but it seems likely to be
worth it, so as to avoid future confusion about the fact that this code
is now more generally used than the old names might have suggested.

Peter Geoghegan, reviewed by Álvaro Herrera and Andreas Karlsson,
with small tweaks by me.
2016-02-03 14:29:53 -05:00
Robert Haas 69d34408e5 Allow parallel custom and foreign scans.
This patch doesn't put the new infrastructure to use anywhere, and
indeed it's not clear how it could ever be used for something like
postgres_fdw which has to send an SQL query and wait for a reply,
but there might be FDWs or custom scan providers that are CPU-bound,
so let's give them a way to join club parallel.

KaiGai Kohei, reviewed by me.
2016-02-03 12:49:46 -05:00
Robert Haas f2305d40ec Remove CustomPath's TextOutCustomPath method.
You can't really do anything useful with this in the form it currently
exists; among other problems, there's no way to reread whatever
information might be produced when the path is output.  Work is
underway to replace this with a more useful and more general system of
extensible nodes, but let's start by getting rid of this bit.

Extracted from a larger patch by KaiGai Kohei.
2016-02-03 10:38:50 -05:00
Peter Eisentraut 7d17e683fc Add support for systemd service notifications
Insert sd_notify() calls at server start and stop for integration with
systemd.  This allows the use of systemd service units of type "notify",
which greatly simplifies the systemd configuration.

Reviewed-by: Pavel Stěhule <pavel.stehule@gmail.com>
2016-02-02 21:04:29 -05:00
Tom Lane 2ad83fff22 Remove unnecessary "implementation of FOO operator" DESCR() entries.
Apparently at least one committer hasn't gotten the word that these do not
need to be maintained by hand, since initdb will create them automatically.
Noted while fixing bug #13905.

No catversion bump since the post-initdb state is exactly the same either
way.  I don't see a need for back-patch, either.
2016-02-02 11:52:27 -05:00
Tom Lane a4627e8fd4 Fix pg_description entries for jsonb_to_record() and jsonb_to_recordset().
All the other jsonb function descriptions refer to the arguments as being
"jsonb", but these two said "json".  Make it consistent.  Per bug #13905
from Petru Florin Mihancea.

No catversion bump --- we can't force one in the back branches, and this
isn't very critical anyway.
2016-02-02 11:39:50 -05:00
Robert Haas 7191ce8bea Make all built-in lwlock tranche IDs fixed.
This makes the values more stable, which seems like a good thing for
anybody who needs to look at at them.

Alexander Korotkov and Amit Kapila
2016-02-02 06:45:55 -05:00
Robert Haas 2251179e6a Migrate replication slot I/O locks into a separate tranche.
This is following in a long train of similar changes and for the same
reasons - see b319356f0e and
fe702a7b3f inter alia.

Author: Amit Kapila
Reviewed-by: Alexander Korotkov, Robert Haas
2016-01-29 09:45:38 -05:00
Robert Haas b319356f0e Migrate PGPROC's backendLock into PGPROC itself, using a new tranche.
Previously, each PGPROC's backendLock was part of the main tranche,
and the PGPROC just contained a pointer.  Now, the actual LWLock is
part of the PGPROC.

As with previous, similar patches, this makes it significantly easier
to identify these lwlocks in LWLOCK_STATS or Trace_lwlocks output
and improves modularity.

Author: Ildus Kurbangaliev
Reviewed-by: Amit Kapila, Robert Haas
2016-01-29 08:14:28 -05:00
Robert Haas fbe5a3fb73 Only try to push down foreign joins if the user mapping OIDs match.
Previously, the foreign join pushdown infrastructure left the question
of security entirely up to individual FDWs, but it would be easy for
a foreign data wrapper to inadvertently open up subtle security holes
that way.  So, make it the core code's job to determine which user
mapping OID is relevant, and don't attempt join pushdown unless it's
the same for all relevant relations.

Per a suggestion from Tom Lane.  Shigeru Hanada and Ashutosh Bapat,
reviewed by Etsuro Fujita and KaiGai Kohei, with some further
changes by me.
2016-01-28 14:05:36 -05:00
Robert Haas 96198d94cb Avoid multiple foreign server connections when all use same user mapping.
Previously, postgres_fdw's connection cache was keyed by user OID and
server OID, but this can lead to multiple connections when it's not
really necessary.  In particular, if all relevant users are mapped to
the public user mapping, then their connection options are certainly
the same, so one connection can be used for all of them.

While we're cleaning things up here, drop the "server" argument to
GetConnection(), which isn't really needed.  This saves a few cycles
because callers no longer have to look this up; the function itself
does, but only when establishing a new connection, not when reusing
an existing one.

Ashutosh Bapat, with a few small changes by me.
2016-01-28 12:05:19 -05:00
Fujii Masao 7f46eaf035 Add gin_clean_pending_list function to clean up GIN pending list
This function cleans up the pending list of the GIN index by
moving entries in it to the main GIN data structure in bulk.
It returns the number of pages cleaned up from the pending list.

This function is useful, for example, when the pending list
needs to be cleaned up *quickly* to improve the performance of
the search using GIN index. VACUUM can do the same thing, too,
but it may take days to run on a large table.

Jeff Janes,
reviewed by Julien Rouhaud, Jaime Casanova, Alvaro Herrera and me.

Discussion: CAMkU=1x8zFkpfnozXyt40zmR3Ub_kHu58LtRmwHUKRgQss7=iQ@mail.gmail.com
2016-01-28 12:57:52 +09:00
Fujii Masao e09507a272 Fix volatility marking of pg_size_pretty function
pg_size_pretty function should be marked immutable rather than volatile
because it always returns the same result given the same argument.

Pavel Stehule
2016-01-27 11:13:31 +09:00
Tom Lane e1bd684a34 Add trigonometric functions that work in degrees.
The implementations go to some lengths to deliver exact results for values
where an exact result can be expected, such as sind(30) = 0.5 exactly.

Dean Rasheed, reviewed by Michael Paquier
2016-01-22 15:46:22 -05:00
Tom Lane a396144ac0 Remove new coupling between NAMEDATALEN and MAX_LEVENSHTEIN_STRLEN.
Commit e529cd4ffa introduced an Assert requiring NAMEDATALEN to be
less than MAX_LEVENSHTEIN_STRLEN, which has been 255 for a long time.
Since up to that instant we had always allowed NAMEDATALEN to be
substantially more than that, this was ill-advised.

It's debatable whether we need MAX_LEVENSHTEIN_STRLEN at all (versus
putting a CHECK_FOR_INTERRUPTS into the loop), or whether it has to be
so tight; but this patch takes the narrower approach of just not applying
the MAX_LEVENSHTEIN_STRLEN limit to calls from the parser.

Trusting the parser for this seems reasonable, first because the strings
are limited to NAMEDATALEN which is unlikely to be hugely more than 256,
and second because the maximum distance is tightly constrained by
MAX_FUZZY_DISTANCE (though we'd forgotten to make use of that limit in one
place).  That means the cost is not really O(mn) but more like O(max(m,n)).

Relaxing the limit for user-supplied calls is left for future research;
given the lack of complaints to date, it doesn't seem very high priority.

In passing, fix confusion between lengths-in-bytes and lengths-in-chars
in comments and error messages.

Per gripe from Kevin Day; solution suggested by Robert Haas.  Back-patch
to 9.5 where the unwanted restriction was introduced.
2016-01-22 11:53:06 -05:00
Tom Lane be44ed27b8 Improve index AMs' opclass validation procedures.
The amvalidate functions added in commit 65c5fcd353 were on the
crude side.  Improve them in a few ways:

* Perform signature checking for operators and support functions.

* Apply more thorough checks for missing operators and functions,
where possible.

* Instead of reporting problems as ERRORs, report most problems as INFO
messages and make the amvalidate function return FALSE.  This allows
more than one problem to be discovered per run.

* Report object names rather than OIDs, and work a bit harder on making
the messages understandable.

Also, remove a few more opr_sanity regression test queries that are
now superseded by the amvalidate checks.
2016-01-21 19:47:15 -05:00
Tom Lane b99551832e Add defenses against putting expanded objects into Const nodes.
Putting a reference to an expanded-format value into a Const node would be
a bad idea for a couple of reasons.  It'd be possible for the supposedly
immutable Const to change value, if something modified the referenced
variable ... in fact, if the Const's reference were R/W, any function that
has the Const as argument might itself change it at runtime.  Also, because
datumIsEqual() is pretty simplistic, the Const might fail to compare equal
to other Consts that it should compare equal to, notably including copies
of itself.  This could lead to unexpected planner behavior, such as "could
not find pathkey item to sort" errors or inferior plans.

I have not been able to find any way to get an expanded value into a Const
within the existing core code; but Paul Ramsey was able to trigger the
problem by writing a datatype input function that returns an expanded
value.

The best fix seems to be to establish a rule that varlena values being
placed into Const nodes should be passed through pg_detoast_datum().
That will do nothing (and cost little) in normal cases, but it will flatten
expanded values and thereby avoid the above problems.  Also, it will
convert short-header or compressed values into canonical format, which will
avoid possible unexpected lack-of-equality issues for those cases too.
And it provides a last-ditch defense against putting a toasted value into
a Const, which we already knew was dangerous, cf commit 2b0c86b665.
(In the light of this discussion, I'm no longer sure that that commit
provided 100% protection against such cases, but this fix should do it.)

The test added in commit 65c3d05e18 to catch datatype input functions
with unstable results would fail for functions that returned expanded
values; but it seems a bit uncharitable to deem a result unstable just
because it's expressed in expanded form, so revise the coding so that we
check for bitwise equality only after applying pg_detoast_datum().  That's
a sufficient condition anyway given the new rule about detoasting when
forming a Const.

Back-patch to 9.5 where the expanded-object facility was added.  It's
possible that this should go back further; but in the absence of clear
evidence that there's any live bug in older branches, I'll refrain for now.
2016-01-21 12:56:08 -05:00
Fujii Masao 38710a374e Remove unused argument from ginInsertCleanup()
It's an oversight in commit dc943ad.
2016-01-22 01:22:56 +09:00
Simon Riggs c80b31d557 Refactor headers to split out standby defs
Jeff Janes
2016-01-20 18:51:34 -08:00
Simon Riggs 978b2f65aa Speedup 2PC by skipping two phase state files in normal path
2PC state info is written only to WAL at PREPARE, then read back from WAL at
COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr
checkpoint cycle will be written to disk in the same form as previously. Crash
recovery path is not altered. Measured performance gains of 50-100% for short
2PC transactions by completely avoiding writing files and fsyncing. Other
optimizations still available, further patches in related areas expected.

Stas Kelvich and heavily edited by Simon Riggs

Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas,
a concrete example of how Postgres-XC has fed back ideas into PostgreSQL.

Reviewed by Michael Paquier, Jeff Janes and Andres Freund
Performance testing by Jesper Pedersen
2016-01-20 18:40:44 -08:00
Simon Riggs 422a55a687 Refactor to create generic WAL page read callback
Previously we didn’t have a generic WAL page read callback function,
surprisingly. Logical decoding has logical_read_local_xlog_page(), which was
actually generic, so move that to xlogfunc.c and rename to
read_local_xlog_page().
Maintain logical_read_local_xlog_page() so existing callers still work.

As requested by Michael Paquier, Alvaro Herrera and Andres Freund
2016-01-20 17:18:58 -08:00
Robert Haas 45be99f8cd Support parallel joins, and make related improvements.
The core innovation of this patch is the introduction of the concept
of a partial path; that is, a path which if executed in parallel will
generate a subset of the output rows in each process.  Gathering a
partial path produces an ordinary (complete) path.  This allows us to
generate paths for parallel joins by joining a partial path for one
side (which at the baserel level is currently always a Partial Seq
Scan) to an ordinary path on the other side.  This is subject to
various restrictions at present, especially that this strategy seems
unlikely to be sensible for merge joins, so only nested loops and
hash joins paths are generated.

This also allows an Append node to be pushed below a Gather node in
the case of a partitioned table.

Testing revealed that early versions of this patch made poor decisions
in some cases, which turned out to be caused by the fact that the
original cost model for Parallel Seq Scan wasn't very good.  So this
patch tries to make some modest improvements in that area.

There is much more to be done in the area of generating good parallel
plans in all cases, but this seems like a useful step forward.

Patch by me, reviewed by Dilip Kumar and Amit Kapila.
2016-01-20 14:40:26 -05:00
Robert Haas a7de3dc5c3 Support multi-stage aggregation.
Aggregate nodes now have two new modes: a "partial" mode where they
output the unfinalized transition state, and a "finalize" mode where
they accept unfinalized transition states rather than individual
values as input.

These new modes are not used anywhere yet, but they will be necessary
for parallel aggregation.  The infrastructure also figures to be
useful for cases where we want to aggregate local data and remote
data via the FDW interface, and want to bring back partial aggregates
from the remote side that can then be combined with locally generated
partial aggregates to produce the final value.  It may also be useful
even when neither FDWs nor parallelism are in play, as explained in
the comments in nodeAgg.c.

David Rowley and Simon Riggs, reviewed by KaiGai Kohei, Heikki
Linnakangas, Haribabu Kommi, and me.
2016-01-20 13:46:50 -05:00
Tom Lane dbe2328959 Fix assorted inconsistencies in GIN opclass support function declarations.
GIN had some minor issues too, mostly using "internal" where something
else would be more appropriate.  I went with the same approach as in
9ff60273e3, namely preferring the opclass' indexed datatype for
arguments that receive an operator RHS value, even if that's not
necessarily what they really are.

Again, this is with an eye to having a uniform rule for ginvalidate()
to check support function signatures.
2016-01-19 22:32:22 -05:00
Alvaro Herrera 948c97958b Add two HyperLogLog functions
New functions initHyperLogLogError() and freeHyperLogLog() simplify
using this module from elsewhere.

Author: Tomáš Vondra
Review: Peter Geoghegan
2016-01-19 17:40:15 -03:00
Tom Lane 9ff60273e3 Fix assorted inconsistencies in GiST opclass support function declarations.
The conventions specified by the GiST SGML documentation were widely
ignored.  For example, the strategy-number argument for "consistent" and
"distance" functions is specified to be a smallint, but most of the
built-in support functions declared it as an integer, and for that matter
the core code passed it using Int32GetDatum not Int16GetDatum.  None of
that makes any real difference at runtime, but it's quite confusing for
newcomers to the code, and it makes it very hard to write an amvalidate()
function that checks support function signatures.  So let's try to instill
some consistency here.

Another similar issue is that the "query" argument is not of a single
well-defined type, but could have different types depending on the strategy
(corresponding to search operators with different righthand-side argument
types).  Some of the functions threw up their hands and declared the query
argument as being of "internal" type, which surely isn't right ("any" would
have been more appropriate); but the majority position seemed to be to
declare it as being of the indexed data type, corresponding to a search
operator with both input types the same.  So I've specified a convention
that that's what to do always.

Also, the result of the "union" support function actually must be of the
index's storage type, but the documentation suggested declaring it to
return "internal", and some of the functions followed that.  Standardize
on telling the truth, instead.

Similarly, standardize on declaring the "same" function's inputs as
being of the storage type, not "internal".

Also, somebody had forgotten to add the "recheck" argument to both
the documentation of the "distance" support function and all of their
SQL declarations, even though the C code was happily using that argument.
Clean that up too.

Fix up some other omissions in the docs too, such as documenting that
union's second input argument is vestigial.

So far as the errors in core function declarations go, we can just fix
pg_proc.h and bump catversion.  Adjusting the erroneous declarations in
contrib modules is more debatable: in principle any change in those
scripts should involve an extension version bump, which is a pain.
However, since these changes are purely cosmetic and make no functional
difference, I think we can get away without doing that.
2016-01-19 12:04:36 -05:00
Tom Lane 65c5fcd353 Restructure index access method API to hide most of it at the C level.
This patch reduces pg_am to just two columns, a name and a handler
function.  All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function.  This is similar to
the designs we've adopted for FDWs and tablesample methods.  There
are multiple advantages.  For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.

A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL.  We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.

Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
2016-01-17 19:36:59 -05:00
Tom Lane 8d290c8ec6 Re-pgindent a few files.
In preparation for landing index AM interface changes.
2016-01-17 19:13:18 -05:00
Magnus Hagander cf7dfbf2d6 Fix minor typo in comment
Tatsuro Yamada
2016-01-15 10:24:37 +01:00
Simon Riggs e63bb4549a Add new user fn pg_current_xlog_flush_location()
Tomas Vondra, reviewed by Michael Paquier and Amit Kapila
Minor edits by me
2016-01-12 07:54:52 +00:00
Tom Lane 26d538dc93 Clean up some lack-of-STRICT issues in the core code, too.
A scan for missed proisstrict markings in the core code turned up
these functions:

brin_summarize_new_values
pg_stat_reset_single_table_counters
pg_stat_reset_single_function_counters
pg_create_logical_replication_slot
pg_create_physical_replication_slot
pg_drop_replication_slot

The first three of these take OID, so a null argument will normally look
like a zero to them, resulting in "ERROR: could not open relation with OID
0" for brin_summarize_new_values, and no action for the pg_stat_reset_XXX
functions.  The other three will dump core on a null argument, though this
is mitigated by the fact that they won't do so until after checking that
the caller is superuser or has rolreplication privilege.

In addition, the pg_logical_slot_get/peek[_binary]_changes family was
intentionally marked nonstrict, but failed to make nullness checks on all
the arguments; so again a null-pointer-dereference crash is possible but
only for superusers and rolreplication users.

Add the missing ARGISNULL checks to the latter functions, and mark the
former functions as strict in pg_proc.  Make that change in the back
branches too, even though we can't force initdb there, just so that
installations initdb'd in future won't have the issue.  Since none of these
bugs rise to the level of security issues (and indeed the pg_stat_reset_XXX
functions hardly misbehave at all), it seems sufficient to do this.

In addition, fix some order-of-operations oddities in the slot_get_changes
family, mostly cosmetic, but not the part that moves the function's last
few operations into the PG_TRY block.  As it stood, there was significant
risk for an error to exit without clearing historical information from
the system caches.

The slot_get_changes bugs go back to 9.4 where that code was introduced.
Back-patch appropriate subsets of the pg_proc changes into all active
branches, as well.
2016-01-09 16:58:32 -05:00
Simon Riggs 687f2cd7a0 Avoid pin scan for replay of XLOG_BTREE_VACUUM
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin scan
is skipped only in replay; the VACUUM code path on master is not touched here.

The current commit still performs the pin scan for toast indexes, though this
can also be avoided if we recheck scans on toast indexes. Later patch will
address this.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
2016-01-09 10:10:08 +00:00
Magnus Hagander 2650486ebc Fix typo in comment
Tatsuro Yamada
2016-01-08 08:54:40 +01:00
Alvaro Herrera b1a9bad9e7 pgstat: add WAL receiver status view & SRF
This new view provides insight into the state of a running WAL receiver
in a HOT standby node.
The information returned includes the PID of the WAL receiver process,
its status (stopped, starting, streaming, etc), start LSN and TLI, last
received LSN and TLI, timestamp of last message send and receipt, latest
end-of-WAL LSN and time, and the name of the slot (if any).

Access to the detailed data is only granted to superusers; others only
get the PID.

Author: Michael Paquier
Reviewer: Haribabu Kommi
2016-01-07 16:21:19 -03:00
Alvaro Herrera a967613911 Windows: Make pg_ctl reliably detect service status
pg_ctl is using isatty() to verify whether the process is running in a
terminal, and if not it sends its output to Windows' Event Log ... which
does the wrong thing when the output has been redirected to a pipe, as
reported in bug #13592.

To fix, make pg_ctl use the code we already have to detect service-ness:
in the master branch, move src/backend/port/win32/security.c to src/port
(with suitable tweaks so that it runs properly in backend and frontend
environments); pg_ctl already has access to pgport so it Just Works.  In
older branches, that's likely to cause trouble, so instead duplicate the
required code in pg_ctl.c.

Author: Michael Paquier
Bug report and diagnosis: Egon Kocjan
Backpatch: all supported branches
2016-01-07 11:59:08 -03:00
Alvaro Herrera abb1733922 Add scale(numeric)
Author: Marko Tiikkaja
2016-01-05 19:02:13 -03:00
Tom Lane ea0d494dae Make the to_reg*() functions accept text not cstring.
Using cstring as the input type was a poor decision, because that's not
really a full-fledged type.  In particular, it lacks implicit coercions
from text or varchar, meaning that usages like to_regproc('foo'||'bar')
wouldn't work; basically the only case that did work without explicit
casting was a simple literal constant argument.

The lack of field complaints about this suggests that hardly anyone
is using these functions, so hopefully fixing it won't cause much of
a compatibility problem.  They've only been there since 9.4, anyway.

Petr Korobeinikov
2016-01-05 13:02:43 -05:00
Alvaro Herrera efa318bcfa Make pg_shseclabel available in early backend startup
While the in-core authentication mechanism doesn't need to access
pg_shseclabel at all, it's reasonable to think that an authentication
hook will want to look at the label for the role logging in, or for rows
in other catalogs used during the authentication phase of startup.

Catalog version bumped, because this changes the "is nailed" status for
pg_shseclabel.

Author: Adam Brightwell
2016-01-05 14:50:53 -03:00
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Tom Lane 0dab5ef39b Fix ALTER OPERATOR to update dependencies properly.
Fix an oversight in commit 321eed5f0f7563a0: replacing an operator's
selectivity functions needs to result in a corresponding update in
pg_depend.  We have a function that can handle that, but it was not
called by AlterOperator().

To fix this without enlarging pg_operator.h's #include list beyond
what clients can safely include, split off the function definitions
into a new file pg_operator_fn.h, similarly to what we've done for
some other catalog header files.  It's not entirely clear whether
any client-side code needs to include pg_operator.h, but it seems
prudent to assume that there is some such code somewhere.
2015-12-31 17:37:31 -05:00
Joe Conway 241448b23a Rename (new|old)estCommitTs to (new|old)estCommitTsXid
The variables newestCommitTs and oldestCommitTs sound as if they are
timestamps, but in fact they are the transaction Ids that correspond
to the newest and oldest timestamps rather than the actual timestamps.
Rename these variables to reflect that they are actually xids: to wit
newestCommitTsXid and oldestCommitTsXid respectively. Also modify
related code in a similar fashion, particularly the user facing output
emitted by pg_controldata and pg_resetxlog.

Complaint and patch by me, review by Tom Lane and Alvaro Herrera.
Backpatch to 9.5 where these variables were first introduced.
2015-12-28 12:34:11 -08:00
Tom Lane 6efbded6e4 Allow omitting one or both boundaries in an array slice specifier.
Omitted boundaries represent the upper or lower limit of the corresponding
array subscript.  This allows simpler specification of many common
use-cases.

(Revised version of commit 9246af6799)

YUriy Zhuravlev
2015-12-22 21:05:29 -05:00
Robert Haas 0ba3f3bc65 Comment improvements for abbreviated keys.
Peter Geoghegan and Robert Haas
2015-12-22 13:57:18 -05:00
Robert Haas ccd8f97922 postgres_fdw: Consider requesting sorted data so we can do a merge join.
When use_remote_estimate is enabled, consider adding ORDER BY to the
query we sending to the remote server so that we can use that ordered
data for a merge join.  Commit f18c944b61
arranges to push down the query pathkeys, which seems like the case
mostly likely to be a win, but testing shows this can sometimes win,
too.

For a regular table, we know which indexes are present and therefore
test whether the ordering provided by each such index is useful.  Here,
we take the opposite approach: guess what orderings would be useful if
they could be generated cheaply, and then ask the remote side what those
will cost.

Ashutosh Bapat, with very substantial cosmetic revisions by me.  Also
reviewed by Rushabh Lathia.
2015-12-22 13:46:40 -05:00
Teodor Sigaev bbbd807097 Revert 9246af6799 because
I miss too much. Patch is returned to commitfest process.
2015-12-18 21:35:22 +03:00
Teodor Sigaev 9246af6799 Allow to omit boundaries in array subscript
Allow to omiy lower or upper or both boundaries in array subscript
for selecting slice of array.

Author: YUriy Zhuravlev
2015-12-18 15:18:58 +03:00
Tom Lane 66d947b9d3 Adjust behavior of single-user -j mode for better initdb error reporting.
Previously, -j caused the entire input file to be read in and executed as
a single command string.  That's undesirable, not least because any error
causes the entire file to be regurgitated as the "failing query".  Some
experimentation suggests a better rule: end the command string when we see
a semicolon immediately followed by two newlines, ie, an empty line after
a query.  This serves nicely to break up the existing examples such as
information_schema.sql and system_views.sql.  A limitation is that it's
no longer possible to write such a sequence within a string literal or
multiline comment in a file meant to be read with -j; but there are no
instances of such a problem within the data currently used by initdb.
(If someone does make such a mistake in future, it'll be obvious because
they'll get an unterminated-literal or unterminated-comment syntax error.)
Other than that, there shouldn't be any negative consequences; you're not
forced to end statements that way, it's just a better idea in most cases.

In passing, remove src/include/tcop/tcopdebug.h, which is dead code
because it's not included anywhere, and hasn't been for more than
ten years.  One of the debug-support symbols it purported to describe
has been unreferenced for at least the same amount of time, and the
other is removed by this commit on the grounds that it was useless:
forcing -j mode all the time would have broken initdb.  The lack of
complaints about that, or about the missing inclusion, shows that
no one has tried to use TCOP_DONTUSENEWLINE in many years.
2015-12-17 19:34:15 -05:00
Alvaro Herrera 756e7b4c9d Rework internals of changing a type's ownership
This is necessary so that REASSIGN OWNED does the right thing with
composite types, to wit, that it also alters ownership of the type's
pg_class entry -- previously, the pg_class entry remained owned by the
original user, which caused later other failures such as the new owner's
inability to use ALTER TYPE to rename an attribute of the affected
composite.  Also, if the original owner is later dropped, the pg_class
entry becomes owned by a non-existant user which is bogus.

To fix, create a new routine AlterTypeOwner_oid which knows whether to
pass the request to ATExecChangeOwner or deal with it directly, and use
that in shdepReassignOwner rather than calling AlterTypeOwnerInternal
directly.  AlterTypeOwnerInternal is now simpler in that it only
modifies the pg_type entry and recurses to handle a possible array type;
higher-level tasks are handled by either AlterTypeOwner directly or
AlterTypeOwner_oid.

I took the opportunity to add a few more objects to the test rig for
REASSIGN OWNED, so that more cases are exercised.  Additional ones could
be added for superuser-only-ownable objects (such as FDWs and event
triggers) but I didn't want to push my luck by adding a new superuser to
the tests on a backpatchable bug fix.

Per bug #13666 reported by Chris Pacejo.

Backpatch to 9.5.

(I would back-patch this all the way back, except that it doesn't apply
cleanly in 9.4 and earlier because 59367fdf9 wasn't backpatched.  If we
decide that we need this in earlier branches too, we should backpatch
both.)
2015-12-17 14:25:41 -03:00
Tom Lane 2ec477dc81 Cope with Readline's failure to track SIGWINCH events outside of input.
It emerges that libreadline doesn't notice terminal window size change
events unless they occur while collecting input.  This is easy to stumble
over if you resize the window while using a pager to look at query output,
but it can be demonstrated without any pager involvement.  The symptom is
that queries exceeding one line are misdisplayed during subsequent input
cycles, because libreadline has the wrong idea of the screen dimensions.

The safest, simplest way to fix this is to call rl_reset_screen_size()
just before calling readline().  That causes an extra ioctl(TIOCGWINSZ)
for every command; but since it only happens when reading from a tty, the
performance impact should be negligible.  A more valid objection is that
this still leaves a tiny window during entry to readline() wherein delivery
of SIGWINCH will be missed; but the practical consequences of that are
probably negligible.  In any case, there doesn't seem to be any good way to
avoid the race, since readline exposes no functions that seem safe to call
from a generic signal handler --- rl_reset_screen_size() certainly isn't.

It turns out that we also need an explicit rl_initialize() call, else
rl_reset_screen_size() dumps core when called before the first readline()
call.

rl_reset_screen_size() is not present in old versions of libreadline,
so we need a configure test for that.  (rl_initialize() is present at
least back to readline 4.0, so we won't bother with a test for it.)
We would need a configure test anyway since libedit's emulation of
libreadline doesn't currently include such a function.  Fortunately,
libedit seems not to have any corresponding bug.

Merlin Moncure, adjusted a bit by me
2015-12-16 16:59:35 -05:00
Robert Haas 6150a1b08a Move buffer I/O and content LWLocks out of the main tranche.
Move the content lock directly into the BufferDesc, so that locking and
pinning a buffer touches only one cache line rather than two.  Adjust
the definition of BufferDesc slightly so that this doesn't make the
BufferDesc any larger than one cache line (at least on platforms where
a spinlock is only 1 or 2 bytes).

We can't fit the I/O locks into the BufferDesc and stay within one
cache line, so move those to a completely separate tranche.  This
leaves a relatively limited number of LWLocks in the main tranche, so
increase the padding of those remaining locks to a full cache line,
rather than allowing adjacent locks to share a cache line, hopefully
reducing false sharing.

Performance testing shows that these changes make little difference
on laptop-class machines, but help significantly on larger servers,
especially those with more than 2 sockets.

Andres Freund, originally based on an earlier patch by Simon Riggs.
Review and cosmetic adjustments (including heavy rewriting of the
comments) by me.
2015-12-15 13:32:54 -05:00
Robert Haas 3fed417452 Provide a way to predefine LWLock tranche IDs.
It's a bit cumbersome to use LWLockNewTrancheId(), because the returned
value needs to be shared between backends so that each backend can call
LWLockRegisterTranche() with the correct ID.  So, for built-in tranches,
use a hard-coded value instead.

This is motivated by an upcoming patch adding further built-in tranches.

Andres Freund and Robert Haas
2015-12-15 11:48:19 -05:00
Stephen Frost 833728d4c8 Handle policies during DROP OWNED BY
DROP OWNED BY handled GRANT-based ACLs but was not removing roles from
policies.  Fix that by having DROP OWNED BY remove the role specified
from the list of roles the policy (or policies) apply to, or the entire
policy (or policies) if it only applied to the role specified.

As with ACLs, the DROP OWNED BY caller must have permission to modify
the policy or a WARNING is thrown and no change is made to the policy.
2015-12-11 16:12:25 -05:00
Tom Lane 4fcf48450d Get rid of the planner's LateralJoinInfo data structure.
I originally modeled this data structure on SpecialJoinInfo, but after
commit acfcd45cac that looks like a pretty poor decision.
All we really need is relid sets identifying laterally-referenced rels;
and most of the time, what we want to know about includes indirect lateral
references, a case the LateralJoinInfo data was unsuited to compute with
any efficiency.  The previous commit redefined RelOptInfo.lateral_relids
as the transitive closure of lateral references, so that it easily supports
checking indirect references.  For the places where we really do want just
direct references, add a new RelOptInfo field direct_lateral_relids, which
is easily set up as a copy of lateral_relids before we perform the
transitive closure calculation.  Then we can just drop lateral_info_list
and LateralJoinInfo and the supporting code.  This makes the planner's
handling of lateral references noticeably more efficient, and shorter too.

Such a change can't be back-patched into stable branches for fear of
breaking extensions that might be looking at the planner's data structures;
but it seems not too late to push it into 9.5, so I've done so.
2015-12-11 15:52:38 -05:00
Tom Lane acfcd45cac Still more fixes for planner's handling of LATERAL references.
More fuzz testing by Andreas Seltenreich exposed that the planner did not
cope well with chains of lateral references.  If relation X references Y
laterally, and Y references Z laterally, then we will have to scan X on the
inside of a nestloop with Z, so for all intents and purposes X is laterally
dependent on Z too.  The planner did not understand this and would generate
intermediate joins that could not be used.  While that was usually harmless
except for wasting some planning cycles, under the right circumstances it
would lead to "failed to build any N-way joins" or "could not devise a
query plan" planner failures.

To fix that, convert the existing per-relation lateral_relids and
lateral_referencers relid sets into their transitive closures; that is,
they now show all relations on which a rel is directly or indirectly
laterally dependent.  This not only fixes the chained-reference problem
but allows some of the relevant tests to be made substantially simpler
and faster, since they can be reduced to simple bitmap manipulations
instead of searches of the LateralJoinInfo list.

Also, when a PlaceHolderVar that is due to be evaluated at a join contains
lateral references, we should treat those references as indirect lateral
dependencies of each of the join's base relations.  This prevents us from
trying to join any individual base relations to the lateral reference
source before the join is formed, which again cannot work.

Andreas' testing also exposed another oversight in the "dangerous
PlaceHolderVar" test added in commit 85e5e222b1.  Simply rejecting
unsafe join paths in joinpath.c is insufficient, because in some cases
we will end up rejecting *all* possible paths for a particular join, again
leading to "could not devise a query plan" failures.  The restriction has
to be known also to join_is_legal and its cohort functions, so that they
will not select a join for which that will happen.  I chose to move the
supporting logic into joinrels.c where the latter functions are.

Back-patch to 9.3 where LATERAL support was introduced.
2015-12-11 14:22:20 -05:00
Alvaro Herrera 69e7235c93 Fix commit timestamp initialization
This module needs explicit initialization in order to replay WAL records
in recovery, but we had broken this recently following changes to make
other (stranger) scenarios work correctly.  To fix, rework the
initialization sequence so that it always takes place before WAL replay
commences for both master and standby.

I could have gone for a more localized fix that just added a "startup"
call for the master server, but it seemed better to restructure the
existing callers as well so that the whole thing made more sense.  As a
drawback, there is more control logic in xlog.c now than previously, but
doing otherwise meant passing down the ControlFile flag, which seemed
uglier as a whole.

This also meant adding a check to not re-execute ActivateCommitTs if it
had already been called.

Reported by Fujii Masao.

Backpatch to 9.5.
2015-12-11 14:30:43 -03:00
Andres Freund e3f4cfc7aa Fix bug leading to restoring unlogged relations from empty files.
At the end of crash recovery, unlogged relations are reset to the empty
state, using their init fork as the template. The init fork is copied to
the main fork without going through shared buffers. Unfortunately WAL
replay so far has not necessarily flushed writes from shared buffers to
disk at that point. In normal crash recovery, and before the
introduction of 'fast promotions' in fd4ced523 / 9.3, the
END_OF_RECOVERY checkpoint flushes the buffers out in time. But with
fast promotions that's not the case anymore.

To fix, force WAL writes targeting the init fork to be flushed
immediately (using the new FlushOneBuffer() function). In 9.5+ that
flush can centrally be triggered from the code dealing with restoring
full page writes (XLogReadBufferForRedoExtended), in earlier releases
that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay
function.

Backpatch to 9.1, even if this currently is only known to trigger in
9.3+. Flushing earlier is more robust, and it is advantageous to keep
the branches similar.

Typical symptoms of this bug are errors like
'ERROR:  index "..." contains unexpected zero page at block 0'
shortly after promoting a node.

Reported-By: Thom Brown
Author: Andres Freund and Michael Paquier
Discussion: 20150326175024.GJ451@alap3.anarazel.de
Backpatch: 9.1-
2015-12-10 16:29:26 +01:00
Robert Haas b287df70e4 Allow EXPLAIN (ANALYZE, VERBOSE) to display per-worker statistics.
The original parallel sequential scan commit included only very limited
changes to the EXPLAIN output.  Aggregated totals from all workers were
displayed, but there was no way to see what each individual worker did
or to distinguish the effort made by the workers from the effort made by
the leader.

Per a gripe by Thom Brown (and maybe others).  Patch by me, reviewed
by Amit Kapila.
2015-12-09 13:21:19 -05:00
Kevin Grittner 25c5392330 Improve performance in freeing memory contexts
The single linked list of memory contexts could result in O(N^2)
performance to free a set of contexts if they were not freed in
reverse order of creation.  In many cases the reverse order was
used, but there were some significant exceptions that caused real-
world performance problems.  Rather than requiring all callers to
care about the order in which contexts were freed, and hunting down
and changing all existing cases where the wrong order was used, we
add one pointer per memory context so that the implementation
details are not so visible.

Jan Wieck
2015-12-08 17:32:49 -06:00
Robert Haas 385f337c9f Allow foreign and custom joins to handle EvalPlanQual rechecks.
Commit e7cb7ee145 provided basic
infrastructure for allowing a foreign data wrapper or custom scan
provider to replace a join of one or more tables with a scan.
However, this infrastructure failed to take into account the need
for possible EvalPlanQual rechecks, and ExecScanFetch would fail
an assertion (or just overwrite memory) if such a check was attempted
for a plan containing a pushed-down join.  To fix, adjust the EPQ
machinery to skip some processing steps when scanrelid == 0, making
those the responsibility of scan's recheck method, which also has
the responsibility in this case of correctly populating the relevant
slot.

To allow foreign scans to gain control in the right place to make
use of this new facility, add a new, optional RecheckForeignScan
method.  Also, allow a foreign scan to have a child plan, which can
be used to correctly populate the slot (or perhaps for something
else, but this is the only use currently envisioned).

KaiGai Kohei, reviewed by Robert Haas, Etsuro Fujita, and Kyotaro
Horiguchi.
2015-12-08 12:31:03 -05:00
Tom Lane edca44b152 Simplify LATERAL-related calculations within add_paths_to_joinrel().
While convincing myself that commit 7e19db0c09 would solve both of
the problems recently reported by Andreas Seltenreich, I realized that
add_paths_to_joinrel's handling of LATERAL restrictions could be made
noticeably simpler and faster if we were to retain the minimum possible
parameterization for each joinrel (that is, the set of relids supplying
unsatisfied lateral references in it).  We already retain that for
baserels, in RelOptInfo.lateral_relids, so we can use that field for
joinrels too.

I re-pgindent'd the files touched here, which affects some unrelated
comments.

This is, I believe, just a minor optimization not a bug fix, so no
back-patch.
2015-12-07 18:56:17 -05:00
Tom Lane 7e19db0c09 Fix another oversight in checking if a join with LATERAL refs is legal.
It was possible for the planner to decide to join a LATERAL subquery to
the outer side of an outer join before the outer join itself is completed.
Normally that's fine because of the associativity rules, but it doesn't
work if the subquery contains a lateral reference to the inner side of the
outer join.  In such a situation the outer join *must* be done first.
join_is_legal() missed this consideration and would allow the join to be
attempted, but the actual path-building code correctly decided that no
valid join path could be made, sometimes leading to planner errors such as
"failed to build any N-way joins".

Per report from Andreas Seltenreich.  Back-patch to 9.3 where LATERAL
support was added.
2015-12-07 17:42:11 -05:00
Alvaro Herrera 820ddb2c2f Further tweak commit_timestamp behavior
As pointed out by Fujii Masao, we weren't quite there on a standby
behaving sanely: first because we were failing to acquire the correct
state in the case where no XLOG_PARAMETER_CHANGE message was sent
(because a checkpoint had already happened after the setting was changed
in the master, and then the standby was restarted); and second because
promoting the standby with the feature enabled failed to activate it if
the master had the feature disabled.

This patch fixes both those misbehaviors hopefully without
re-introducing any old problems.

Also change the hint emitted in a standby together with the error
message about the feature being disabled, to make it point out that the
place to chance the setting is the master.  Otherwise, if the setting is
already enabled in the standby, it is very confusing to have it say that
the setting must be enabled ...

Authors: Álvaro Herrera, Petr Jelínek.
Backpatch to 9.5.
2015-12-03 19:22:31 -03:00
Tom Lane ec7eef6b11 Avoid caching expression state trees for domain constraints across queries.
In commit 8abb3cda0d I attempted to cache
the expression state trees constructed for domain CHECK constraints for
the life of the backend (assuming the domain's constraints don't get
redefined).  However, this turns out not to work very well, because
execQual.c will run those state trees with ecxt_per_query_memory pointing
to a query-lifespan context, and in some situations we'll end up with
pointers into that context getting stored into the state trees.  This
happens in particular with SQL-language functions, as reported by
Emre Hasegeli, but there are many other cases.

To fix, keep only the expression plan trees for domain CHECK constraints
in the typcache's data structure, and revert to performing ExecInitExpr
(at least) once per query to set up expression state trees in the query's
context.

Eventually it'd be nice to undo this, but that will require some careful
thought about memory management for expression state trees, and it seems
far too late for any such redesign in 9.5.  This way is still much more
efficient than what happened before 8abb3cda0.
2015-11-29 18:18:42 -05:00
Tom Lane 8d32717b6b Avoid doing encoding conversions by double-conversion via MULE_INTERNAL.
Previously, we did many conversions for Cyrillic and Central European
single-byte encodings by converting to a related MULE_INTERNAL coding
scheme before converting to the destination.  This seems unnecessarily
inefficient.  Moreover, if the conversion encounters an untranslatable
character, the error message will confusingly complain about failure
to convert to or from MULE_INTERNAL, rather than the user-visible
encodings.  Worse still, this approach results in some completely
unnecessary conversion failures; there are cases where the chosen
MULE subset lacks characters that exist in both of the user-visible
encodings, causing a conversion failure that need not occur.

This patch fixes the first two of those deficiencies by introducing
a new local2local() conversion support subroutine for direct conversion
between any two single-byte character sets, and adding new conversion
tables where needed.  However, I generated the new conversion tables by
testing PG 9.5's behavior, so that the actual conversion behavior is
bug-compatible with previous releases; the only user-visible behavior
change is that the error messages for conversion failures are saner.
Changes in the conversion behavior will probably ensue after discussion.

Interestingly, although this approach requires more tables, the .so files
actually end up smaller (at least on my x86_64 machine); the tables are
smaller than the management code needed for double conversion.

Per a complaint from Albe Laurenz.
2015-11-28 13:42:27 -05:00
Teodor Sigaev 92e38182d7 COPY (INSERT/UPDATE/DELETE .. RETURNING ..)
Attached is a patch for being able to do COPY (query) without a CTE.

Author: Marko Tiikkaja
Review: Michael Paquier
2015-11-27 19:11:22 +03:00
Tom Lane 00cdd83521 Adopt the GNU convention for handling tar-archive members exceeding 8GB.
The POSIX standard for tar headers requires archive member sizes to be
printed in octal with at most 11 digits, limiting the representable file
size to 8GB.  However, GNU tar and apparently most other modern tars
support a convention in which oversized values can be stored in base-256,
allowing any practical file to be a tar member.  Adopt this convention
to remove two limitations:
* pg_dump with -Ft output format failed if the contents of any one table
exceeded 8GB.
* pg_basebackup failed if the data directory contained any file exceeding
8GB.  (This would be a fatal problem for installations configured with a
table segment size of 8GB or more, and it has also been seen to fail when
large core dump files exist in the data directory.)

File sizes under 8GB are still printed in octal, so that no compatibility
issues are created except in cases that would have failed entirely before.

In addition, this patch fixes several bugs in the same area:

* In 9.3 and later, we'd defined tarCreateHeader's file-size argument as
size_t, which meant that on 32-bit machines it would write a corrupt tar
header for file sizes between 4GB and 8GB, even though no error was raised.
This broke both "pg_dump -Ft" and pg_basebackup for such cases.

* pg_restore from a tar archive would fail on tables of size between 4GB
and 8GB, on machines where either "size_t" or "unsigned long" is 32 bits.
This happened even with an archive file not affected by the previous bug.

* pg_basebackup would fail if there were files of size between 4GB and 8GB,
even on 64-bit machines.

* In 9.3 and later, "pg_basebackup -Ft" failed entirely, for any file size,
on 64-bit big-endian machines.

In view of these potential data-loss bugs, back-patch to all supported
branches, even though removal of the documented 8GB limit might otherwise
be considered a new feature rather than a bug fix.
2015-11-21 20:21:31 -05:00
Tom Lane 074c5cfbfb Fix handling of inherited check constraints in ALTER COLUMN TYPE (again).
The previous way of reconstructing check constraints was to do a separate
"ALTER TABLE ONLY tab ADD CONSTRAINT" for each table in an inheritance
hierarchy.  However, that way has no hope of reconstructing the check
constraints' own inheritance properties correctly, as pointed out in
bug #13779 from Jan Dirk Zijlstra.  What we should do instead is to do
a regular "ALTER TABLE", allowing recursion, at the topmost table that
has a particular constraint, and then suppress the work queue entries
for inherited instances of the constraint.

Annoyingly, we'd tried to fix this behavior before, in commit 5ed6546cf,
but we failed to notice that it wasn't reconstructing the pg_constraint
field values correctly.

As long as I'm touching pg_get_constraintdef_worker anyway, tweak it to
always schema-qualify the target table name; this seems like useful backup
to the protections installed by commit 5f173040.

In HEAD/9.5, get rid of get_constraint_relation_oids, which is now unused.
(I could alternatively have modified it to also return conislocal, but that
seemed like a pretty single-purpose API, so let's not pretend it has some
other use.)  It's unused in the back branches as well, but I left it in
place just in case some third-party code has decided to use it.

In HEAD/9.5, also rename pg_get_constraintdef_string to
pg_get_constraintdef_command, as the previous name did nothing to explain
what that entry point did differently from others (and its comment was
equally useless).  Again, that change doesn't seem like material for
back-patching.

I did a bit of re-pgindenting in tablecmds.c in HEAD/9.5, as well.

Otherwise, back-patch to all supported branches.
2015-11-20 14:55:47 -05:00
Robert Haas bc4996e61b Make ALTER .. SET SCHEMA do nothing, instead of throwing an ERROR.
This was already true for CREATE EXTENSION, but historically has not
been true for other object types.  Therefore, this is a backward
incompatibility.  Per discussion on pgsql-hackers, everyone seems to
agree that the new behavior is better.

Marti Raudsepp, reviewed by Haribabu Kommi and myself
2015-11-19 10:49:25 -05:00
Robert Haas 166b61a88e Avoid aggregating worker instrumentation multiple times.
Amit Kapila, per design ideas from me.
2015-11-18 12:35:25 -05:00
Robert Haas e93b62985f Remove volatile qualifiers from bufmgr.c and freelist.c
Prior to commit 0709b7ee72, access to
variables within a spinlock-protected critical section had to be done
through a volatile pointer, but that should no longer be necessary.

Review by Andres Freund
2015-11-16 18:50:06 -05:00
Robert Haas fe702a7b3f Move each SLRU's lwlocks to a separate tranche.
This makes it significantly easier to identify these lwlocks in
LWLOCK_STATS or Trace_lwlocks output.  It's also arguably better
from a modularity standpoint, since lwlock.c no longer needs to
know anything about the LWLock needs of the higher-level SLRU
facility.

Ildus Kurbangaliev, reviewd by Álvaro Herrera and by me.
2015-11-12 14:59:09 -05:00
Robert Haas a05dc4d7fd Provide readfuncs support for custom scans.
Commit a0d9f6e434 added this support for
all other plan node types; this fills in the gap.

Since TextOutCustomScan complicates this and is pretty well useless,
remove it.

KaiGai Kohei, with some modifications by me.
2015-11-12 07:40:31 -05:00
Robert Haas 80558c1f5a Generate parallel sequential scan plans in simple cases.
Add a new flag, consider_parallel, to each RelOptInfo, indicating
whether a plan for that relation could conceivably be run inside of
a parallel worker.  Right now, we're pretty conservative: for example,
it might be possible to defer applying a parallel-restricted qual
in a worker, and later do it in the leader, but right now we just
don't try to parallelize access to that relation.  That's probably
the right decision in most cases, anyway.

Using the new flag, generate parallel sequential scan plans for plain
baserels, meaning that we now have parallel sequential scan in
PostgreSQL.  The logic here is pretty unsophisticated right now: the
costing model probably isn't right in detail, and we can't push joins
beneath Gather nodes, so the number of plans that can actually benefit
from this is pretty limited right now.  Lots more work is needed.
Nevertheless, it seems time to enable this functionality so that all
this code can actually be tested easily by users and developers.

Note that, if you wish to test this functionality, it will be
necessary to set max_parallel_degree to a value greater than the
default of 0.  Once a few more loose ends have been tidied up here, we
might want to consider changing the default value of this GUC, but
I'm leaving it alone for now.

Along the way, fix a bug in cost_gather: the previous coding thought
that a Gather node's transfer overhead should be costed on the basis of
the relation size rather than the number of tuples that actually need
to be passed off to the leader.

Patch by me, reviewed in earlier versions by Amit Kapila.
2015-11-11 09:02:52 -05:00
Robert Haas f0661c4e8c Make sequential scans parallel-aware.
In addition, this path fills in a number of missing bits and pieces in
the parallel infrastructure.  Paths and plans now have a parallel_aware
flag indicating whether whatever parallel-aware logic they have should
be engaged.  It is believed that we will need this flag for a number of
path/plan types, not just sequential scans, which is why the flag is
generic rather than part of the SeqScan structures specifically.
Also, execParallel.c now gives parallel nodes a chance to initialize
their PlanState nodes from the DSM during parallel worker startup.

Amit Kapila, with a fair amount of adjustment by me.  Review of previous
patch versions by Haribabu Kommi and others.
2015-11-11 08:57:52 -05:00
Tom Lane c5e86ea932 Add "xid <> xid" and "xid <> int4" operators.
The corresponding "=" operators have been there a long time, and not
having their negators is a bit of a nuisance.

Michael Paquier
2015-11-07 16:40:15 -05:00
Robert Haas 6e71dd7ce9 Modify tqueue infrastructure to support transient record types.
Commit 4a4e6893aa, which introduced this
mechanism, failed to account for the fact that the RECORD pseudo-type
uses transient typmods that are only meaningful within a single
backend.  Transferring such tuples without modification between two
cooperating backends does not work.  This commit installs a system
for passing the tuple descriptors over the same shm_mq being used to
send the tuples themselves.  The two sides might not assign the same
transient typmod to any given tuple descriptor, so we must also
substitute the appropriate receiver-side typmod for the one used by
the sender.  That adds some CPU overhead, but still seems better than
being unable to pass records between cooperating parallel processes.

Along the way, move the logic for handling multiple tuple queues from
tqueue.c to nodeGather.c; tqueue.c now provides a TupleQueueReader,
which reads from a single queue, rather than a TupleQueueFunnel, which
potentially reads from multiple queues.  This change was suggested
previously as a way to make sure that nodeGather.c rather than tqueue.c
had policy control over the order in which to read from queues, but
it wasn't clear to me until now how good an idea it was.  typmod
mapping needs to be performed separately for each queue, and it is
much simpler if the tqueue.c code handles that and leaves multiplexing
multiple queues to higher layers of the stack.
2015-11-06 16:58:45 -05:00
Robert Haas a76ef15d9f Add sort support routine for the UUID data type.
This introduces a simple encoding scheme to produce abbreviated keys:
pack as many bytes of each UUID as will fit into a Datum.  On
little-endian machines, a byteswap is also performed; the abbreviated
comparator can therefore just consist of a simple 3-way unsigned integer
comparison.

The purpose of this change is to speed up sorting data on a column
of type UUID.

Peter Geoghegan
2015-11-06 12:14:35 -05:00
Robert Haas 64b2e7ad91 Pass extra data to bgworkers, and use this to fix parallel contexts.
Up until now, the total amount of data that could be passed to a
background worker at startup was one datum, which can be a small as
4 bytes on some systems.  That's enough to pass a dsm_handle or an
array index, but not much else.  Add a bgw_extra flag to the
BackgroundWorker struct, allowing up to 128 bytes to be passed to
a new worker on any platform.

Use this to fix a problem I recently discovered with the parallel
context machinery added in 9.5: the master assigns each worker an
array index, and each worker subsequently assigns itself an array
index, and there's nothing to guarantee that the two sets of indexes
match, leading to chaos.

Normally, I would not back-patch the change to add bgw_extra, since it
is basically a feature addition.  However, since 9.5 is still in beta
and there seems to be no other sensible way to repair the broken
parallel context machinery, back-patch to 9.5.  Existing background
worker code can ignore the bgw_extra field without a problem, but
might need to be recompiled since the structure size has changed.

Report and patch by me.  Review by Amit Kapila.
2015-11-05 12:13:56 -05:00
Tom Lane d894941663 Allow postgres_fdw to ship extension funcs/operators for remote execution.
The user can whitelist specified extension(s) in the foreign server's
options, whereupon we will treat immutable functions and operators of those
extensions as candidates to be sent for remote execution.

Whitelisting an extension in this way basically promises that the extension
exists on the remote server and behaves compatibly with the local instance.
We have no way to prove that formally, so we have to rely on the user to
get it right.  But this seems like something that people can usually get
right in practice.

We might in future allow functions and operators to be whitelisted
individually, but extension granularity is a very convenient special case,
so it got done first.

The patch as-committed lacks any regression tests, which is unfortunate,
but introducing dependencies on other extensions for testing purposes
would break "make installcheck" scenarios, which is worse.  I have some
ideas about klugy ways around that, but it seems like material for a
separate patch.  For the moment, leave the problem open.

Paul Ramsey, hacked up a bit more by me
2015-11-03 18:42:18 -05:00
Robert Haas 1efc7e5382 Fix problems with ParamListInfo serialization mechanism.
Commit d1b7c1ffe7 introduced a mechanism
for serializing a ParamListInfo structure to be passed to a parallel
worker.  However, this mechanism failed to handle external expanded
values, as pointed out by Noah Misch.  Repair.

Moreover, plpgsql_param_fetch requires adjustment because the
serialization mechanism needs it to skip evaluating unused parameters
just as we would do when it is called from copyParamList, but params
== estate->paramLI in that case.  To fix, make the bms_is_member test
in that function unconditional.

Finally, have setup_param_list set a new ParamListInfo field,
paramMask, to the parameters actually used in the expression, so that
we don't try to fetch those that are not needed when serializing a
parameter list.  This isn't necessary for correctness, but it makes
the performance of the parallel executor code comparable to what we
do for cases involving cursors.

Design suggestions and extensive review by Noah Misch.  Patch by me.
2015-11-02 18:11:29 -05:00
Tom Lane 12c9a04008 Implement lookbehind constraints in our regular-expression engine.
A lookbehind constraint is like a lookahead constraint in that it consumes
no text; but it checks for existence (or nonexistence) of a match *ending*
at the current point in the string, rather than one *starting* at the
current point.  This is a long-requested feature since it exists in many
other regex libraries, but Henry Spencer had never got around to
implementing it in the code we use.

Just making it work is actually pretty trivial; but naive copying of the
logic for lookahead constraints leads to code that often spends O(N^2) time
to scan an N-character string, because we have to run the match engine
from string start to the current probe point each time the constraint is
checked.  In typical use-cases a lookbehind constraint will be written at
the start of the regex and hence will need to be checked at every character
--- so O(N^2) work overall.  To fix that, I introduced a third copy of the
core DFA matching loop, paralleling the existing longest() and shortest()
loops.  This version, matchuntil(), can suspend and resume matching given
a couple of pointers' worth of storage space.  So we need only run it
across the string once, stopping at each interesting probe point and then
resuming to advance to the next one.

I also put in an optimization that simplifies one-character lookahead and
lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND
constraints, which already existed in the engine.  This avoids the overhead
of the LACON machinery entirely for these rather common cases.

The net result is that lookbehind constraints run a factor of three or so
slower than Perl's for multi-character constraints, but faster than Perl's
for one-character constraints ... and they work fine for variable-length
constraints, which Perl gives up on entirely.  So that's not bad from a
competitive perspective, and there's room for further optimization if
anyone cares.  (In reality, raw scan rate across a large input string is
probably not that big a deal for Postgres usage anyway; so I'm happy if
it's linear.)
2015-10-30 19:14:19 -04:00
Robert Haas 3a1f8611f2 Update parallel executor support to reuse the same DSM.
Commit b0b0d84b3d purported to make it
possible to relaunch workers using the same parallel context, but it had
an unpleasant race condition: we might reinitialize after the workers
have sent their last control message but before they have dettached the
DSM, leaving to crashes.  Repair by introducing a new ParallelContext
operation, ReinitializeParallelDSM.

Adjust execParallel.c to use this new support, so that we can rescan a
Gather node by relaunching workers but without needing to recreate the
DSM.

Amit Kapila, with some adjustments by me.  Extracted from latest parallel
sequential scan patch.
2015-10-30 10:44:54 +01:00
Robert Haas 8538a63070 Make Gather node projection-capable.
The original Gather code failed to mark a Gather node as not able to
do projection, but it couldn't, even though it did call initialize its
projection info via ExecAssignProjectionInfo.  There doesn't seem to
be any good reason for this node not to have projection capability,
so clean things up so that it does.  Without this, plans using Gather
nodes might need to carry extra Result nodes to do projection.
2015-10-28 00:27:58 +01:00
Alvaro Herrera 531d21b75f Cleanup commit timestamp module activaction, again
Further tweak commit_ts.c so that on a standby the state is completely
consistent with what that in the master, rather than behaving
differently in the cases that the settings differ.  Now in standby and
master the module should always be active or inactive in lockstep.

Author: Petr Jelínek, with some further tweaks by Álvaro Herrera.

Backpatch to 9.5, where commit timestamps were introduced.

Discussion: http://www.postgresql.org/message-id/5622BF9D.2010409@2ndquadrant.com
2015-10-27 15:06:50 -03:00
Tom Lane d435542583 Fix incorrect translation of minus-infinity datetimes for json/jsonb.
Commit bda76c1c8c caused both plus and
minus infinity to be rendered as "infinity", which is not only wrong
but inconsistent with the pre-9.4 behavior of to_json().  Fix that by
duplicating the coding in date_out/timestamp_out/timestamptz_out more
closely.  Per bug #13687 from Stepan Perlov.  Back-patch to 9.4, like
the previous commit.

In passing, also re-pgindent json.c, since it had gotten a bit messed up by
recent patches (and I was already annoyed by indentation-related problems
in back-patching this fix ...)
2015-10-20 11:07:04 -07:00
Robert Haas a1c466c5dd Fix incorrect comment in plannodes.h
Etsuro Fujita
2015-10-20 11:11:35 -04:00
Robert Haas ee7ca559fc Add a C API for parallel heap scans.
Using this API, one backend can set up a ParallelHeapScanDesc to
which multiple backends can then attach.  Each tuple in the relation
will be returned to exactly one of the scanning backends.  Only
forward scans are supported, and rescans must be carefully
coordinated.

This is not exposed to the planner or executor yet.

The original version of this code was written by me.  Amit Kapila
reviewed it, tested it, and improved it, including adding support for
synchronized scans, per review comments from Jeff Davis.  Extensive
testing of this and related patches was performed by Haribabu Kommi.
Final cleanup of this patch by me.
2015-10-16 17:33:18 -04:00
Robert Haas b0b0d84b3d Allow a parallel context to relaunch workers.
This may allow some callers to avoid the overhead involved in tearing
down a parallel context and then setting up a new one, which means
releasing the DSM and then allocating and populating a new one.  I
suspect we'll want to revise the Gather node to make use of this new
capability, but even if not it may be useful elsewhere and requires
very little additional code.
2015-10-16 17:18:05 -04:00
Tom Lane 538b3b8b35 Improve memory-usage accounting in regular-expression compiler.
This code previously counted the number of NFA states it created, and
complained if a limit was exceeded, so as to prevent bizarre regex patterns
from consuming unreasonable time or memory.  That's fine as far as it went,
but the code paid no attention to how many arcs linked those states.  Since
regexes can be contrived that have O(N) states but will need O(N^2) arcs
after fixempties() processing, it was still possible to blow out memory,
and take a long time doing it too.  To fix, modify the bookkeeping to count
space used by both states and arcs.

I did not bother with including the "color map" in the accounting; it
can only grow to a few megabytes, which is not a lot in comparison to
what we're allowing for states+arcs (about 150MB on 64-bit machines
or half that on 32-bit machines).

Looking at some of the larger real-world regexes captured in the Tcl
regression test suite suggests that the most that is likely to be needed
for regexes found in the wild is under 10MB, so I believe that the current
limit has enough headroom to make it okay to keep it as a hard-wired limit.

In connection with this, redefine REG_ETOOBIG as meaning "regular
expression is too complex"; the previous wording of "nfa has too many
states" was already somewhat inapropos because of the error code's use
for stack depth overrun, and it was not very user-friendly either.

Back-patch to all supported branches.
2015-10-16 15:55:59 -04:00
Tom Lane 579840ca05 Fix O(N^2) performance problems in regular-expression compiler.
Change the singly-linked in-arc and out-arc lists to be doubly-linked,
so that arc deletion is constant time rather than having worst-case time
proportional to the number of other arcs on the connected states.

Modify the bulk arc transfer operations copyins(), copyouts(), moveins(),
moveouts() so that they use a sort-and-merge algorithm whenever there's
more than a small number of arcs to be copied or moved.  The previous
method is O(N^2) in the number of arcs involved, because it performs
duplicate checking independently for each copied arc.  The new method may
change the ordering of existing arcs for the destination state, but nothing
really cares about that.

Provide another bulk arc copying method mergeins(), which is unused as
of this commit but is needed for the next one.  It basically is like
copyins(), but the source arcs might not all come from the same state.

Replace the O(N^2) bubble-sort algorithm used in carcsort() with a qsort()
call.

These changes greatly improve the performance of regex compilation for
large or complex regexes, at the cost of extra space for arc storage during
compilation.  The original tradeoff was probably fine when it was made, but
now we care more about speed and less about memory consumption.

Back-patch to all supported branches.
2015-10-16 15:55:59 -04:00