Commit Graph

14976 Commits

Author SHA1 Message Date
Robert Haas 3d6d1b5855 Move out-of-memory error checks from aset.c to mcxt.c
This potentially allows us to add mcxt.c interfaces that do something
other than throw an error when memory cannot be allocated.  We'll
handle adding those interfaces in a separate commit.

Michael Paquier, with minor changes by me
2015-01-29 10:23:38 -05:00
Stephen Frost c7cf9a2433 Add usebypassrls to pg_user and pg_shadow
The row level security patches didn't add the 'usebypassrls' columns to
the pg_user and pg_shadow views on the belief that they were deprecated,
but we havn't actually said they are and therefore we should include it.

This patch corrects that, adds missing documentation for rolbypassrls
into the system catalog page for pg_authid, along with the entries for
pg_user and pg_shadow, and cleans up a few other uses of 'row-level'
cases to be 'row level' in the docs.

Pointed out by Amit Kapila.

Catalog version bump due to system view changes.
2015-01-28 21:47:15 -05:00
Stephen Frost f8519a6a46 Clean up range-table building in copy.c
Commit 804b6b6db4 added the build of a
range table in copy.c to initialize the EState es_range_table since it
can be needed in error paths.  Unfortunately, that commit didn't
appreciate that some code paths might end up not initializing the rte
which is used to build the range table.

Fix that and clean up a couple others things along the way- build it
only once and don't explicitly set it on the !is_from path as it
doesn't make any sense there (cstate is palloc0'd, so this isn't an
issue from an initializing standpoint either).

The prior commit went back to 9.0, but this only goes back to 9.1 as
prior to that the range table build happens immediately after building
the RTE and therefore doesn't suffer from this issue.

Pointed out by Robert.
2015-01-28 17:42:28 -05:00
Stephen Frost 804b6b6db4 Fix column-privilege leak in error-message paths
While building error messages to return to the user,
BuildIndexValueDescription, ExecBuildSlotValueDescription and
ri_ReportViolation would happily include the entire key or entire row in
the result returned to the user, even if the user didn't have access to
view all of the columns being included.

Instead, include only those columns which the user is providing or which
the user has select rights on.  If the user does not have any rights
to view the table or any of the columns involved then no detail is
provided and a NULL value is returned from BuildIndexValueDescription
and ExecBuildSlotValueDescription.  Note that, for key cases, the user
must have access to all of the columns for the key to be shown; a
partial key will not be returned.

Further, in master only, do not return any data for cases where row
security is enabled on the relation and row security should be applied
for the user.  This required a bit of refactoring and moving of things
around related to RLS- note the addition of utils/misc/rls.c.

Back-patch all the way, as column-level privileges are now in all
supported versions.

This has been assigned CVE-2014-8161, but since the issue and the patch
have already been publicized on pgsql-hackers, there's no point in trying
to hide this commit.
2015-01-28 12:31:30 -05:00
Heikki Linnakangas acc2b1e843 Fix typo in comment. 2015-01-28 10:26:30 +02:00
Heikki Linnakangas 670bf71f65 Remove dead NULL-pointer checks in GiST code.
gist_poly_compress() and gist_circle_compress() checked for a NULL-pointer
key argument, but that was dead code; the gist code never passes a
NULL-pointer to the "compress" method.

This commit also removes a documentation note added in commit a0a3883,
about doing NULL-pointer checks in the "compress" method. It was added
based on the fact that some implementations were doing NULL-pointer
checks, but those checks were unnecessary in the first place.

The NULL-pointer check in gbt_var_same() function was also unnecessary.
The arguments to the "same" method come from the "compress", "union", or
"picksplit" methods, but none of them return a NULL pointer.

None of this is to be confused with SQL NULL values. Those are dealt with
by the gist machinery, and are never passed to the GiST opclass methods.

Michael Paquier
2015-01-28 10:03:58 +02:00
Tom Lane 1a2b2034d4 Fix NUMERIC field access macros to treat NaNs consistently.
Commit 145343534c arranged to store numeric
NaN values as short-header numerics, but the field access macros did not
get the memo: they thought only "SHORT" numerics have short headers.

Most of the time this makes no difference because we don't access the
weight or dscale of a NaN; but numeric_send does that.  As pointed out
by Andrew Gierth, this led to fetching uninitialized bytes.

AFAICS this could not have any worse consequences than that; in particular,
an unaligned stored numeric would have been detoasted by PG_GETARG_NUMERIC,
so that there's no risk of a fetch off the end of memory.  Still, the code
is wrong on its own terms, and it's not hard to foresee future changes that
might expose us to real risks.  So back-patch to all affected branches.
2015-01-27 12:06:31 -05:00
Robert Haas 168a809d4b Re-enable abbreviated keys on Windows.
Commit 1be4eb1b2d disabled this, but I
think the real problem here was fixed by commit
b181a91981 and commit
d060e07fa9.  So let's try re-enabling
it now and see what happens.
2015-01-26 14:28:14 -05:00
Tom Lane c58accd70b Fix volatile-safety issue in asyncQueueReadAllNotifications().
The "pos" variable is modified within PG_TRY and then referenced
within PG_CATCH, so for strict POSIX conformance it must be marked
volatile.  Superficially the code looked safe because pos's address
was taken, which was sufficient to force it into memory ... but it's
not sufficient to ensure that the compiler applies updates exactly
where the program text says to.  The volatility marking has to extend
into a couple of subroutines too, but I think that's probably a good
thing because the risk of out-of-order updates is mostly in those
subroutines not asyncQueueReadAllNotifications() itself.  In principle
the compiler could have re-ordered operations such that an error could
be thrown while "pos" had an incorrect value.

It's unclear how real the risk is here, but for safety back-patch
to all active branches.
2015-01-26 11:57:33 -05:00
Tom Lane c70f9e8988 Further cleanup of ReorderBufferCommit().
On closer inspection, we can remove the "volatile" qualifier on
"using_subtxn" so long as we initialize that before the PG_TRY block,
which there's no particularly good reason not to do.
Also, push the "change" variable inside the PG_TRY so as to remove
all question of whether it needs "volatile", and remove useless
early initializations of "snapshow_now" and "using_subtxn".
2015-01-25 22:49:56 -05:00
Tom Lane bf007a27ac Clean up assorted issues in ALTER SYSTEM coding.
Fix unsafe use of a non-volatile variable in PG_TRY/PG_CATCH in
AlterSystemSetConfigFile().  While at it, clean up a bundle of other
infelicities and outright bugs, including corner-case-incorrect linked list
manipulation, a poorly designed and worse documented parse-and-validate
function (which even included some randomly chosen hard-wired substitutes
for the specified elevel in one code path ... wtf?), direct use of open()
instead of fd.c's facilities, inadequate checking of write()'s return
value, and generally poorly written commentary.
2015-01-25 20:19:04 -05:00
Tom Lane fd496129d1 Clean up some mess in row-security patches.
Fix unsafe coding around PG_TRY in RelationBuildRowSecurity: can't change
a variable inside PG_TRY and then use it in PG_CATCH without marking it
"volatile".  In this case though it seems saner to avoid that by doing
a single assignment before entering the TRY block.

I started out just intending to fix that, but the more I looked at the
row-security code the more distressed I got.  This patch also fixes
incorrect construction of the RowSecurityPolicy cache entries (there was
not sufficient care taken to copy pass-by-ref data into the cache memory
context) and a whole bunch of sloppiness around the definition and use of
pg_policy.polcmd.  You can't use nulls in that column because initdb will
mark it NOT NULL --- and I see no particular reason why a null entry would
be a good idea anyway, so changing initdb's behavior is not the right
answer.  The internal value of '\0' wouldn't be suitable in a "char" column
either, so after a bit of thought I settled on using '*' to represent ALL.
Chasing those changes down also revealed that somebody wasn't paying
attention to what the underlying values of ACL_UPDATE_CHR etc really were,
and there was a great deal of lackadaiscalness in the catalogs.sgml
documentation for pg_policy and pg_policies too.

This doesn't pretend to be a complete code review for the row-security
stuff, it just fixes the things that were in my face while dealing with
the bugs in RelationBuildRowSecurity.
2015-01-24 16:16:22 -05:00
Tom Lane f8a4dd2e14 Fix unsafe coding in ReorderBufferCommit().
"iterstate" must be marked volatile since it's changed inside the PG_TRY
block and then used in the PG_CATCH stanza.  Noted by Mark Wilding of
Salesforce.  (We really need to see if we can't get the C compiler to warn
about this.)

Also, reset iterstate to NULL after the mainline ReorderBufferIterTXNFinish
call, to ensure the PG_CATCH block doesn't try to do that a second time.
2015-01-24 13:25:19 -05:00
Tom Lane 586dd5d6a5 Replace a bunch more uses of strncpy() with safer coding.
strncpy() has a well-deserved reputation for being unsafe, so make an
effort to get rid of nearly all occurrences in HEAD.

A large fraction of the remaining uses were passing length less than or
equal to the known strlen() of the source, in which case no null-padding
can occur and the behavior is equivalent to memcpy(), though doubtless
slower and certainly harder to reason about.  So just use memcpy() in
these cases.

In other cases, use either StrNCpy() or strlcpy() as appropriate (depending
on whether padding to the full length of the destination buffer seems
useful).

I left a few strncpy() calls alone in the src/timezone/ code, to keep it
in sync with upstream (the IANA tzcode distribution).  There are also a
few such calls in ecpg that could possibly do with more analysis.

AFAICT, none of these changes are more than cosmetic, except for the four
occurrences in fe-secure-openssl.c, which are in fact buggy: an overlength
source leads to a non-null-terminated destination buffer and ensuing
misbehavior.  These don't seem like security issues, first because no stack
clobber is possible and second because if your values of sslcert etc are
coming from untrusted sources then you've got problems way worse than this.
Still, it's undesirable to have unpredictable behavior for overlength
inputs, so back-patch those four changes to all active branches.
2015-01-24 13:05:42 -05:00
Robert Haas d1747571b6 Fix typos, update README.
Peter Geoghegan
2015-01-23 15:06:53 -05:00
Robert Haas 5cefbf5a6c Don't use abbreviated keys for the final merge pass.
When we write tuples out to disk and read them back in, the abbreviated
keys become non-abbreviated, because the readtup routines don't know
anything about abbreviation.  But without this fix, the rest of the
code still thinks the abbreviation-aware compartor should be used,
so chaos ensues.

Report by Andrew Gierth; patch by Peter Geoghegan.
2015-01-23 11:58:31 -05:00
Robert Haas 6a3c6ba0ba Add an explicit cast to Size to hyperloglog.c
MSVC generates a warning here; we hope this will make it happy.

Report by Michael Paquier.  Patch by David Rowley.
2015-01-23 11:44:51 -05:00
Tom Lane eb213acfe2 Prevent duplicate escape-string warnings when using pg_stat_statements.
contrib/pg_stat_statements will sometimes run the core lexer a second time
on submitted statements.  Formerly, if you had standard_conforming_strings
turned off, this led to sometimes getting two copies of any warnings
enabled by escape_string_warning.  While this is probably no longer a big
deal in the field, it's a pain for regression testing.

To fix, change the lexer so it doesn't consult the escape_string_warning
GUC variable directly, but looks at a copy in the core_yy_extra_type state
struct.  Then, pg_stat_statements can change that copy to disable warnings
while it's redoing the lexing.

It seemed like a good idea to make this happen for all three of the GUCs
consulted by the lexer, not just escape_string_warning.  There's not an
immediate use-case for callers to adjust the other two AFAIK, but making
it possible is easy enough and seems like good future-proofing.

Arguably this is a bug fix, but there doesn't seem to be enough interest to
justify a back-patch.  We'd not be able to back-patch exactly as-is anyway,
for fear of breaking ABI compatibility of the struct.  (We could perhaps
back-patch the addition of only escape_string_warning by adding it at the
end of the struct, where there's currently alignment padding space.)
2015-01-22 18:11:00 -05:00
Peter Eisentraut f5f2c2de16 Fix whitespace 2015-01-22 16:57:16 -05:00
Alvaro Herrera 972bf7d6f1 Tweak BRIN minmax operator class
In the union support proc, we were not checking the hasnulls flag of
value A early enough, so it could be skipped if the "allnulls" flag in
value B is set.  Also, a check on the allnulls flag of value "B" was
redundant, so remove it.

Also change inet_minmax_ops to not be the default opclass for type inet,
as a future inclusion operator class would be more useful and it's
pretty difficult to change default opclass for a datatype later on.
(There is no catversion bump for this catalog change; this shouldn't be
a problem.)

Extracted from a larger patch to add an "inclusion" operator class.

Author: Emre Hasegeli
2015-01-22 17:01:09 -03:00
Robert Haas d060e07fa9 Repair brain fade in commit b181a91981.
The split between which things need to happen in the C-locale case and
which needed to happen in the locale-aware case was a few bricks short
of a load.  Try to fix that.
2015-01-22 12:51:20 -05:00
Bruce Momjian 59367fdf97 adjust ACL owners for REASSIGN and ALTER OWNER TO
When REASSIGN and ALTER OWNER TO are used, both the object owner and ACL
list should be changed from the old owner to the new owner. This patch
fixes types, foreign data wrappers, and foreign servers to change their
ACL list properly;  they already changed owners properly.

BACKWARD INCOMPATIBILITY?

Report by Alexey Bashtanov
2015-01-22 12:36:55 -05:00
Robert Haas b181a91981 More fixes for abbreviated keys infrastructure.
First, when LC_COLLATE = C, bttext_abbrev_convert should use memcpy()
rather than strxfrm() to construct the abbreviated key, because the
authoritative comparator uses memcpy().  If we do anything else here,
we might get inconsistent answers, and the buildfarm says this risk
is not theoretical.  It should be faster this way, too.

Second, while I'm looking at bttext_abbrev_convert, convert a needless
use of goto into the loop it's trying to implement into an actual
loop.

Both of the above problems date to the original commit of abbreviated
keys, commit 4ea51cdfe8.

Third, fix a bogus assignment to tss->locale before tss is set up.
That's a new goof in commit b529b65d1b.
2015-01-22 11:58:58 -05:00
Robert Haas b529b65d1b Heavily refactor btsortsupport_worker.
Prior to commit 4ea51cdfe8, this function
only had one job, which was to decide whether we could avoid trampolining
through the fmgr layer when performing sort comparisons.  As of that
commit, it has a second job, which is to decide whether we can use
abbreviated keys.  Unfortunately, those two tasks are somewhat intertwined
in the existing coding, which is likely why neither Peter Geoghegan nor
I noticed prior to commit that this calls pg_newlocale_from_collation() in
cases where it didn't previously.  The buildfarm noticed, though.

To fix, rewrite the logic so that the decision as to which comparator to
use is more cleanly separated from the decision about abbreviation.
2015-01-22 10:54:16 -05:00
Robert Haas 1be4eb1b2d Disable abbreviated keys on Windows.
Most of the Windows buildfarm members (bowerbird, hamerkop, currawong,
jacana, brolga) are unhappy with yesterday's abbreviated keys patch,
although there are some (narwhal, frogmouth) that seem OK with it.
Since there's no obvious pattern to explain why some are working and
others are failing, just disable this across-the-board on Windows for
now.  This is a bit unfortunate since the optimization will be a big
win in some cases, but we can't leave the buildfarm broken.
2015-01-20 20:32:21 -05:00
Tom Lane 75b48e1fff Adjust "pgstat wait timeout" message to be a translatable LOG message.
Per discussion, change the log level of this message to be LOG not WARNING.
The main point of this change is to avoid causing buildfarm run failures
when the stats collector is exceptionally slow to respond, which it not
infrequently is on some of the smaller/slower buildfarm members.

This change does lose notice to an interactive user when his stats query
is looking at out-of-date stats, but the majority opinion (not necessarily
that of yours truly) is that WARNING messages would probably not get
noticed anyway on heavily loaded production systems.  A LOG message at
least ensures that the problem is recorded somewhere where bulk auditing
for the issue is possible.

Also, instead of an untranslated "pgstat wait timeout" message, provide
a translatable and hopefully more understandable message "using stale
statistics instead of current ones because stats collector is not
responding".  The original text was written hastily under the assumption
that it would never really happen in practice, which we now know to be
unduly optimistic.

Back-patch to all active branches, since we've seen the buildfarm issue
in all branches.
2015-01-19 23:01:33 -05:00
Andres Freund 2d115e47c8 Fix various shortcomings of the new PrivateRefCount infrastructure.
As noted by Tom Lane the improvements in 4b4b680c3d had the problem
that in some situations we searched, entered and modified entries in
the private refcount hash while holding a spinlock. I had tried to
keep the logic entirely local to PinBuffer_Locked(), but that's not
really possible given it's called with a spinlock held...

Besides being disadvantageous from a performance point of view, this
also has problems with error handling safety. If we failed inserting
an entry into the hashtable due to an out of memory error, we'd error
out with a held spinlock. Not good.

Change the way private refcounts are manipulated: Before a buffer can
be tracked an entry has to be reserved using
ReservePrivateRefCountEntry(); then, if a entry is not found using
GetPrivateRefCountEntry(), it can be entered with
NewPrivateRefCountEntry().

Also take advantage of the fact that PinBuffer_Locked() currently is
never called for buffers that already have been pinned by the current
backend and don't search the private refcount entries for preexisting
local pins. That results in a small, but measurable, performance
improvement.

Additionally make ReleaseBuffer() always call UnpinBuffer() for shared
buffers. That avoids duplicating work in an eventual UnpinBuffer()
call that already has been done in ReleaseBuffer() and also saves some
code.

Per discussion with Tom Lane.

Discussion: 15028.1418772313@sss.pgh.pa.us
2015-01-19 23:59:41 +01:00
Robert Haas 4ea51cdfe8 Use abbreviated keys for faster sorting of text datums.
This commit extends the SortSupport infrastructure to allow operator
classes the option to provide abbreviated representations of Datums;
in the case of text, we abbreviate by taking the first few characters
of the strxfrm() blob.  If the abbreviated comparison is insufficent
to resolve the comparison, we fall back on the normal comparator.
This can be much faster than the old way of doing sorting if the
first few bytes of the string are usually sufficient to resolve the
comparison.

There is the potential for a performance regression if all of the
strings to be sorted are identical for the first 8+ characters and
differ only in later positions; therefore, the SortSupport machinery
now provides an infrastructure to abort the use of abbreviation if
it appears that abbreviation is producing comparatively few distinct
keys.  HyperLogLog, a streaming cardinality estimator, is included in
this commit and used to make that determination for text.

Peter Geoghegan, reviewed by me.
2015-01-19 15:28:27 -05:00
Robert Haas 1605291b6c Typo fix.
Etsuro Fujita
2015-01-19 11:36:48 -05:00
Robert Haas 9d54b93239 BRIN typo fix.
Amit Langote
2015-01-19 08:34:29 -05:00
Tom Lane 75df6dc083 Fix ancient thinko in default table rowcount estimation.
The code used sizeof(ItemPointerData) where sizeof(ItemIdData) is correct,
since we're trying to account for a tuple's line pointer.  Spotted by
Tomonari Katsumata (bug #12584).

Although this mistake is of very long standing, no back-patch, since it's
a relatively harmless error and changing it would risk changing default
planner behavior in stable branches.  (I don't see any change in regression
test outputs here, but the buildfarm may think differently.)
2015-01-18 17:04:11 -05:00
Andres Freund ff44fba46c Replace walsender's latch with the general shared latch.
Relying on the normal shared latch simplifies interrupt/signal
handling because we can rely on all signal handlers setting the proc
latch. That in turn allows us to avoid the use of
ImmediateInterruptOK, which arguably isn't correct because
WaitLatchOrSocket isn't declared to be immediately interruptible.

Also change sections that wait on the walsender's latch to notice
interrupts quicker/more reliably and make them more consistent with
each other.

This is part of a larger "get rid of ImmediateInterruptOK" series.

Discussion: 20150115020335.GZ5245@awork2.anarazel.de
2015-01-17 13:00:42 +01:00
Tom Lane 20af53d719 Show sort ordering options in EXPLAIN output.
Up to now, EXPLAIN has contented itself with printing the sort expressions
in a Sort or Merge Append plan node.  This patch improves that by
annotating the sort keys with COLLATE, DESC, USING, and/or NULLS FIRST/LAST
whenever nondefault sort ordering options are used.  The output is now a
reasonably close approximation of an ORDER BY clause equivalent to the
plan's ordering.

Marius Timmer, Lukas Kreft, and Arne Scheffer; reviewed by Mike Blackwell.
Some additional hacking by me.
2015-01-16 18:19:00 -05:00
Heikki Linnakangas 9402869160 Advance backend's advertised xmin more aggressively.
Currently, a backend will reset it's PGXACT->xmin value when it doesn't
have any registered snapshots left. That covered the common case that a
transaction in read committed mode runs several queries, one after each
other, as there would be no snapshots active between those queries.
However, if you hold cursors across each of the query, we didn't get a
chance to reset xmin.

To make that better, keep all the registered snapshots in a pairing heap,
ordered by xmin so that it's always quick to find the snapshot with the
smallest xmin. That allows us to advance PGXACT->xmin whenever the oldest
snapshot is deregistered, even if there are others still active.

Per discussion originally started by Jeff Davis back in 2009 and more
recently by Robert Haas.
2015-01-17 01:15:23 +02:00
Tom Lane 779fdcdeee Improve new caching logic in tbm_add_tuples().
For no significant extra complexity, we can cache knowledge that the
target page is lossy, and save a hash_search per iteration in that
case as well.  This probably makes little difference, since the extra
rechecks that must occur when pages are lossy are way more expensive
than anything we can save here ... but we might as well do it if we're
going to cache anything.
2015-01-16 13:28:30 -05:00
Andres Freund f5ae3ba482 Make tbm_add_tuples more efficient by caching the last acccessed page.
When adding a large number of tuples to a TID bitmap using
tbm_add_tuples() sometimes a lot of time was spent looking up a page's
entry in the bitmap's internal hashtable.

Improve efficiency by caching the last accessed page, while iterating
over the passed in tuples, hoping consecutive tuples will often be on
the same page.  In many cases that's a good bet, and in the rest the
added overhead isn't big.

Discussion: 54479A85.8060309@sigaev.ru

Author: Teodor Sigaev
Reviewed-By: David Rowley
2015-01-16 17:47:59 +01:00
Tom Lane c480cb9d24 Fix use-of-already-freed-memory problem in EvalPlanQual processing.
Up to now, the "child" executor state trees generated for EvalPlanQual
rechecks have simply shared the ResultRelInfo arrays used for the original
execution tree.  However, this leads to dangling-pointer problems, because
ExecInitModifyTable() is all too willing to scribble on some fields of the
ResultRelInfo(s) even when it's being run in one of those child trees.
This trashes those fields from the perspective of the parent tree, because
even if the generated subtree is logically identical to what was in use in
the parent, it's in a memory context that will go away when we're done
with the child state tree.

We do however want to share information in the direction from the parent
down to the children; in particular, fields such as es_instrument *must*
be shared or we'll lose the stats arising from execution of the children.
So the simplest fix is to make a copy of the parent's ResultRelInfo array,
but not copy any fields back at end of child execution.

Per report from Manuel Kniep.  The added isolation test is based on his
example.  In an unpatched memory-clobber-enabled build it will reliably
fail with "ctid is NULL" errors in all branches back to 9.1, as a
consequence of junkfilter->jf_junkAttNo being overwritten with $7f7f.
This test cannot be run as-is before that for lack of WITH syntax; but
I have no doubt that some variant of this problem can arise in older
branches, so apply the code change all the way back.
2015-01-15 18:52:58 -05:00
Heikki Linnakangas 49b04188f8 Fix thinko in re-setting wal_log_hints flag from a parameter-change record.
The flag is supposed to be copied from the record. Same issue with
track_commit_timestamps, but that's master-only.

Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was
added.
2015-01-15 20:52:41 +02:00
Tom Lane 8e166e164c Rearrange explain.c's API so callers need not embed sizeof(ExplainState).
The folly of the previous arrangement was just demonstrated: there's no
convenient way to add fields to ExplainState without breaking ABI, even
if callers have no need to touch those fields.  Since we might well need
to do that again someday in back branches, let's change things so that
only explain.c has to have sizeof(ExplainState) compiled into it.  This
costs one extra palloc() per EXPLAIN operation, which is surely pretty
negligible.
2015-01-15 13:39:33 -05:00
Tom Lane a5cd70dcbc Improve performance of EXPLAIN with large range tables.
As of 9.3, ruleutils.c goes to some lengths to ensure that table and column
aliases used in its output are unique.  Of course this takes more time than
was required before, which in itself isn't fatal.  However, EXPLAIN was set
up so that recalculation of the unique aliases was repeated for each
subexpression printed in a plan.  That results in O(N^2) time and memory
consumption for large plan trees, which did not happen in older branches.

Fortunately, the expensive work is the same across a whole plan tree,
so there is no need to repeat it; we can do most of the initialization
just once per query and re-use it for each subexpression.  This buys
back most (not all) of the performance loss since 9.2.

We need an extra ExplainState field to hold the precalculated deparse
context.  That's no problem in HEAD, but in the back branches, expanding
sizeof(ExplainState) seems risky because third-party extensions might
have local variables of that struct type.  So, in 9.4 and 9.3, introduce
an auxiliary struct to keep sizeof(ExplainState) the same.  We should
refactor the APIs to avoid such local variables in future, but that's
material for a separate HEAD-only commit.

Per gripe from Alexey Bashtanov.  Back-patch to 9.3 where the issue
was introduced.
2015-01-15 13:18:12 -05:00
Andres Freund 59f71a0d0b Add a default local latch for use in signal handlers.
To do so, move InitializeLatchSupport() into the new common process
initialization functions, and add a new global variable MyLatch.

MyLatch is usable as soon InitPostmasterChild() has been called
(i.e. very early during startup). Initially it points to a process
local latch that exists in all processes. InitProcess/InitAuxiliaryProcess
then replaces that local latch with PGPROC->procLatch. During shutdown
the reverse happens.

This is primarily advantageous for two reasons: For one it simplifies
dealing with the shared process latch, especially in signal handlers,
because instead of having to check for MyProc, MyLatch can be used
unconditionally. For another, a later patch that makes FEs/BE
communication use latches, now can rely on the existence of a latch,
even before having gone through InitProcess.

Discussion: 20140927191243.GD5423@alap3.anarazel.de
2015-01-14 18:45:22 +01:00
Andres Freund 0139dea8f1 Remove some dead IsUnderPostmaster code from bootstrap.c.
Since commit 626eb02198 has introduced the auxiliary process
infrastructure, bootstrap_signals() was never used when forked from
postmaster.

Remove the IsUnderPostmaster specific code, and add a appropriate
assertion.
2015-01-14 00:37:02 +01:00
Andres Freund 31c453165b Commonalize process startup code.
Move common code, that was duplicated in every postmaster child/every
standalone process, into two functions in miscinit.c.  Not only does
that already result in a fair amount of net code reduction but it also
makes it much easier to remove more duplication in the future. The
prime motivation wasn't code deduplication though, but easier addition
of new common code.
2015-01-14 00:33:14 +01:00
Andres Freund 2be82dcf17 Make logging_collector=on work with non-windows EXEC_BACKEND again.
Commit b94ce6e80 reordered postmaster's startup sequence so that the
tempfile directory is only cleaned up after all the necessary state
for pg_ctl is collected.  Unfortunately the chosen location is after
the syslogger has been started; which normally is fine, except for
!WIN32 EXEC_BACKEND builds, which pass information to children via
files in the temp directory.

Move the call to RemovePgTempFiles() to just before the syslogger has
started. That's the first child we fork.

Luckily EXEC_BACKEND is pretty much only used by endusers on windows,
which has a separate method to pass information to children. That
means the real world impact of this bug is very small.

Discussion: 20150113182344.GF12272@alap3.anarazel.de

Backpatch to 9.1, just as the previous commit was.
2015-01-14 00:14:53 +01:00
Heikki Linnakangas e922a13058 Spell the X072 feature correctly, was missing "with".
Also use lower-case for a few more features, to be consistent with the
others and with the SQL spec.
2015-01-13 16:08:55 +02:00
Andres Freund 14e8803f10 Add barriers to the latch code.
Since their introduction latches have required barriers in SetLatch
and ResetLatch - but when they were introduced there wasn't any
barrier abstraction. Instead latches were documented to rely on the
callsites to provide barrier semantics.

Now that the barrier support looks halfway complete, add the necessary
barriers to both latch implementations.

Also remove a now superflous lock acquisition from syncrep.c and a
superflous (and insufficient) barrier from freelist.c. There might be
other cases that can now be simplified, but those are the only ones
I've seen on a quick scan.

We might want to backpatch this at some later point, but right now the
barrier infrastructure in the backbranches isn't totally on par with
master.

Discussion: 20150112154026.GB2092@awork2.anarazel.de
2015-01-13 12:58:43 +01:00
Andres Freund 4bad60e3fd Allow latches to wait for socket writability without waiting for readability.
So far WaitLatchOrSocket() required to pass in WL_SOCKET_READABLE as
that solely was used to indicate error conditions, like EOF. Waiting
for WL_SOCKET_WRITEABLE would have meant to busy wait upon socket
errors.

Adjust the API to signal errors by returning the socket as readable,
writable or both, depending on WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE
being specified.  It would arguably be nicer to return WL_SOCKET_ERROR
but that's not possible on platforms and would probably also result in
more complex callsites.

This previously had explicitly been forbidden in e42a21b9e6, as
there was no strong use case at that point. We now are looking into
making FE/BE communication use latches, so changing this makes sense.

There also are some portability concerns because there cases of older
platforms where select(2) is known to, in violation of POSIX, not
return a socket as writable after the peer has closed it.  So far the
platforms where that's the case provide a working poll(2). If we find
one where that's not the case, we'll need to add a workaround for that
platform.

Discussion: 20140927191243.GD5423@alap3.anarazel.de
Reviewed-By: Heikki Linnakangas, Noah Misch
2015-01-13 12:58:43 +01:00
Alvaro Herrera d126e1e95f Tweak heapam's rmgr desc output slightly
Some spaces were missing, and putting the affected tuple offset first in
the lock cases instead of the locking data makes more sense.

No backpatch since this is cosmetic and surrounding code has changed.
2015-01-12 16:09:16 -03:00
Alvaro Herrera 5c5ffee80f Fix get_object_address argument type for extension statement
Commit 3f88672a4 neglected to update the AlterExtensionContentsStmt
production in the grammar to use TypeName to represent types when
passing objects to get_object_address.

Reported as a pg_upgrade failure by Jeff Janes.
2015-01-12 15:32:48 -03:00
Tom Lane 1f9bf05e53 Use correct text domain for errcontext() appearing within ereport().
The mechanism added in commit dbdf9679d7
for associating the correct translation domain with errcontext strings
potentially fails in cases where errcontext() is used within an ereport()
macro.  Such usage was not originally envisioned for errcontext(), but we
do have a few places that do it.  In this situation, the intended comma
expression becomes just a couple of arguments to errfinish(), which the
compiler might choose to evaluate right-to-left.

Fortunately, in such cases the textdomain for the errcontext string must
be the same as for the surrounding ereport.  So we can fix this by letting
errstart initialize context_domain along with domain; then it will have
the correct value no matter which order the calls occur in.  (Note that
error stack callback functions are not invoked until errfinish, so normal
usage of errcontext won't affect what happens for errcontext calls within
the ereport macro.)

In passing, make sure that errcontext calls within the main backend set
context_domain to something non-NULL.  This isn't a live bug because
NULL would select the current textdomain() setting which should be the
right thing anyway --- but it seems better to handle this completely
consistently with the regular domain field.

Per report from Dmitry Voronin.  Backpatch to 9.3; before that, there
wasn't any attempt to ensure that errcontext strings were translated
in an appropriate domain.
2015-01-12 12:40:29 -05:00
Stephen Frost 1bf4a84d0f Skip dead backends in MinimumActiveBackends
Back in ed0b409, PGPROC was split and moved to static variables in
procarray.c, with procs in ProcArrayStruct replaced by an array of
integers representing process numbers (pgprocnos), with -1 indicating a
dead process which has yet to be removed.  Access to procArray is
generally done under ProcArrayLock and therefore most code does not have
to concern itself with -1 entries.

However, MinimumActiveBackends intentionally does not take
ProcArrayLock, which means it has to be extra careful when accessing
procArray.  Prior to ed0b409, this was handled by checking for a NULL
in the pointer array, but that check was no longer valid after the
split.  Coverity pointed out that the check could never happen and so
it was removed in 5592eba.  That didn't make anything worse, but it
didn't fix the issue either.

The correct fix is to check for pgprocno == -1 and skip over that entry
if it is encountered.

Back-patch to 9.2, since there can be attempts to access the arrays
prior to their start otherwise.  Note that the changes prior to 9.4 will
look a bit different due to the change in 5592eba.

Note that MinimumActiveBackends only returns a bool for heuristic
purposes and any pre-array accesses are strictly read-only and so there
is no security implication and the lack of fields complaints indicates
it's very unlikely to run into issues due to this.

Pointed out by Noah.
2015-01-12 11:31:57 -05:00
Andres Freund de6429a8fd Provide a generic fallback for pg_compiler_barrier using an extern function.
If the compiler/arch combination does not provide compiler barriers,
provide a fallback. That fallback simply consists out of a function
call into a externally defined function.  That should guarantee
compiler barrierer semantics except for compilers that do inter
translation unit/global optimization - those better provide an actual
compiler barrier.

Hopefully this fixes Tom's report of linker failures due to
pg_compiler_barrier_impl not being provided.

I'm not backpatching this commit as it builds on the new atomics
infrastructure. If we decide an equivalent fix needs to be
backpatched, I'll do so in a separate commit.

Discussion: 27746.1420930690@sss.pgh.pa.us

Per report from Tom Lane.
2015-01-11 01:15:29 +01:00
Stephen Frost c4fda14845 Fix typo in execMain.c
Wee -> We.

Pointed out by Etsuro Fujita.
2015-01-09 11:07:35 -05:00
Alvaro Herrera 045c68ad21 xlogreader.c: Fix report_invalid_record translatability flag
For some reason I overlooked in GETTEXT_TRIGGERS that the right argument
be read by gettext in 7fcbf6a405.  This
will drop the translation percentages for the backend all the way back
to 9.3 ...

Problem reported by Heikki.
2015-01-09 12:34:25 -03:00
Andres Freund f454144a34 Remove comment that was intended to have been removed before commit.
Noticed by Amit Kapila
2015-01-08 13:16:31 +01:00
Andres Freund 17eaae9897 Fix logging of pages skipped due to pins during vacuum.
The new logging introduced in 35192f06 made the incorrect assumption
that scan_all vacuums would always wait for buffer pins; but they only
do so if the page actually needs to be frozen.

Fix that inaccuracy by removing the difference in log output based on
scan_all and just always remove the same message.  I chose to keep the
split log message from the original commit for now, it seems likely
that it'll be of use in the future.

Also merge the line about buffer pins in autovacuum's log output into
the existing "pages: ..." line. It seems odd to have a separate line
about pins, without the "topic: " prefix others have.

Also rename the new 'pinned_pages' variable to 'pinskipped_pages'
because it actually tracks the number of pages that could *not* be
pinned.

Discussion: 20150104005324.GC9626@awork2.anarazel.de
2015-01-08 12:57:09 +01:00
Noah Misch 2048e5b881 On Darwin, refuse postmaster startup when multithreaded.
The previous commit introduced its report at LOG level to avoid
surprises at minor release upgrade time.  Compel users deploying the
next major release to also deploy the reported workaround.
2015-01-07 22:46:59 -05:00
Noah Misch 894459e59f On Darwin, detect and report a multithreaded postmaster.
Darwin --enable-nls builds use a substitute setlocale() that may start a
thread.  Buildfarm member orangutan experienced BackendList corruption
on account of different postmaster threads executing signal handlers
simultaneously.  Furthermore, a multithreaded postmaster risks undefined
behavior from sigprocmask() and fork().  Emit LOG messages about the
problem and its workaround.  Back-patch to 9.0 (all supported versions).
2015-01-07 22:35:44 -05:00
Noah Misch 6fdba8ceb0 Always set the six locale category environment variables in main().
Typical server invocations already achieved that.  Invalid locale
settings in the initial postmaster environment interfered, as could
malloc() failure.  Setting "LC_MESSAGES=pt_BR.utf8 LC_ALL=invalid" in
the postmaster environment will now choose C-locale messages, not
Brazilian Portuguese messages.  Most localized programs, including all
PostgreSQL frontend executables, do likewise.  Users are unlikely to
observe changes involving locale categories other than LC_MESSAGES.
CheckMyDatabase() ensures that we successfully set LC_COLLATE and
LC_CTYPE; main() sets the remaining three categories to locale "C",
which almost cannot fail.  Back-patch to 9.0 (all supported versions).
2015-01-07 22:34:57 -05:00
Noah Misch e415b469b3 Reject ANALYZE commands during VACUUM FULL or another ANALYZE.
vacuum()'s static variable handling makes it non-reentrant; an ensuing
null pointer deference crashed the backend.  Back-patch to 9.0 (all
supported versions).
2015-01-07 22:33:58 -05:00
Heikki Linnakangas 1e78d81e88 Don't open a WAL segment for writing at end of recovery.
Since commit ba94518a, we used XLogFileOpen to open the next segment for
writing, but if the end-of-recovery happens exactly at a segment boundary,
the new segment might not exist yet. (Before ba94518a, XLogFileOpen was
correct, because we would open the previous segment if the switch happened
at the boundary.)

Instead of trying to create it if necessary, it's simpler to not bother
opening the segment at all. XLogWrite() will open or create it soon anyway,
after writing the checkpoint or end-of-recovery record.

Reported by Andres Freund.
2015-01-07 16:20:20 +02:00
Peter Eisentraut 79af9a1d26 Fix namespace handling in xpath function
Previously, the xml value resulting from an xpath query would not have
namespace declarations if the namespace declarations were attached to
an ancestor element in the input xml value.  That means the output value
was not correct XML.  Fix that by running the result value through
xmlCopyNode(), which produces the correct namespace declarations.

Author: Ali Akbar <the.apaan@gmail.com>
2015-01-06 23:06:13 -05:00
Andres Freund 3fabed0705 Correctly handle relcache invalidation corner case during logical decoding.
When using a historic snapshot for logical decoding it can validly
happen that a relation that's in the relcache isn't visible to that
historic snapshot.  E.g. if a newly created relation is referenced in
the query that uses the SQL interface for logical decoding and a
sinval reset occurs.

The earlier commit that fixed the error handling for that corner case
already improves the situation as a ERROR is better than hitting an
assertion... But it's obviously not good enough.  So additionally
allow that case without an error if a historic snapshot is set up -
that won't allow an invalid entry to stay in the cache because it's a)
already marked invalid and will thus be rebuilt during the next access
b) the syscaches will be reset at the end of decoding.

There might be prettier solutions to handle this case, but all that we
could think of so far end up being much more complex than this quite
simple fix.

This fixes the assertion failures reported by the buildfarm (markhor,
tick, leech) after the introduction of new regression tests in
89fd41b390. The failure there weren't actually directly caused by
CLOBBER_CACHE_ALWAYS but the extraordinary long runtimes due to it
lead to sinval resets triggering the behaviour.

Discussion: 22459.1418656530@sss.pgh.pa.us

Backpatch to 9.4 where logical decoding was introduced.
2015-01-07 00:19:37 +01:00
Andres Freund 31912d01d8 Improve relcache invalidation handling of currently invisible relations.
The corner case where a relcache invalidation tried to rebuild the
entry for a referenced relation but couldn't find it in the catalog
wasn't correct.

The code tried to RelationCacheDelete/RelationDestroyRelation the
entry. That didn't work when assertions are enabled because the latter
contains an assertion ensuring the refcount is zero. It's also more
generally a bad idea, because by virtue of being referenced somebody
might actually look at the entry, which is possible if the error is
trapped and handled via a subtransaction abort.

Instead just error out, without deleting the entry. As the entry is
marked invalid, the worst that can happen is that the invalid (and at
some point unused) entry lingers in the relcache.

Discussion: 22459.1418656530@sss.pgh.pa.us

There should be no way to hit this case < 9.4 where logical decoding
introduced a bug that can hit this. But since the code for handling
the corner case is there it should do something halfway sane, so
backpatch all the the way back.  The logical decoding bug will be
handled in a separate commit.
2015-01-07 00:18:00 +01:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Fujii Masao 9f1d7313aa Fix typo in comment.
Report by Amit Kapila
2015-01-05 16:35:26 +09:00
Alvaro Herrera d5e3d1e969 Fix thinko in lock mode enum
Commit 0e5680f473 contained a thinko
mixing LOCKMODE with LockTupleMode.  This caused misbehavior in the case
where a tuple is marked with a multixact with at most a FOR SHARE lock,
and another transaction tries to acquire a FOR NO KEY EXCLUSIVE lock;
this case should block but doesn't.

Include a new isolation tester spec file to explicitely try all the
tuple lock combinations; without the fix it shows the problem:

    starting permutation: s1_begin s1_lcksvpt s1_tuplock2 s2_tuplock3 s1_commit
    step s1_begin: BEGIN;
    step s1_lcksvpt: SELECT * FROM multixact_conflict FOR KEY SHARE; SAVEPOINT foo;
    a

    1
    step s1_tuplock2: SELECT * FROM multixact_conflict FOR SHARE;
    a

    1
    step s2_tuplock3: SELECT * FROM multixact_conflict FOR NO KEY UPDATE;
    a

    1
    step s1_commit: COMMIT;

With the fixed code, step s2_tuplock3 blocks until session 1 commits,
which is the correct behavior.

All other cases behave correctly.

Backpatch to 9.3, like the commit that introduced the problem.
2015-01-04 15:48:29 -03:00
Andres Freund 2ea95959af Add error handling for failing fstat() calls in copy.c.
These calls are pretty much guaranteed not to fail unless something
has gone horribly wrong, and even in that case we'd just error out a
short time later.  But since several code checkers complain about the
missing check it seems worthwile to fix it nonetheless.

Pointed out by Coverity.
2015-01-04 16:47:23 +01:00
Andres Freund 14570c2828 Remove superflous variable from xlogreader's XLogFindNextRecord().
Pointed out by Coverity.

Since this is mere, and debatable, cosmetics I'm not backpatching
this.
2015-01-04 15:35:46 +01:00
Andres Freund 2c0a485896 Prevent WAL files created by pg_basebackup -x/X from being archived again.
WAL (and timeline history) files created by pg_basebackup did not
maintain the new base backup's archive status. That's currently not a
problem if the new node is used as a standby - but if that node is
promoted all still existing files can get archived again.  With a high
wal_keep_segment settings that can happen a significant time later -
which is quite confusing.

Change both the backend (for the -x/-X fetch case) and pg_basebackup
(for -X stream) itself to always mark WAL/timeline files included in
the base backup as .done. That's in line with walreceiver.c doing so.

The verbosity of the pg_basebackup changes show pretty clearly that it
needs some refactoring, but that'd result in not be backpatchable
changes.

Backpatch to 9.1 where pg_basebackup was introduced.

Discussion: 20141205002854.GE21964@awork2.anarazel.de
2015-01-03 20:54:12 +01:00
Andres Freund ccb161b66a Add pg_string_endswith as the start of a string helper library in src/common.
Backpatch to 9.3 where src/common was introduce, because a bugfix that
needs to be backpatched, requires the function. Earlier branches will
have to duplicate the code.
2015-01-03 20:54:12 +01:00
Tom Lane d6657d2a10 Treat negative values of recovery_min_apply_delay as having no effect.
At one point in the development of this feature, it was claimed that
allowing negative values would be useful to compensate for timezone
differences between master and slave servers.  That was based on a mistaken
assumption that commit timestamps are recorded in local time; but of course
they're in UTC.  Nor is a negative apply delay likely to be a sane way of
coping with server clock skew.  However, the committed patch still treated
negative delays as doing something, and the timezone misapprehension
survived in the user documentation as well.

If recovery_min_apply_delay were a proper GUC we'd just set the minimum
allowed value to be zero; but for the moment it seems better to treat
negative settings as if they were zero.

In passing do some extra wordsmithing on the parameter's documentation,
including correcting a second misstatement that the parameter affects
processing of Restore Point records.

Issue noted by Michael Paquier, who also provided the code patch; doc
changes by me.  Back-patch to 9.4 where the feature was introduced.
2015-01-03 13:14:03 -05:00
Tom Lane a486841eb1 Print more information about getObjectIdentityParts() failures.
This might help us debug what's happening on some buildfarm members.

In passing, reduce the message from ereport to elog --- it doesn't seem
like this should be a user-facing case, so not worth translating.
2014-12-31 14:44:43 -05:00
Alvaro Herrera ba66c9d068 Add missing pstrdup calls
The one for the OCLASS_COLLATION case was noticed by
CLOBBER_CACHE_ALWAYS buildfarm members; the others I spotted by manual
code inspection.

Also remove a redundant check.
2014-12-31 13:19:40 -03:00
Alvaro Herrera 72dd233d3e pg_event_trigger_dropped_objects: Add name/args output columns
These columns can be passed to pg_get_object_address() and used to
reconstruct the dropped objects identities in a remote server containing
similar objects, so that the drop can be replicated.

Reviewed by Stephen Frost, Heikki Linnakangas, Abhijit Menon-Sen, Andres
Freund.
2014-12-30 17:41:46 -03:00
Alvaro Herrera a676201490 Add pg_identify_object_as_address
This function returns object type and objname/objargs arrays, which can
be passed to pg_get_object_address.  This is especially useful because
the textual representation can be copied to a remote server in order to
obtain the corresponding OID-based address.  In essence, this function
is the inverse of recently added pg_get_object_address().

Catalog version bumped due to the addition of the new function.

Also add docs to pg_get_object_address.
2014-12-30 15:41:50 -03:00
Alvaro Herrera 3f88672a4e Use TypeName to represent type names in certain commands
In COMMENT, DROP, SECURITY LABEL, and the new pg_get_object_address
function, we were representing types as a list of names, same as other
objects; but types are special objects that require their own
representation to be totally accurate.  In the original COMMENT code we
had a note about fixing it which was lost in the course of c10575ff00.
Change all those places to use TypeName instead, as suggested by that
comment.

Right now the original coding doesn't cause any bugs, so no backpatch.
It is more problematic for proposed future code that operate with object
addresses from the SQL interface; type details such as array-ness are
lost when working with the degraded representation.

Thanks to Petr Jelínek and Dimitri Fontaine for offlist help on finding
a solution to a shift/reduce grammar conflict.
2014-12-30 13:57:23 -03:00
Tom Lane 9a11df1449 Remove duplicate assignment in new pg_get_object_address() function.
Noted by Coverity.
2014-12-28 12:03:32 -05:00
Alvaro Herrera 6630420fc9 Restrict name list len for domain constraints
This avoids an ugly-looking "cache lookup failure" message.

Ugliness pointed out by Andres Freund.
2014-12-26 14:31:37 -03:00
Alvaro Herrera 0e5680f473 Grab heavyweight tuple lock only before sleeping
We were trying to acquire the lock even when we were subsequently
not sleeping in some other transaction, which opens us up unnecessarily
to deadlocks.  In particular, this is troublesome if an update tries to
lock an updated version of a tuple and finds itself doing EvalPlanQual
update chain walking; more than two sessions doing this concurrently
will find themselves sleeping on each other because the HW tuple lock
acquisition in heap_lock_tuple called from EvalPlanQualFetch races with
the same tuple lock being acquired in heap_update -- one of these
sessions sleeps on the other one to finish while holding the tuple lock,
and the other one sleeps on the tuple lock.

Per trouble report from Andrew Sackville-West in
http://www.postgresql.org/message-id/20140731233051.GN17765@andrew-ThinkPad-X230

His scenario can be simplified down to a relatively simple
isolationtester spec file which I don't include in this commit; the
reason is that the current isolationtester is not able to deal with more
than one blocked session concurrently and it blocks instead of raising
the expected deadlock.  In the future, if we improve isolationtester, it
would be good to include the spec file in the isolation schedule.  I
posted it in
http://www.postgresql.org/message-id/20141212205254.GC1768@alvh.no-ip.org

Hat tip to Mark Kirkwood, who helped diagnose the trouble.
2014-12-26 13:52:27 -03:00
Andres Freund 740a4ec7f4 Blindly fix a dtrace probe in lwlock.c for a removed local variable.
Per buildfarm member locust.
2014-12-25 19:48:46 +01:00
Tom Lane 966115c305 Temporarily revert "Move pg_lzcompress.c to src/common."
This reverts commit 60838df922.
That change needs a bit more thought to be workable.  In view of
the potentially machine-dependent stuff that went in today,
we need all of the buildfarm to be testing those other changes.
2014-12-25 13:22:55 -05:00
Andres Freund d72731a704 Lockless StrategyGetBuffer clock sweep hot path.
StrategyGetBuffer() has proven to be a bottleneck in a number of
buffer acquisition heavy workloads. To some degree this has already
been alleviated by 5d7962c6, but it still can be quite a heavy
bottleneck.  The problem is that in unfortunate usage patterns a
single StrategyGetBuffer() call will have to look at a large number of
buffers - in turn making it likely that the process will be put to
sleep while still holding the spinlock.

Replace most of the usage of the buffer_strategy_lock spinlock for the
clock sweep by a atomic nextVictimBuffer variable. That variable,
modulo NBuffers, is the current hand of the clock sweep. The buffer
clock-sweep then only needs to acquire the spinlock after a
wraparound. And even then only in the process that did the wrapping
around. That alleviates nearly all the contention on the relevant
spinlock, although significant contention on the cacheline can still
exist.

Reviewed-By: Robert Haas and Amit Kapila

Discussion: 20141010160020.GG6670@alap3.anarazel.de,
    20141027133218.GA2639@awork2.anarazel.de
2014-12-25 18:26:25 +01:00
Andres Freund ab5194e6f6 Improve LWLock scalability.
The old LWLock implementation had the problem that concurrent lock
acquisitions required exclusively acquiring a spinlock. Often that
could lead to acquirers waiting behind the spinlock, even if the
actual LWLock was free.

The new implementation doesn't acquire the spinlock when acquiring the
lock itself. Instead the new atomic operations are used to atomically
manipulate the state. Only the waitqueue, used solely in the slow
path, is still protected by the spinlock. Check lwlock.c's header for
an explanation about the used algorithm.

For some common workloads on larger machines this can yield
significant performance improvements. Particularly in read mostly
workloads.

Reviewed-By: Amit Kapila and Robert Haas
Author: Andres Freund

Discussion: 20130926225545.GB26663@awork2.anarazel.de
2014-12-25 17:24:30 +01:00
Andres Freund 7882c3b0b9 Convert the PGPROC->lwWaitLink list into a dlist instead of open coding it.
Besides being shorter and much easier to read it changes the logic in
LWLockRelease() to release all shared lockers when waking up any. This
can yield some significant performance improvements - and the fairness
isn't really much worse than before, as we always allowed new shared
lockers to jump the queue.
2014-12-25 17:24:30 +01:00
Andres Freund 570bd2b3fd Add capability to suppress CONTEXT: messages to elog machinery.
Hiding context messages usually is not a good idea - except for rather
verbose debugging/development utensils like LOG_DEBUG. There the
amount of repeated context messages just bloat the log without adding
information.
2014-12-25 17:24:30 +01:00
Fujii Masao 4a5593197b Remove duplicate include of slot.h.
Back-patch to 9.4, where this problem was added.
2014-12-25 22:47:53 +09:00
Fujii Masao 60838df922 Move pg_lzcompress.c to src/common.
Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. pglz_decompress contained a call
to elog to emit an error message in case of corrupted data. This function
is changed to return a status code to let its callers return an error instead.

This commit is required for upcoming WAL compression feature so that
the WAL reader facility can decompress the WAL data by using pglz_decompress.

Michael Paquier
2014-12-25 20:46:14 +09:00
Fujii Masao 3b6ca123b5 Remove unused fields from ReindexStmt.
fe263d1 changed the REINDEX logic so that those fields are not used at all,
but forgot to remove them.

Sawada Masahiko
2014-12-24 21:40:47 +09:00
Andres Freund cd5ebe1edd Suppress MSVC warning in typeStringToTypeName function.
MSVC doesn't realize ereport(ERROR) doesn't return.

David Rowley
2014-12-24 12:30:08 +01:00
Alvaro Herrera a609d96778 Revert "Use a bitmask to represent role attributes"
This reverts commit 1826987a46.

The overall design was deemed unacceptable, in discussion following the
previous commit message; we might find some parts of it still
salvageable, but I don't want to be on the hook for fixing it, so let's
wait until we have a new patch.
2014-12-23 15:35:49 -03:00
Alvaro Herrera d7ee82e50f Add SQL-callable pg_get_object_address
This allows access to get_object_address from SQL, which is useful to
obtain OID addressing information from data equivalent to that emitted
by the parser.  This is necessary infrastructure of a project to let
replication systems propagate object dropping events to remote servers,
where the schema might be different than the server originating the
DROP.

This patch also adds support for OBJECT_DEFAULT to get_object_address;
that is, it is now possible to refer to a column's default value.

Catalog version bumped due to the new function.

Reviewed by Stephen Frost, Heikki Linnakangas, Robert Haas, Andres
Freund, Abhijit Menon-Sen, Adam Brightwell.
2014-12-23 15:31:29 -03:00
Alvaro Herrera 1826987a46 Use a bitmask to represent role attributes
The previous representation using a boolean column for each attribute
would not scale as well as we want to add further attributes.

Extra auxilliary functions are added to go along with this change, to
make up for the lost convenience of access of the old representation.

Catalog version bumped due to change in catalogs and the new functions.

Author: Adam Brightwell, minor tweaks by Álvaro
Reviewed by: Stephen Frost, Andres Freund, Álvaro Herrera
2014-12-23 10:22:09 -03:00
Alvaro Herrera 7eca575d1c get_object_address: separate domain constraints from table constraints
Apart from enabling comments on domain constraints, this enables a
future project to replicate object dropping to remote servers: with the
current mechanism there's no way to distinguish between the two types of
constraints, so there's no way to know what to drop.

Also added support for the domain constraint comments in psql's \dd and
pg_dump.

Catalog version bumped due to the change in ObjectType enum.
2014-12-23 09:06:44 -03:00
Peter Eisentraut 584e35d17c Change local_preload_libraries to PGC_USERSET
This allows it to be used with ALTER ROLE SET.

Although the old setting of PGC_BACKEND prevented changes after session
start, after discussion it was more useful to allow ALTER ROLE SET
instead and just document that changes during a session have no effect.
This is similar to how session_preload_libraries works already.

An alternative would be to change things to allow PGC_BACKEND and
PGC_SU_BACKEND settings to be changed by ALTER ROLE SET.  But that might
need further research (e.g., log_connections would probably not work).

based on patch by Kyotaro Horiguchi
2014-12-22 23:05:46 -05:00
Heikki Linnakangas 955557ddcc Move rbtree.c from src/backend/utils/misc to src/backend/lib.
We have other general-purpose data structures in src/backend/lib, so it
seems like a better home for the red-black tree as well.
2014-12-22 17:52:08 +02:00
Heikki Linnakangas e7032610f7 Use a pairing heap for the priority queue in kNN-GiST searches.
This performs slightly better, uses less memory, and needs slightly less
code in GiST, than the Red-Black tree previously used.

Reviewed by Peter Geoghegan
2014-12-22 12:05:57 +02:00
Heikki Linnakangas 2ef6c66a2b Fix file descriptor leak at end of recovery.
XLogFileInit() returns a file descriptor, which needs to be closed. The leak
was short-lived, since the startup process exits shortly afterwards, but it
was clearly a bug, nevertheless.

Per Coverity report.
2014-12-21 21:51:59 +02:00
Alvaro Herrera 0ee98d1cbf pg_event_trigger_dropped_objects: add behavior flags
Add "normal" and "original" flags as output columns to the
pg_event_trigger_dropped_objects() function.  With this it's possible to
distinguish which objects, among those listed, need to be explicitely
referenced when trying to replicate a deletion.

This is necessary so that the list of objects can be pruned to the
minimum necessary to replicate the DROP command in a remote server that
might have slightly different schema (for instance, TOAST tables and
constraints with different names and such.)

Catalog version bumped due to change of function definition.

Reviewed by: Abhijit Menon-Sen, Stephen Frost, Heikki Linnakangas,
Robert Haas.
2014-12-19 15:00:45 -03:00
Heikki Linnakangas 5c805d0a81 Fix timestamp in end-of-recovery WAL records.
We used time(null) to set a TimestampTz field, which gave bogus results.
Noticed while looking at pg_xlogdump output.

Backpatch to 9.3 and above, where the fast promotion was introduced.
2014-12-19 17:04:20 +02:00
Andres Freund 37de8de9e3 Prevent potentially hazardous compiler/cpu reordering during lwlock release.
In LWLockRelease() (and in 9.4+ LWLockUpdateVar()) we release enqueued
waiters using PGSemaphoreUnlock(). As there are other sources of such
unlocks backends only wake up if MyProc->lwWaiting is set to false;
which is only done in the aforementioned functions.

Before this commit there were dangers because the store to lwWaitLink
could become visible before the store to lwWaitLink. This could both
happen due to compiler reordering (on most compilers) and on some
platforms due to the CPU reordering stores.

The possible consequence of this is that a backend stops waiting
before lwWaitLink is set to NULL. If that backend then tries to
acquire another lock and has to wait there the list could become
corrupted once the lwWaitLink store is finally performed.

Add a write memory barrier to prevent that issue.

Unfortunately the barrier support has been only added in 9.2. Given
that the issue has not knowingly been observed in praxis it seems
sufficient to prohibit compiler reordering using volatile for 9.0 and
9.1. Actual problems due to compiler reordering are more likely
anyway.

Discussion: 20140210134625.GA15246@awork2.anarazel.de
2014-12-19 14:29:52 +01:00
Alvaro Herrera cd6e66572b Use %u to print out BlockNumber variables
Per Tom Lane
2014-12-18 17:59:00 -03:00
Alvaro Herrera 35192f0626 Have VACUUM log number of skipped pages due to pins
Author: Jim Nasby, some kibitzing by Heikki Linnankangas.
Discussion leading to current behavior and precise wording fueled by
thoughts from Robert Haas and Andres Freund.
2014-12-18 17:18:33 -03:00
Tom Lane 4a14f13a0a Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create().  Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate.  Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions.  hash_create() itself will
take care of optimizing when the key size is four bytes.

This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.

In future we could look into offering a similar optimized hashing function
for 8-byte keys.  Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.

For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules.  Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.

Teodor Sigaev and Tom Lane
2014-12-18 13:36:36 -05:00
Heikki Linnakangas ba94518aad Change how first WAL segment on new timeline after promotion is created.
Two changes:

1. When copying a WAL segment from old timeline to create the first segment
on the new timeline, only copy up to the point where the timeline switch
happens, and zero-fill the rest. This avoids corner cases where we might
think that the copied WAL from the previous timeline belong to the new
timeline.

2. If the timeline switch happens at a segment boundary, don't copy the
whole old segment to the new timeline. It's pointless, because it's 100%
identical to the old segment.
2014-12-18 20:23:03 +02:00
Fujii Masao 38628db8d8 Add memory barriers for PgBackendStatus.st_changecount protocol.
st_changecount protocol needs the memory barriers to ensure that
the apparent order of execution is as it desires. Otherwise,
for example, the CPU might rearrange the code so that st_changecount
is incremented twice before the modification on a machine with
weak memory ordering. This surprising result can lead to bugs.

This commit introduces the macros to load and store st_changecount
with the memory barriers. These are called before and after
PgBackendStatus entries are modified or copied into private memory,
in order to prevent CPU from reordering PgBackendStatus access.

Per discussion on pgsql-hackers, we decided not to back-patch this
to 9.4 or before until we get an actual bug report about this.

Patch by me. Review by Robert Haas.
2014-12-18 23:07:51 +09:00
Fujii Masao 19e065c049 Ensure variables live across calls in generate_series(numeric, numeric).
In generate_series_step_numeric(), the variables "start_num"
and "stop_num" may be potentially freed until the next call.
So they should be put in the location which can survive across calls.
But previously they were not, and which could cause incorrect
behavior of generate_series(numeric, numeric). This commit fixes
this problem by copying them on multi_call_memory_ctx.

Andrew Gierth
2014-12-18 21:13:52 +09:00
Fujii Masao 26674c923d Remove odd blank line in comment.
Etsuro Fujita
2014-12-18 17:33:38 +09:00
Andres Freund c303e9e7e5 Fix (re-)starting from a basebackup taken off a standby after a failure.
When starting up from a basebackup taken off a standby extra logic has
to be applied to compute the point where the data directory is
consistent. Normal base backups use a WAL record for that purpose, but
that isn't possible on a standby.

That logic had a error check ensuring that the cluster's control file
indicates being in recovery. Unfortunately that check was too strict,
disregarding the fact that the control file could also indicate that
the cluster was shut down while in recovery.

That's possible when the a cluster starting from a basebackup is shut
down before the backup label has been removed. When everything goes
well that's a short window, but when either restore_command or
primary_conninfo isn't configured correctly the window can get much
wider. That's because inbetween reading and unlinking the label we
restore the last checkpoint from WAL which can need additional WAL.

To fix simply also allow starting when the control file indicates
"shutdown in recovery". There's nicer fixes imaginable, but they'd be
more invasive.

Backpatch to 9.2 where support for taking basebackups from standbys
was added.
2014-12-18 08:47:27 +01:00
Tom Lane fc2ac1fb41 Allow CHECK constraints to be placed on foreign tables.
As with NOT NULL constraints, we consider that such constraints are merely
reports of constraints that are being enforced by the remote server (or
other underlying storage mechanism).  Their only real use is to allow
planner optimizations, for example in constraint-exclusion checks.  Thus,
the code changes here amount to little more than removal of the error that
was formerly thrown for applying CHECK to a foreign table.

(In passing, do a bit of cleanup of the ALTER FOREIGN TABLE reference page,
which had accumulated some weird decisions about ordering etc.)

Shigeru Hanada and Etsuro Fujita, reviewed by Kyotaro Horiguchi and
Ashutosh Bapat.
2014-12-17 17:00:53 -05:00
Tom Lane c977b8cffc Fix poorly worded error message.
Adam Brightwell, per report from Martín Marqués.
2014-12-17 13:14:53 -05:00
Tom Lane 66709133c7 Fix off-by-one loop count in MapArrayTypeName, and get rid of static array.
MapArrayTypeName would copy up to NAMEDATALEN-1 bytes of the base type
name, which of course is wrong: after prepending '_' there is only room for
NAMEDATALEN-2 bytes.  Aside from being the wrong result, this case would
lead to overrunning the statically allocated work buffer.  This would be a
security bug if the function were ever used outside bootstrap mode, but it
isn't, at least not in any currently supported branches.

Aside from fixing the off-by-one loop logic, this patch gets rid of the
static work buffer by having MapArrayTypeName pstrdup its result; the sole
caller was already doing that, so this just requires moving the pstrdup
call.  This saves a few bytes but mainly it makes the API a lot cleaner.

Back-patch on the off chance that there is some third-party code using
MapArrayTypeName with less-secure input.  Pushing pstrdup into the function
should not cause any serious problems for such hypothetical code; at worst
there might be a short term memory leak.

Per Coverity scanning.
2014-12-16 15:35:33 -05:00
Andrew Dunstan c8315930e6 Fix some jsonb issues found by Coverity in recent commits.
Mostly these issues concern the non-use of function results. These
have been changed to use (void) pushJsonbValue(...) instead of assigning
the result to a variable that gets overwritten before it is used.

There is a larger issue that we should possibly examine the API for
pushJsonbValue(), so that instead of returning a value it modifies a
state argument. The current idiom is rather clumsy. However, changing
that requires quite a bit more work, so this change should do for the
moment.
2014-12-16 10:32:06 -05:00
Heikki Linnakangas 4d65e16a6f Misc comment typo fixes.
Backpatch the applicable parts, just to make backpatching future patches
easier.
2014-12-16 16:37:46 +02:00
Tom Lane 9418820efb Fix point <-> polygon code for zero-distance case.
"PG_RETURN_FLOAT8(x)" is not "return x", except perhaps by accident
on some platforms.
2014-12-15 14:04:27 -05:00
Heikki Linnakangas 4520ba6769 Add point <-> polygon distance operator.
Alexander Korotkov, reviewed by Emre Hasegeli.
2014-12-15 17:06:21 +02:00
Peter Eisentraut ee3bec5e22 Translation updates 2014-12-15 00:25:35 -05:00
Andrew Dunstan e39b6f953e Add CINE option for CREATE TABLE AS and CREATE MATERIALIZED VIEW
Fabrízio de Royes Mello reviewed by Rushabh Lathia.
2014-12-13 13:56:09 -05:00
Tom Lane b0f479113a Repair corner-case bug in array version of percentile_cont().
The code for advancing through the input rows overlooked the case that we
might already be past the first row of the row pair now being considered,
in case the previous percentile also fell between the same two input rows.

Report and patch by Andrew Gierth; logic rewritten a bit for clarity by me.
2014-12-13 11:49:41 -05:00
Andrew Dunstan 7e354ab9fe Add several generator functions for jsonb that exist for json.
The functions are:
    to_jsonb()
    jsonb_object()
    jsonb_build_object()
    jsonb_build_array()
    jsonb_agg()
    jsonb_object_agg()

Also along the way some better logic is implemented in
json_categorize_type() to match that in the newly implemented
jsonb_categorize_type().

Andrew Dunstan, reviewed by Pavel Stehule and Alvaro Herrera.
2014-12-12 15:31:14 -05:00
Andrew Dunstan 237a882443 Add json_strip_nulls and jsonb_strip_nulls functions.
The functions remove object fields, including in nested objects, that
have null as a value. In certain cases this can lead to considerably
smaller datums, with no loss of semantic information.

Andrew Dunstan, reviewed by Pavel Stehule.
2014-12-12 09:00:43 -05:00
Heikki Linnakangas b1332e98c4 Put the logic to decide which synchronous standby is active into a function.
This avoids duplicating the code.

Michael Paquier, reviewed by Simon Riggs and me
2014-12-12 14:26:42 +02:00
Tom Lane 462bd95705 Fix planning of SELECT FOR UPDATE on child table with partial index.
Ordinarily we can omit checking of a WHERE condition that matches a partial
index's condition, when we are using an indexscan on that partial index.
However, in SELECT FOR UPDATE we must include the "redundant" filter
condition in the plan so that it gets checked properly in an EvalPlanQual
recheck.  The planner got this mostly right, but improperly omitted the
filter condition if the index in question was on an inheritance child
table.  In READ COMMITTED mode, this could result in incorrectly returning
just-updated rows that no longer satisfy the filter condition.

The cause of the error is using get_parse_rowmark() when get_plan_rowmark()
is what should be used during planning.  In 9.3 and up, also fix the same
mistake in contrib/postgres_fdw.  It's currently harmless there (for lack
of inheritance support) but wrong is wrong, and the incorrect code might
get copied to someplace where it's more significant.

Report and fix by Kyotaro Horiguchi.  Back-patch to all supported branches.
2014-12-11 21:02:25 -05:00
Tom Lane 2db576ba8c Fix corner case where SELECT FOR UPDATE could return a row twice.
In READ COMMITTED mode, if a SELECT FOR UPDATE discovers it has to redo
WHERE-clause checking on rows that have been updated since the SELECT's
snapshot, it invokes EvalPlanQual processing to do that.  If this first
occurs within a non-first child table of an inheritance tree, the previous
coding could accidentally re-return a matching row from an earlier,
already-scanned child table.  (And, to add insult to injury, I think this
could make it miss returning a row that should have been returned, if the
updated row that this happens on should still have passed the WHERE qual.)
Per report from Kyotaro Horiguchi; the added isolation test is based on his
test case.

This has been broken for quite awhile, so back-patch to all supported
branches.
2014-12-11 19:37:36 -05:00
Simon Riggs 2646d2d4a9 Further changes to REINDEX SCHEMA
Ensure we reindex indexes built on Mat Views.
Based on patch from Micheal Paquier

Add thorough tests to check that indexes on
tables, toast tables and mat views are reindexed.

Simon Riggs
2014-12-11 22:54:05 +00:00
Tom Lane 06d5803ffa Fix assorted confusion between Oid and int32.
In passing, also make some debugging elog's in pgstat.c a bit more
consistently worded.

Back-patch as far as applicable (9.3 or 9.4; none of these mistakes are
really old).

Mark Dilger identified and patched the type violations; the message
rewordings are mine.
2014-12-11 15:41:15 -05:00
Heikki Linnakangas 10eb7dfa9b Use correct macro for reltablespace.
It's an OID. WRITE_UINT_FIELD is identical to WRITE_OID_FIELD, but let's
be tidy.

Mark Dilger
2014-12-11 10:19:50 +02:00
Tom Lane 24688f4e5a Fix minor thinko in convertToJsonb().
The amount of space to reserve for the value's varlena header is
VARHDRSZ, not sizeof(VARHDRSZ).  The latter coding accidentally
failed to fail because of the way the VARHDRSZ macro is currently
defined; but if we ever change it to return size_t (as one might
reasonably expect it to do), convertToJsonb() would have failed.

Spotted by Mark Dilger.
2014-12-10 19:06:27 -05:00
Simon Riggs ae4e6887a4 Silence REINDEX
Previously REINDEX DATABASE and REINDEX SCHEMA
produced a stream of NOTICE messages. Removing that
since it is inconsistent for such a command to
produce output without a VERBOSE option.
2014-12-09 18:05:36 +09:00
Simon Riggs fe263d115a REINDEX SCHEMA
Add new SCHEMA option to REINDEX and reindexdb.

Sawada Masahiko

Reviewed by Michael Paquier and Fabrízio de Royes Mello
2014-12-09 00:28:00 +09:00
Simon Riggs 8001fe67a3 Windows: use GetSystemTimePreciseAsFileTime if available
PostgreSQL on Windows 8 or Windows Server 2012 will now
get high-resolution timestamps by dynamically loading the
GetSystemTimePreciseAsFileTime function. It'll fall back to
to GetSystemTimeAsFileTime if the higher precision variant
isn't found, so the same binaries without problems on older
Windows releases.

No attempt is made to detect the Windows version.  Only the
presence or absence of the desired function is considered.

Craig Ringer
2014-12-08 23:36:06 +09:00
Simon Riggs c270754719 Remove duplicate code in heap_prune_chain()
No need to set tuple tableOid twice

Jim Nasby
2014-12-08 08:44:37 +09:00
Simon Riggs 618c9430a8 Event Trigger for table_rewrite
Generate a table_rewrite event when ALTER TABLE
attempts to rewrite a table. Provide helper
functions to identify table and reason.

Intended use case is to help assess or to react
to schema changes that might hold exclusive locks
for long periods.

Dimitri Fontaine, triggering an edit by Simon Riggs

Reviewed in detail by Michael Paquier
2014-12-08 00:55:28 +09:00
Simon Riggs b8e33a85d4 Tweaks for recovery_target_action
Rename parameter action_at_recovery_target to
recovery_target_action suggested by Christoph Berg.

Place into recovery.conf suggested by Fujii Masao,
replacing (deprecating) earlier parameters, per
Michael Paquier.
2014-12-07 21:55:29 +09:00
Heikki Linnakangas 326b6f009f Print new track_commit_timestamp in rm_desc of a parameter-change record.
Michael Paquier
2014-12-05 12:11:43 +02:00
Heikki Linnakangas c846e67c46 Print wal_log_hints in the rm_desc routing of a parameter-change record.
It was an oversight in the original commit.

Also note in the sample config file that changing wal_log_hints requires a
restart.

Michael Paquier. Backpatch to 9.4, where wal_log_hints was added.
2014-12-05 12:00:48 +02:00
Robert Haas 9a94629833 Don't dump core if pq_comm_reset() is called before pq_init().
This can happen if an error occurs in a standalone backend.  This bug
was introduced by commit 2bd9e412f9.

Reported by Álvaro Herrera.
2014-12-04 19:49:43 -05:00
Alvaro Herrera 73c986adde Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData().  This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.

This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.

A new test in src/test/modules is included.

Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.

Authors: Álvaro Herrera and Petr Jelínek

Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 11:53:02 -03:00
Alvaro Herrera 6597ec9be6 Fix typos 2014-12-03 11:52:15 -03:00
Tom Lane 475aedd1ef Improve error messages for malformed array input strings.
Make the error messages issued by array_in() uniformly follow the style
	ERROR: malformed array literal: "actual input string"
	DETAIL: specific complaint here
and rewrite many of the specific complaints to be clearer.

The immediate motivation for doing this is a complaint from Josh Berkus
that json_to_record() produced an unintelligible error message when
dealing with an array item, because it tries to feed the JSON-format
array value to array_in().  Really it ought to be smart enough to
perform JSON-to-Postgres array conversion, but that's a future feature
not a bug fix.  In the meantime, this change is something we agreed
we could back-patch into 9.4, and it should help de-confuse things a bit.
2014-12-02 18:23:27 -05:00
Andres Freund 0fd38e1370 Don't skip SQL backends in logical decoding for visibility computation.
The logical decoding patchset introduced PROC_IN_LOGICAL_DECODING flag
PGXACT flag, that allows such backends to be skipped when computing
the xmin horizon/snapshots. That's fine and sensible for walsenders
streaming out logical changes, but not at all fine for SQL backends
doing logical decoding. If the latter set that flag any change they
have performed outside of logical decoding will not be regarded as
visible - which e.g. can lead to that change being vacuumed away.

Note that not setting the flag for SQL backends isn't particularly
bothersome - the SQL backend doesn't do streaming, so it only runs for
a limited amount of time.

Per buildfarm member 'tick' and Alvaro.

Backpatch to 9.4, where logical decoding was introduced.
2014-12-02 23:47:08 +01:00
Tom Lane 75ef435218 Fix JSON aggregates to work properly when final function is re-executed.
Davide S. reported that json_agg() sometimes produced multiple trailing
right brackets.  This turns out to be because json_agg_finalfn() attaches
the final right bracket, and was doing so by modifying the aggregate state
in-place.  That's verboten, though unfortunately it seems there's no way
for nodeAgg.c to check for such mistakes.

Fix that back to 9.3 where the broken code was introduced.  In 9.4 and
HEAD, likewise fix json_object_agg(), which had copied the erroneous logic.
Make some cosmetic cleanups as well.
2014-12-02 15:02:37 -05:00
Tom Lane 1511521a36 Minor cleanup of function declarations for BRIN.
Get rid of PG_FUNCTION_INFO_V1() macros, which are quite inappropriate
for built-in functions (possibly leftovers from testing as a loadable
module?).  Also, fix gratuitous inconsistency between SQL-level and
C-level names of the minmax support functions.
2014-12-02 14:07:54 -05:00
Tom Lane 0927bf8060 Guard against bad "dscale" values in numeric_recv().
We were not checking to see if the supplied dscale was valid for the given
digit array when receiving binary-format numeric values.  While dscale can
validly be more than the number of nonzero fractional digits, it shouldn't
be less; that case causes fractional digits to be hidden on display even
though they're there and participate in arithmetic.

Bug #12053 from Tommaso Sala indicates that there's at least one broken
client library out there that sometimes supplies an incorrect dscale value,
leading to strange behavior.  This suggests that simply throwing an error
might not be the best response; it would lead to failures in applications
that might seem to be working fine today.  What seems the least risky fix
is to truncate away any digits that would be hidden by dscale.  This
preserves the existing behavior in terms of what will be printed for the
transmitted value, while preventing subsequent arithmetic from producing
results inconsistent with that.

In passing, throw a specific error for the case of dscale being outside
the range that will fit into a numeric's header.  Before you got "value
overflows numeric format", which is a bit misleading.

Back-patch to all supported branches.
2014-12-01 15:25:02 -05:00
Andrew Dunstan e09996ff8d Fix hstore_to_json_loose's detection of valid JSON number values.
We expose a function IsValidJsonNumber that internally calls the lexer
for json numbers. That allows us to use the same test everywhere,
instead of inventing a broken test for hstore conversions. The new
function is also used in datum_to_json, replacing the code that is now
moved to the new function.

Backpatch to 9.3 where hstore_to_json_loose was introduced.
2014-12-01 11:28:45 -05:00
Alvaro Herrera ae04bf5027 Update transaction README for persistent multixacts
Multixacts are now maintained during recovery, but the README didn't get
the memo.  Backpatch to 9.3, where the divergence was introduced.
2014-11-28 18:06:18 -03:00
Tom Lane d25367ec4f Add bms_get_singleton_member(), and use it where appropriate.
This patch adds a function that replaces a bms_membership() test followed
by a bms_singleton_member() call, performing both the test and the
extraction of a singleton set's member in one scan of the bitmapset.
The performance advantage over the old way is probably minimal in current
usage, but it seems worthwhile on notational grounds anyway.

David Rowley
2014-11-28 14:16:24 -05:00
Tom Lane f4e031c662 Add bms_next_member(), and use it where appropriate.
This patch adds a way of iterating through the members of a bitmapset
nondestructively, unlike the old way with bms_first_member().  While
bms_next_member() is very slightly slower than bms_first_member()
(at least for typical-size bitmapsets), eliminating the need to palloc
and pfree a temporary copy of the target bitmapset is a significant win.
So this method should be preferred in all cases where a temporary copy
would be necessary.

Tom Lane, with suggestions from Dean Rasheed and David Rowley
2014-11-28 13:37:25 -05:00
Tom Lane 96d66bcfc6 Improve performance of OverrideSearchPathMatchesCurrent().
This function was initially coded on the assumption that it would not be
performance-critical, but that turns out to be wrong in workloads that
are heavily dependent on the speed of plpgsql functions.  Speed it up by
hard-coding the comparison rules, thereby avoiding palloc/pfree traffic
from creating and immediately freeing an OverrideSearchPath object.
Per report from Scott Marlowe.
2014-11-28 12:37:27 -05:00
Tom Lane e384ed6cde Improve typcache: cache negative lookup results, add invalidation logic.
Previously, if the typcache had for example tried and failed to find a hash
opclass for a given data type, it would nonetheless repeat the unsuccessful
catalog lookup each time it was asked again.  This can lead to a
significant amount of useless bufmgr traffic, as in a recent report from
Scott Marlowe.  Like the catalog caches, typcache should be able to cache
negative results.  This patch arranges that by making use of separate flag
bits to remember whether a particular item has been looked up, rather than
treating a zero OID as an indicator that no lookup has been done.

Also, install a credible invalidation mechanism, namely watching for inval
events in pg_opclass.  The sole advantage of the lack of negative caching
was that the code would cope if operators or opclasses got added for a type
mid-session; to preserve that behavior we have to be able to invalidate
stale lookup results.  Updates in pg_opclass should be pretty rare in
production systems, so it seems sufficient to just invalidate all the
dependent data whenever one happens.

Adding proper invalidation also means that this code will now react sanely
if an opclass is dropped mid-session.  Arguably, that's a back-patchable
bug fix, but in view of the lack of complaints from the field I'll refrain
from back-patching.  (Probably, in most cases where an opclass is dropped,
the data type itself is dropped soon after, so that this misfeasance has
no bad consequences.)
2014-11-28 12:19:14 -05:00
Heikki Linnakangas afeacd2748 Fix assertion failure at end of PITR.
InitXLogInsert() cannot be called in a critical section, because it
allocates memory. But CreateCheckPoint() did that, when called for the
end-of-recovery checkpoint by the startup process.

In the passing, fix the scratch space allocation in InitXLogInsert to go to
the right memory context. Also update the comment at InitXLOGAccess, which
hasn't been totally accurate since hot standby was introduced (in a hot
standby backend, InitXLOGAccess isn't called at backend startup).

Reported by Michael Paquier
2014-11-28 09:31:53 +02:00
Stephen Frost 143b39c185 Rename pg_rowsecurity -> pg_policy and other fixes
As pointed out by Robert, we should really have named pg_rowsecurity
pg_policy, as the objects stored in that catalog are policies.  This
patch fixes that and updates the column names to start with 'pol' to
match the new catalog name.

The security consideration for COPY with row level security, also
pointed out by Robert, has also been addressed by remembering and
re-checking the OID of the relation initially referenced during COPY
processing, to make sure it hasn't changed under us by the time we
finish planning out the query which has been built.

Robert and Alvaro also commented on missing OCLASS and OBJECT entries
for POLICY (formerly ROWSECURITY or POLICY, depending) in various
places.  This patch fixes that too, which also happens to add the
ability to COMMENT on policies.

In passing, attempt to improve the consistency of messages, comments,
and documentation as well.  This removes various incarnations of
'row-security', 'row-level security', 'Row-security', etc, in favor
of 'policy', 'row level security' or 'row_security' as appropriate.

Happy Thanksgiving!
2014-11-27 01:15:57 -05:00
Robert Haas a6c84c770e Attempt to suppress uninitialized variable warning.
Report by Heikki Linnakangas.
2014-11-25 20:07:07 -05:00
Tom Lane d934a05234 Fix uninitialized-variable warning.
In passing, add an Assert defending the presumption that bytes_left
is positive to start with.  (I'm not exactly convinced that using an
unsigned type was such a bright thing here, but let's at least do
this much.)
2014-11-25 15:17:16 -05:00
Simon Riggs aedccb1f6f action_at_recovery_target recovery config option
action_at_recovery_target = pause | promote | shutdown

Petr Jelinek

Reviewed by Muhammad Asif Naeem, Fujji Masao and
Simon Riggs
2014-11-25 20:13:30 +00:00
Tom Lane bac27394a1 Support arrays as input to array_agg() and ARRAY(SELECT ...).
These cases formerly failed with errors about "could not find array type
for data type".  Now they yield arrays of the same element type and one
higher dimension.

The implementation involves creating functions with API similar to the
existing accumArrayResult() family.  I (tgl) also extended the base family
by adding an initArrayResult() function, which allows callers to avoid
special-casing the zero-inputs case if they just want an empty array as
result.  (Not all do, so the previous calling convention remains valid.)
This allowed simplifying some existing code in xml.c and plperl.c.

Ali Akbar, reviewed by Pavel Stehule, significantly modified by me
2014-11-25 12:21:28 -05:00
Stephen Frost 25976710df Add int64 -> int8 mapping to genbki
Per discussion with Tom and Andrew, 64bit integers are no longer a
problem for the catalogs, so go ahead and add the mapping from the C
int64 type to the int8 SQL identification to allow using them.

Patch by Adam Brightwell
2014-11-25 12:12:19 -05:00
Heikki Linnakangas b3fc6727ce Allow using connection URI in primary_conninfo.
The old method of appending options to the connection string didn't work if
the primary_conninfo was a postgres:// style URI, instead of a traditional
connection string. Use PQconnectdbParams instead.

Alex Shulgin
2014-11-25 18:26:05 +02:00
Heikki Linnakangas e453cc2741 Make Port->ssl_in_use available, even when built with !USE_SSL
Code that check the flag no longer need #ifdef's, which is more convenient.
In particular, makes it easier to write extensions that depend on it.

In the passing, modify sslinfo's ssl_is_used function to check ssl_in_use
instead of the OpenSSL specific 'ssl' pointer. It doesn't make any
difference currently, as sslinfo is only compiled when built with OpenSSL,
but seems cleaner anyway.
2014-11-25 09:46:11 +02:00
Robert Haas f5d9698a84 Add infrastructure to save and restore GUC values.
This is further infrastructure for parallelism.

Amit Khandekar, Noah Misch, Robert Haas
2014-11-24 16:37:56 -05:00
Heikki Linnakangas 49b86fb1c9 Add a few paragraphs to B-tree README explaining L&Y algorithm.
This gives an overview of what Lehman & Yao's paper is all about, so that
you can understand the rest of the README without having to read the paper.

Per discussion with Peter Geoghegan and others.
2014-11-24 13:43:33 +02:00
Heikki Linnakangas 0bd624d63b Distinguish XLOG_FPI records generated for hint-bit updates.
Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images
generated for hint bit updates, when checksums are enabled. The new record
type is replayed exactly the same as XLOG_FPI, but allows them to be tallied
separately e.g. in pg_xlogdump.
2014-11-24 11:09:08 +02:00
Tom Lane b62f94c603 Allow simplification of EXISTS() subqueries containing LIMIT.
The locution "EXISTS(SELECT ... LIMIT 1)" seems to be rather common among
people who don't realize that the database already performs optimizations
equivalent to putting LIMIT 1 in the sub-select.  Unfortunately, this was
actually making things worse, because it prevented us from optimizing such
EXISTS clauses into semi or anti joins.  Teach simplify_EXISTS_query() to
suppress constant-positive LIMIT clauses.  That fixes the semi/anti-join
case, and may help marginally even for cases that have to be left as
sub-SELECTs.

Marti Raudsepp, reviewed by David Rowley
2014-11-22 19:12:38 -05:00
Tom Lane 9c58101117 Fix mishandling of system columns in FDW queries.
postgres_fdw would send query conditions involving system columns to the
remote server, even though it makes no effort to ensure that system
columns other than CTID match what the remote side thinks.  tableoid,
in particular, probably won't match and might have some use in queries.
Hence, prevent sending conditions that include non-CTID system columns.

Also, create_foreignscan_plan neglected to check local restriction
conditions while determining whether to set fsSystemCol for a foreign
scan plan node.  This again would bollix the results for queries that
test a foreign table's tableoid.

Back-patch the first fix to 9.3 where postgres_fdw was introduced.
Back-patch the second to 9.2.  The code is probably broken in 9.1 as
well, but the patch doesn't apply cleanly there; given the weak state
of support for FDWs in 9.1, it doesn't seem worth fixing.

Etsuro Fujita, reviewed by Ashutosh Bapat, and somewhat modified by me
2014-11-22 16:01:05 -05:00
Tom Lane 447770404c Rearrange CustomScan API.
Make it work more like FDW plans do: instead of assuming that there are
expressions in a CustomScan plan node that the core code doesn't know
about, insist that all subexpressions that need planner attention be in
a "custom_exprs" list in the Plan representation.  (Of course, the
custom plugin can break the list apart again at executor initialization.)
This lets us revert the parts of the patch that exposed setrefs.c and
subselect.c processing to the outside world.

Also revert the GetSpecialCustomVar stuff in ruleutils.c; that concept
may work in future, but it's far from fully baked right now.
2014-11-21 18:21:46 -05:00
Tom Lane c2ea2285e9 Simplify API for initially hooking custom-path providers into the planner.
Instead of register_custom_path_provider and a CreateCustomScanPath
callback, let's just provide a standard function hook in set_rel_pathlist.
This is more flexible than what was previously committed, is more like the
usual conventions for planner hooks, and requires less support code in the
core.  We had discussed this design (including centralizing the
set_cheapest() calls) back in March or so, so I'm not sure why it wasn't
done like this already.
2014-11-21 14:05:46 -05:00
Heikki Linnakangas 622983ea69 No need to call XLogEnsureRecordSpace when the relation is unlogged.
Amit Kapila
2014-11-21 15:13:15 +02:00
Heikki Linnakangas 8f5dcb56cb Fix bogus comments in XLogRecordAssemble
Pointed out by Michael Paquier
2014-11-21 12:15:27 +02:00
Tom Lane adbfab119b Remove dead code supporting mark/restore in SeqScan, TidScan, ValuesScan.
There seems no prospect that any of this will ever be useful, and indeed
it's questionable whether some of it would work if it ever got called;
it's certainly not been exercised in a very long time, if ever. So let's
get rid of it, and make the comments about mark/restore in execAmi.c less
wishy-washy.

The mark/restore support for Result nodes is also currently dead code,
but that's due to planner limitations not because it's impossible that
it could be useful.  So I left it in.
2014-11-20 20:20:54 -05:00
Tom Lane a34fa8ee7c Initial code review for CustomScan patch.
Get rid of the pernicious entanglement between planner and executor headers
introduced by commit 0b03e5951b.

Also, rearrange the CustomFoo struct/typedef definitions so that all the
typedef names are seen as used by the compiler.  Without this pgindent
will mess things up a bit, which is not so important perhaps, but it also
removes a bizarre discrepancy between the declaration arrangement used for
CustomExecMethods and that used for CustomScanMethods and
CustomPathMethods.

Clean up the commentary around ExecSupportsMarkRestore to reflect the
rather large change in its API.

Const-ify register_custom_path_provider's argument.  This necessitates
casting away const in the function, but that seems better than forcing
callers of the function to do so (or else not const-ify their method
pointer structs, which was sort of the whole point).

De-export fix_expr_common.  I don't like the exporting of fix_scan_expr
or replace_nestloop_params either, but this one surely has got little
excuse.
2014-11-20 18:36:07 -05:00
Tom Lane 081a6048cf Fix another oversight in CustomScan patch.
execCurrent.c's search_plan_tree() must recognize a CustomScan on the
target relation.  This would only be helpful for custom providers that
support CurrentOfExpr quals, which is probably a bit far-fetched, but
it's not impossible I think.  But even without assuming that, we need
to recognize a scanned-relation match so that we will properly throw
error if the desired relation is being scanned with both a CustomScan
and a regular scan (ie, self-join).

Also recognize ForeignScanState for similar reasons.  Supporting WHERE
CURRENT OF on a foreign table is probably even more far-fetched than
it is for custom scans, but I think in principle you could do it with
postgres_fdw (or another FDW that supports the ctid column).  This
would be a back-patchable bug fix if existing FDWs handled CurrentOfExpr,
but I doubt any do so I won't bother back-patching.
2014-11-20 15:56:39 -05:00
Tom Lane 03e574af5f Fix another oversight in CustomScan patch.
disuse_physical_tlist() must work for all plan types handled by
create_scan_plan().
2014-11-20 14:49:02 -05:00
Tom Lane f9e0255c6f Add missing case for CustomScan.
Per KaiGai Kohei.

In passing improve formatting of some code added in commit 30d7ae3c,
because otherwise pgindent will make a mess of it.
2014-11-20 12:32:34 -05:00
Heikki Linnakangas f464042161 Silence compiler warning about variable being used uninitialized.
It's a false positive - the variable is only used when 'onleft' is true,
and it is initialized in that case. But the compiler doesn't necessarily
see that.
2014-11-20 19:17:19 +02:00
Heikki Linnakangas 2c03216d83 Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.

There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.

This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.

For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.

The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.

Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 18:46:41 +02:00
Simon Riggs 606c0123d6 Reduce btree scan overhead for < and > strategies
For <, <=, > and >= strategies, mark the first scan key
as already matched if scanning in an appropriate direction.
If index tuple contains no nulls we can skip the first
re-check for each tuple.

Author: Rajeev Rastogi
Reviewer: Haribabu Kommi
Rework of the code and comments by Simon Riggs
2014-11-18 10:24:55 +00:00
Heikki Linnakangas c73669c0e0 Fix WAL-logging of B-tree "unlink halfdead page" operation.
There was some confusion on how to record the case that the operation
unlinks the last non-leaf page in the branch being deleted.
_bt_unlink_halfdead_page set the "topdead" field in the WAL record to
the leaf page, but the redo routine assumed that it would be an invalid
block number in that case. This commit fixes _bt_unlink_halfdead_page to
do what the redo routine expected.

This code is new in 9.4, so backpatch there.
2014-11-17 18:45:46 +02:00
Alvaro Herrera 0f9692b40d Fix relpersistence setting in reindex_index
Buildfarm members with CLOBBER_CACHE_ALWAYS advised us that commit
85b506bbfc was mistaken in setting the relpersistence value of the
index directly in the relcache entry, within reindex_index.  The reason
for the failure is that an invalidation message that comes after mucking
with the relcache entry directly, but before writing it to the catalogs,
would cause the entry to become rebuilt in place from catalogs with the
old contents, losing the update.

Fix by passing the correct persistence value to
RelationSetNewRelfilenode instead; this routine also writes the updated
tuple to pg_class, avoiding the problem.  Suggested by Tom Lane.
2014-11-17 11:23:35 -03:00
Peter Eisentraut 7466a1b75f Translation updates 2014-11-16 21:32:51 -05:00
Simon Riggs 0f66d21201 Emit msg re skipping ANALYZE for absent inh tree
When checking a table that has an inheritance tree marked,
if no child tables remain, we skip ANALYZE. This patch emits
a message to show that the action has been skipped.

Author: Etsuro Fujita
Reviewer: Furuya Osamu
2014-11-15 22:49:54 +00:00
Alvaro Herrera 85b506bbfc Get rid of SET LOGGED indexes persistence kludge
This removes ATChangeIndexesPersistence() introduced by f41872d0c1
which was too ugly to live for long.  Instead, the correct persistence
marking is passed all the way down to reindex_index, so that the
transient relation built to contain the index relfilenode can
get marked correctly right from the start.

Author: Fabrízio de Royes Mello
Review and editorialization by Michael Paquier
                                     and Álvaro Herrera
2014-11-15 01:19:49 -03:00
Andres Freund 98ec7fd903 Sync unlogged relations to disk after they have been reset.
Unlogged relations are only reset when performing a unclean
restart. That means they have to be synced to disk during clean
shutdowns. During normal processing that's achieved by registering a
buffer's file to be fsynced at the next checkpoint when flushed. But
ResetUnloggedRelations() doesn't go through the buffer manager, so
nothing will force reset relations to disk before the next shutdown
checkpoint.

So just make ResetUnloggedRelations() fsync the newly created main
forks to disk.

Discussion: 20140912112246.GA4984@alap3.anarazel.de

Backpatch to 9.1 where unlogged tables were introduced.

Abhijit Menon-Sen and Andres Freund
2014-11-15 01:19:31 +01:00
Andres Freund d3586fc8aa Ensure unlogged tables are reset even if crash recovery errors out.
Unlogged relations are reset at the end of crash recovery as they're
only synced to disk during a proper shutdown. Unfortunately that and
later steps can fail, e.g. due to running out of space. This reset
was, up to now performed after marking the database as having finished
crash recovery successfully. As out of space errors trigger a crash
restart that could lead to the situation that not all unlogged
relations are reset.

Once that happend usage of unlogged relations could yield errors like
"could not open file "...": No such file or directory". Luckily
clusters that show the problem can be fixed by performing a immediate
shutdown, and starting the database again.

To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT)
earlier, before marking the database as having successfully recovered.

Discussion: 20140912112246.GA4984@alap3.anarazel.de

Backpatch to 9.1 where unlogged tables were introduced.

Abhijit Menon-Sen and Andres Freund
2014-11-15 01:19:26 +01:00
Stephen Frost 80eacaa3cd Clean up includes from RLS patch
The initial patch for RLS mistakenly included headers associated with
the executor and planner bits in rewrite/rowsecurity.h.  Per policy and
general good sense, executor headers should not be included in planner
headers or vice versa.

The include of execnodes.h was a mistaken holdover from previous
versions, while the include of relation.h was used for Relation's
definition, which should have been coming from utils/relcache.h.  This
patch cleans these issues up, adds comments to the RowSecurityPolicy
struct and the RowSecurityConfigType enum, and changes Relation->rsdesc
to Relation->rd_rsdesc to follow Relation field naming convention.

Additionally, utils/rel.h was including rewrite/rowsecurity.h, which
wasn't a great idea since that was pulling in things not really needed
in utils/rel.h (which gets included in quite a few places).  Instead,
use 'struct RowSecurityDesc' for the rd_rsdesc field and add comments
explaining why.

Lastly, add an include into access/nbtree/nbtsort.c for
utils/sortsupport.h, which was evidently missed due to the above mess.

Pointed out by Tom in 16970.1415838651@sss.pgh.pa.us; note that the
concerns regarding a similar situation in the custom-path commit still
need to be addressed.
2014-11-14 17:05:17 -05:00
Alvaro Herrera 51f9ea25dc Allow interrupting GetMultiXactIdMembers
This function has a loop which can lead to uninterruptible process
"stalls" (actually infinite loops) when some bugs are triggered.  Avoid
that unpleasant situation by adding a check for interrupts in a place
that shouldn't degrade performance in the normal case.

Backpatch to 9.3.  Older branches have an identical loop here, but the
aforementioned bugs are only a problem starting in 9.3 so there doesn't
seem to be any point in backpatching any further.
2014-11-14 15:14:01 -03:00
Andres Freund 0c5af0a537 Move BufferGetBlockNumber() out of heap_page_is_all_visible()'s inner loop.
In some workloads BufferGetBlockNumber() shows up in profiles due to
the sheer number of calls to it (and because it causes cache
misses). The compiler can't move it out of the loop because it's a
full extern function call...
2014-11-14 17:04:44 +01:00
Peter Eisentraut a15d387c22 Improve logical decoding log messages
suggestions from Robert Haas
2014-11-13 20:44:34 -05:00
Andres Freund 89fd41b390 Fix and improve cache invalidation logic for logical decoding.
There are basically three situations in which logical decoding needs
to perform cache invalidation. During/After replaying a transaction
with catalog changes, when skipping a uninteresting transaction that
performed catalog changes and when erroring out while replaying a
transaction. Unfortunately these three cases were all done slightly
differently - partially because 8de3e410fa, which greatly simplifies
matters, got committed in the midst of the development of logical
decoding.

The actually problematic case was when logical decoding skipped
transaction commits (and thus processed invalidations). When used via
the SQL interface cache invalidation could access the catalog - bad,
because we didn't set up enough state to allow that correctly. It'd
not be hard to setup sufficient state, but the simpler solution is to
always perform cache invalidation outside a valid transaction.

Also make the different cache invalidation cases look as similar as
possible, to ease code review.

This fixes the assertion failure reported by Antonin Houska in
53EE02D9.7040702@gmail.com. The presented testcase has been expanded
into a regression test.

Backpatch to 9.4, where logical decoding was introduced.
2014-11-13 20:34:31 +01:00
Andres Freund 5a2c184058 Fix xmin/xmax horizon computation during logical decoding initialization.
When building the initial historic catalog snapshot there were
scenarios where snapbuild.c would use incorrect xmin/xmax values when
starting from a xl_running_xacts record. The values used were always a
bit suspect, but happened to be correct in the easy to test
cases. Notably the values used when the the initial snapshot was
computed while no other transactions were running were correct.

This is likely to be the cause of the occasional buildfarm failures on
animals markhor and tick; but it's quite possible to reproduce
problems without CLOBBER_CACHE_ALWAYS.

Backpatch to 9.4, where logical decoding was introduced.
2014-11-13 20:34:30 +01:00
Heikki Linnakangas 81c4508196 Fix race condition between hot standby and restoring a full-page image.
There was a window in RestoreBackupBlock where a page would be zeroed out,
but not yet locked. If a backend pinned and locked the page in that window,
it saw the zeroed page instead of the old page or new page contents, which
could lead to missing rows in a result set, or errors.

To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins,
zeroes, and locks the page, if it's not in the buffer cache already.

In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE,
to avoid breaking any 3rd party extensions that might use RBM_ZERO. More
importantly, this avoids renumbering the other enum values, which would
cause even bigger confusion in extensions that use ReadBufferExtended, but
haven't been recompiled.

Backpatch to all supported versions; this has been racy since hot standby
was introduced.
2014-11-13 20:02:37 +02:00
Robert Haas c0828b78e9 Move the guts of our Levenshtein implementation into core.
The hope is that we can use this to produce better diagnostics in
some cases.

Peter Geoghegan, reviewed by Michael Paquier, with some further
changes by me.
2014-11-13 12:33:26 -05:00
Heikki Linnakangas 34402ae351 Fix XLogReadBufferForRedoExtended to get cleanup lock when asked to do so. 2014-11-13 17:54:20 +02:00
Fujii Masao c291503b1c Rename pending_list_cleanup_size to gin_pending_list_limit.
Since this parameter is only for GIN index, it's better to
add "gin" to the parameter name for easier understanding.
2014-11-13 12:14:48 +09:00
Tom Lane 677708032c Explicitly support the case that a plancache's raw_parse_tree is NULL.
This only happens if a client issues a Parse message with an empty query
string, which is a bit odd; but since it is explicitly called out as legal
by our FE/BE protocol spec, we'd probably better continue to allow it.

Fix by adding tests everywhere that the raw_parse_tree field is passed to
functions that don't or shouldn't accept NULL.  Also make it clear in the
relevant comments that NULL is an expected case.

This reverts commits a73c9dbab0 and
2e9650cbcf, which fixed specific crash
symptoms by hacking things at what now seems to be the wrong end, ie the
callee functions.  Making the callees allow NULL is superficially more
robust, but it's not always true that there is a defensible thing for the
callee to do in such cases.  The caller has more context and is better
able to decide what the empty-query case ought to do.

Per followup discussion of bug #11335.  Back-patch to 9.2.  The code
before that is sufficiently different that it would require development
of a separate patch, which doesn't seem worthwhile for what is believed
to be an essentially cosmetic change.
2014-11-12 15:59:01 -05:00
Andres Freund ec5896aed3 Fix several weaknesses in slot and logical replication on-disk serialization.
Heikki noticed in 544E23C0.8090605@vmware.com that slot.c and
snapbuild.c were missing the FIN_CRC32 call when computing/checking
checksums of on disk files. That doesn't lower the the error detection
capabilities of the checksum, but is inconsistent with other usages.

In a followup mail Heikki also noticed that, contrary to a comment,
the 'version' and 'length' struct fields of replication slot's on disk
data where not covered by the checksum. That's not likely to lead to
actually missed corruption as those fields are cross checked with the
expected version and the actual file length. But it's wrong
nonetheless.

As fixing these issues makes existing on disk files unreadable, bump
the expected versions of on disk files for both slots and logical
decoding historic catalog snapshots.  This means that loading old
files will fail with
ERROR: "replication slot file ... has unsupported version 1"
and
ERROR: "snapbuild state file ... has unsupported version 1 instead of
2" respectively. Given the low likelihood of anybody already using
these new features in a production setup that seems acceptable.

Fixing these issues made me notice that there's no regression test
covering the loading of historic snapshot from disk - so add one.

Backpatch to 9.4 where these features were introduced.
2014-11-12 18:52:49 +01:00
Noah Misch 28245b8424 Use just one database connection in the "tablespace" test.
On Windows, DROP TABLESPACE has a race condition when run concurrently
with other processes having opened files in the tablespace.  This led to
a rare failure on buildfarm member frogmouth.  Back-patch to 9.4, where
the reconnection was introduced.
2014-11-12 07:33:17 -05:00
Peter Eisentraut 8339f33d68 Message improvements 2014-11-11 20:02:30 -05:00
Robert Haas f1abd78be7 Remove incorrect comment.
This was introduced by commit 5ea86e6e65.

Peter Geoghegan
2014-11-11 18:41:29 -05:00
Tom Lane 2edfc021c6 Fix dependency searching for case where column is visited before table.
When the recursive search in dependency.c visits a column and then later
visits the whole table containing the column, it needs to propagate the
drop-context flags for the table to the existing target-object entry for
the column.  Otherwise we might refuse the DROP (if not CASCADE) on the
incorrect grounds that there was no automatic drop pathway to the column.
Remarkably, this has not been reported before, though it's possible at
least when an extension creates both a datatype and a table using that
datatype.

Rather than just marking the column as allowed to be dropped, it might
seem good to skip the DROP COLUMN step altogether, since the later DROP
of the table will surely get the job done.  The problem with that is that
the datatype would then be dropped before the table (since the whole
situation occurred because we visited the datatype, and then recursed to
the dependent column, before visiting the table).  That seems pretty risky,
and the case is rare enough that it doesn't seem worth expending a lot of
effort or risk to make the drops happen in a safe order.  So we just play
dumb and delete the column separately according to the existing drop
ordering rules.

Per report from Petr Jelinek, though this is different from his proposed
patch.

Back-patch to 9.1, where extensions were introduced.  There's currently
no evidence that such cases can arise before 9.1, and in any case we would
also need to back-patch cb5c2ba2d8 to 9.0
if we wanted to back-patch this.
2014-11-11 17:00:11 -05:00
Fujii Masao 1871c89202 Add generate_series(numeric, numeric).
Платон Малюгин
Reviewed by Michael Paquier, Ali Akbar and Marti Raudsepp
2014-11-11 21:44:46 +09:00
Fujii Masao a1b395b6a2 Add GUC and storage parameter to set the maximum size of GIN pending list.
Previously the maximum size of GIN pending list was controlled only by
work_mem. But the reasonable value of work_mem and the reasonable size
of the list are basically not the same, so it was not appropriate to
control both of them by only one GUC, i.e., work_mem. This commit
separates new GUC, pending_list_cleanup_size, from work_mem to allow
users to control only the size of the list.

Also this commit adds pending_list_cleanup_size as new storage parameter
to allow users to specify the size of the list per index. This is useful,
for example, when users want to increase the size of the list only for
the GIN index which can be updated heavily, and decrease it otherwise.

Reviewed by Etsuro Fujita.
2014-11-11 21:08:21 +09:00
Alvaro Herrera a590f266e4 BRIN: fix bug in xlog backup block counting
The code that generates the BRIN_XLOG_UPDATE removes the buffer
reference when the page that's target for the updated tuple is freshly
initialized.  This is a pretty usual optimization, but was breaking the
case where the revmap buffer, which is referenced in the same WAL
record, is getting a backup block: the replay code was using backup
block index 1, which is not valid when the update target buffer gets
pruned; the revmap buffer gets assigned 0 instead.  Make sure to use the
correct backup block index for revmap when replaying.

Bug reported by Fujii Masao.
2014-11-10 18:13:49 -03:00
Robert Haas c8df9477f8 Fix potential NULL-pointer dereference.
Commit 2781b4bea7 arranged to defer
the setup of after-trigger-related data structures, but
AfterTriggerPendingOnRel didn't get the memo.
2014-11-10 15:22:46 -05:00
Tom Lane bf7ca15875 Ensure that RowExprs and whole-row Vars produce the expected column names.
At one time it wasn't terribly important what column names were associated
with the fields of a composite Datum, but since the introduction of
operations like row_to_json(), it's important that looking up the rowtype
ID embedded in the Datum returns the column names that users would expect.
That did not work terribly well before this patch: you could get the column
names of the underlying table, or column aliases from any level of the
query, depending on minor details of the plan tree.  You could even get
totally empty field names, which is disastrous for cases like row_to_json().

To fix this for whole-row Vars, look to the RTE referenced by the Var, and
make sure its column aliases are applied to the rowtype associated with
the result Datums.  This is a tad scary because we might have to return
a transient RECORD type even though the Var is declared as having some
named rowtype.  In principle it should be all right because the record
type will still be physically compatible with the named rowtype; but
I had to weaken one Assert in ExecEvalConvertRowtype, and there might be
third-party code containing similar assumptions.

Similarly, RowExprs have to be willing to override the column names coming
from a named composite result type and produce a RECORD when the column
aliases visible at the site of the RowExpr differ from the underlying
table's column names.

In passing, revert the decision made in commit 398f70ec07 to add
an alias-list argument to ExecTypeFromExprList: better to provide that
functionality in a separate function.  This also reverts most of the code
changes in d685814835, which we don't need because we're no longer
depending on the tupdesc found in the child plan node's result slot to be
blessed.

Back-patch to 9.4, but not earlier, since this solution changes the results
in some cases that users might not have realized were buggy.  We'll apply a
more restricted form of this patch in older branches.
2014-11-10 15:21:09 -05:00
Alvaro Herrera 1e0b4365c2 Further code and wording tweaks in BRIN
Besides a couple of typo fixes, per David Rowley, Thom Brown, and Amit
Langote, and mentions of BRIN in the general CREATE INDEX page again per
David, this includes silencing MSVC compiler warnings (thanks Microsoft)
and an additional variable initialization per Coverity scanner.
2014-11-10 15:56:08 -03:00
Kevin Grittner 96a73fcdac Fix compiler warning for non-assert builds.
Reported by Peter Geoghegan
David Rowley
2014-11-10 09:42:46 -06:00
Alvaro Herrera b89ee54e20 Fix some coding issues in BRIN
Reported by David Rowley: variadic macros are a problem.  Get rid of
them using a trick suggested by Tom Lane: add extra parentheses where
needed.  In the future we might decide we don't need the calls at all
and remove them, but it seems appropriate to keep them while this code
is still new.

Also from David Rowley: brininsert() was trying to use a variable before
initializing it.  Fix by moving the brin_form_tuple call (which
initializes the variable) to within the locked section.

Reported by Peter Eisentraut: can't use "new" as a struct member name,
because C++ compilers will choke on it, as reported by cpluspluscheck.
2014-11-08 00:31:03 -03:00
Robert Haas 0b03e5951b Introduce custom path and scan providers.
This allows extension modules to define their own methods for
scanning a relation, and get the core code to use them.  It's
unclear as yet how much use this capability will find, but we
won't find out if we never commit it.

KaiGai Kohei, reviewed at various times and in various levels
of detail by Shigeru Hanada, Tom Lane, Andres Freund, Álvaro
Herrera, and myself.
2014-11-07 17:34:36 -05:00
Heikki Linnakangas 7250d8535b Fix building with WAL_DEBUG.
Now that the backup blocks are appended to the WAL record in xloginsert.c,
XLogInsert doesn't see them anymore and cannot remove them from the version
reconstructed for xlog_outdesc. This makes running with wal_debug=on more
expensive, as we now make (unnecessary) temporary copies of the backup
blocks, but it doesn't seem worth convoluting the code to keep that
optimization.

Reported by Alvaro Herrera.
2014-11-07 23:09:31 +02:00
Robert Haas 5ea86e6e65 Use the sortsupport infrastructure in more cases.
This removes some fmgr overhead from cases such as btree index builds.

Peter Geoghegan, reviewed by Andreas Karlsson and me.
2014-11-07 15:50:55 -05:00
Alvaro Herrera 7516f52594 BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes.  They work by maintaining "summary" data about
block ranges.  Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not.  Normal index scans are not supported
because these indexes do not store TIDs.

As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.

For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range.  This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results.  In this commit I only include minmax.

Catalog version bumped due to new builtin catalog entries.

There's more that could be done here, but this is a good step forwards.

Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.

Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.

PS:
  The research leading to these results has received funding from the
  European Union's Seventh Framework Programme (FP7/2007-2013) under
  grant agreement n° 318633.
2014-11-07 16:38:14 -03:00
Heikki Linnakangas 1961b1c131 Fix generation of SP-GiST vacuum WAL records.
I broke these in 8776faa81c. Backpatch to
9.4, where that was done.
2014-11-07 21:17:46 +02:00
Heikki Linnakangas 2effb72e68 Remove obsolete cases from GiST update redo code.
The code that generated a record to clear the F_TUPLES_DELETED flag hasn't
existed since we got rid of old-style VACUUM FULL. I kept the code that sets
the flag, although it's not used for anything anymore, because it might
still be interesting information for debugging purposes that some tuples
have been deleted from a page.

Likewise, the code to turn the root page from non-leaf to leaf page was
removed when we got rid of old-style VACUUM FULL. Remove the code to replay
that action, too.
2014-11-07 15:13:02 +02:00
Tom Lane d6e37b35cd Cope with more than 64K phrases in a thesaurus dictionary.
dict_thesaurus stored phrase IDs in uint16 fields, so it would get confused
and even crash if there were more than 64K entries in the configuration
file.  It turns out to be basically free to widen the phrase IDs to uint32,
so let's just do so.

This was complained of some time ago by David Boutin (in bug #7793);
he later submitted an informal patch but it was never acted on.
We now have another complaint (bug #11901 from Luc Ouellette) so it's
time to make something happen.

This is basically Boutin's patch, but for future-proofing I also added a
defense against too many words per phrase.  Note that we don't need any
explicit defense against overflow of the uint32 counters, since before that
happens we'd hit array allocation sizes that repalloc rejects.

Back-patch to all supported branches because of the crash risk.
2014-11-06 20:52:40 -05:00
Tom Lane 4875931938 Fix normalization of numeric values in JSONB GIN indexes.
The default JSONB GIN opclass (jsonb_ops) converts numeric data values
to strings for storage in the index.  It must ensure that numeric values
that would compare equal (such as 12 and 12.00) produce identical strings,
else index searches would have behavior different from regular JSONB
comparisons.  Unfortunately the function charged with doing this was
completely wrong: it could reduce distinct numeric values to the same
string, or reduce equivalent numeric values to different strings.  The
former type of error would only lead to search inefficiency, but the
latter type of error would cause index entries that should be found by
a search to not be found.

Repairing this bug therefore means that it will be necessary for 9.4 beta
testers to reindex GIN jsonb_ops indexes, if they care about getting
correct results from index searches involving numeric data values within
the comparison JSONB object.

Per report from Thomas Fanghaenel.
2014-11-06 11:41:06 -05:00
Fujii Masao 5332b8cec5 Prevent the unnecessary creation of .ready file for the timeline history file.
Previously .ready file was created for the timeline history file at the end
of an archive recovery even when WAL archiving was not enabled.
This creation is unnecessary and causes .ready file to remain infinitely.

This commit changes an archive recovery so that it creates .ready file for
the timeline history file only when WAL archiving is enabled.

Backpatch to all supported versions.
2014-11-06 21:24:40 +09:00
Heikki Linnakangas 2076db2aea Move the backup-block logic from XLogInsert to a new file, xloginsert.c.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.

Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.

Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-11-06 13:55:36 +02:00
Fujii Masao d2b8a2c7ec Fix typo in comment.
Etsuro Fujita
2014-11-06 20:04:11 +09:00
Fujii Masao 08309aaf74 Implement IF NOT EXIST for CREATE INDEX.
Fabrízio de Royes Mello, reviewed by Marti Raudsepp, Adam Brightwell and me.
2014-11-06 18:48:33 +09:00
Tom Lane 525a489915 Remove the last vestige of server-side autocommit.
Long ago we briefly had an "autocommit" GUC that turned server-side
autocommit on and off.  That behavior was removed in 7.4 after concluding
that it broke far too much client-side logic, and making clients cope with
both behaviors was impractical.  But the GUC variable was left behind, so
as not to break any client code that might be trying to read its value.
Enough time has now passed that we should remove the GUC completely.
Whatever vestigial backwards-compatibility benefit it had is outweighed by
the risk of confusion for newbies who assume it ought to do something,
as per a recent complaint from Wolfgang Wilhelm.

In passing, adjust what seemed to me a rather confusing documentation
reference to libpq's autocommit behavior.  libpq as such knows nothing
about autocommit, so psql is probably what was meant.
2014-11-05 19:35:23 -05:00
Robert Haas c30be9787b Fix thinko in commit 2bd9e412f9.
Obviously, every translation unit should not be declaring this
separately.  It needs to be PGDLLIMPORT as well, to avoid breaking
third-party code that uses any of the functions that the commit
mentioned above changed to macros.
2014-11-05 17:12:23 -05:00
Tom Lane 465d7e1882 Make CREATE TYPE print warnings if a datatype's I/O functions are volatile.
This is a followup to commit 43ac12c6e6,
which added regression tests checking that I/O functions of built-in
types are not marked volatile.  Complaining in CREATE TYPE should push
developers of add-on types to fix any misdeclared functions in their
types.  It's just a warning not an error, to avoid creating upgrade
problems for what might be just cosmetic mis-markings.

Aside from adding the warning code, fix a number of types that were
sloppily created in the regression tests.
2014-11-05 11:44:06 -05:00
Tom Lane 33f80f8480 Drop no-longer-needed buffers during ALTER DATABASE SET TABLESPACE.
The previous coding assumed that we could just let buffers for the
database's old tablespace age out of the buffer arena naturally.
The folly of that is exposed by bug #11867 from Marc Munro: the user could
later move the database back to its original tablespace, after which any
still-surviving buffers would match lookups again and appear to contain
valid data.  But they'd be missing any changes applied while the database
was in the new tablespace.

This has been broken since ALTER SET TABLESPACE was introduced, so
back-patch to all supported branches.
2014-11-04 13:24:06 -05:00
Heikki Linnakangas 5028f22f6e Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.

Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.

The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 11:39:48 +02:00
Robert Haas 2bd9e412f9 Support frontend-backend protocol communication using a shm_mq.
A background worker can use pq_redirect_to_shm_mq() to direct protocol
that would normally be sent to the frontend to a shm_mq so that another
process may read them.

The receiving process may use pq_parse_errornotice() to parse an
ErrorResponse or NoticeResponse from the background worker and, if
it wishes, ThrowErrorData() to propagate the error (with or without
further modification).

Patch by me.  Review by Andres Freund.
2014-10-31 12:02:40 -04:00
Robert Haas f7102b0463 Extend dsm API with a new function dsm_unpin_mapping.
This reassociates a dynamic shared memory handle previous passed to
dsm_pin_mapping with the current resource owner, so that it will be
cleaned up at the end of the current query.

Patch by me.  Review of the function name by Andres Freund, Amit
Kapila, Jim Nasby, Petr Jelinek, and Álvaro Herrera.
2014-10-30 14:55:23 -04:00
Tom Lane fd0f651a86 Test IsInTransactionChain, not IsTransactionBlock, in vac_update_relstats.
As noted by Noah Misch, my initial cut at fixing bug #11638 didn't cover
all cases where ANALYZE might be invoked in an unsafe context.  We need to
test the result of IsInTransactionChain not IsTransactionBlock; which is
notationally a pain because IsInTransactionChain requires an isTopLevel
flag, which would have to be passed down through several levels of callers.
I chose to pass in_outer_xact (ie, the result of IsInTransactionChain)
rather than isTopLevel per se, as that seemed marginally more apropos
for the intermediate functions to know about.
2014-10-30 13:04:06 -04:00
Robert Haas 6057c212f3 "Pin", rather than "keep", dynamic shared memory mappings and segments.
Nobody seemed concerned about this naming when it originally went in,
but there's a pending patch that implements the opposite of
dsm_keep_mapping, and the term "unkeep" was judged unpalatable.
"unpin" has existing precedent in the PostgreSQL code base, and the
English language, so use this terminology instead.

Per discussion, back-patch to 9.4.
2014-10-30 11:35:55 -04:00
Tom Lane e0722d9cb5 Avoid corrupting tables when ANALYZE inside a transaction is rolled back.
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally.  This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow.  It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back.  This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table.  However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.

To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al.  There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back.  In any case we probably don't want
to change this behavior in back branches.

Per bug #11638 from Casey Shobe.  This has been broken for a good long
time, so back-patch to all supported branches.

Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
2014-10-29 18:12:02 -04:00
Robert Haas 6cb4afff33 Avoid setup work for invalidation messages at start-of-(sub)xact.
Instead of initializing a new TransInvalidationInfo for every
transaction or subtransaction, we can just do it for those
transactions or subtransactions that actually need to queue
invalidation messages.  That also avoids needing to free those
entries at the end of a transaction or subtransaction that does
not generate any invalidation messages, which is by far the
common case.

Patch by me.  Review by Simon Riggs and Andres Freund.
2014-10-29 12:35:19 -04:00
Tom Lane a00d468e65 Remove obsolete commentary.
Since we got rid of non-MVCC catalog scans, the fourth reason given for
using a non-transactional update in index_update_stats() is obsolete.
The other three are still good, so we're not going to change the code,
but fix the comment.
2014-10-28 18:36:02 -04:00
Heikki Linnakangas 18f158ef69 Remove unnecessary assignment.
Reported by MauMau.
2014-10-28 20:26:20 +02:00
Heikki Linnakangas 22926e00f7 Fix two bugs in tsquery @> operator.
1. The comparison for matching terms used only the CRC to decide if there's
a match. Two different terms with the same CRC gave a match.

2. It assumed that if the second operand has more terms than the first, it's
never a match. That assumption is bogus, because there can be duplicate
terms in either operand.

Rewrite the implementation in a way that doesn't have those bugs.

Backpatch to all supported versions.
2014-10-27 10:50:41 +02:00
Tom Lane a4523c5aa5 Improve planning of btree index scans using ScalarArrayOpExpr quals.
Since we taught btree to handle ScalarArrayOpExpr quals natively (commit
9e8da0f757), the planner has always included
ScalarArrayOpExpr quals in index conditions if possible.  However, if the
qual is for a non-first index column, this could result in an inferior plan
because we can no longer take advantage of index ordering (cf. commit
807a40c551).  It can be better to omit the
ScalarArrayOpExpr qual from the index condition and let it be done as a
filter, so that the output doesn't need to get sorted.  Indeed, this is
true for the query introduced as a test case by the latter commit.

To fix, restructure get_index_paths and build_index_paths so that we
consider paths both with and without ScalarArrayOpExpr quals in non-first
index columns.  Redesign the API of build_index_paths so that it reports
what it found, saving useless second or third calls.

Report and patch by Andrew Gierth (though rather heavily modified by me).
Back-patch to 9.2 where this code was introduced, since the issue can
result in significant performance regressions compared to plans produced
by 9.1 and earlier.
2014-10-26 16:12:22 -04:00
Robert Haas 85bb81de53 Fix off-by-one error in 2781b4bea7.
Spotted by Tom Lane.
2014-10-24 08:18:28 -04:00
Alvaro Herrera b01a4f6838 Update README.tuplock
This file was documenting an older version of patch 0ac5ad5134; update
it to match what was really committed

Author: Florian Pflug
2014-10-23 20:51:58 -03:00
Tom Lane b34d6f03db Improve ispell dictionary's defenses against bad affix files.
Don't crash if an ispell dictionary definition contains flags but not
any compound affixes.  (This isn't a security issue since only superusers
can install affix files, but still it's a bad thing.)

Also, be more careful about detecting whether an affix-file FLAG command
is old-format (ispell) or new-format (myspell/hunspell).  And change the
error message about mixed old-format and new-format commands into something
intelligible.

Per bug #11770 from Emre Hasegeli.  Back-patch to all supported branches.
2014-10-23 13:12:00 -04:00
Robert Haas 2781b4bea7 Perform less setup work for AFTER triggers at transaction start.
Testing reveals that the memory allocation we do at transaction start
has small but measurable overhead on simple transactions.  To cut
down on that overhead, defer some of that work to the point when
AFTER triggers are first used, thus avoiding it altogether if they
never are.

Patch by me.  Review by Andres Freund.
2014-10-23 12:33:02 -04:00
Robert Haas 5ac372fc1a Add a function to get the authenticated user ID.
Previously, this was not exposed outside of miscinit.c.  It is needed
for the pending pg_background patch, and will also be needed for
parallelism.  Without it, there's no way for a background worker to
re-create the exact authentication environment that was present in the
process that started it, which could lead to security exposures.
2014-10-23 08:18:45 -04:00
Fujii Masao c7371c4a60 Prevent the already-archived WAL file from being archived again.
Previously the archive recovery always created .ready file for
the last WAL file of the old timeline at the end of recovery even when
it's restored from the archive and has .done file. That is, there was
the case where the WAL file had both .ready and .done files.
This caused the already-archived WAL file to be archived again.

This commit prevents the archive recovery from creating .ready file
for the last WAL file if it has .done file, in order to prevent it from
being archived again.

This bug was added when cascading replication feature was introduced,
i.e., the commit 5286105800.
So, back-patch to 9.2, where cascading replication was added.

Reviewed by Michael Paquier
2014-10-23 16:21:27 +09:00
Peter Eisentraut e64d3c5635 Minimize calls of pg_class_aclcheck to minimum necessary
In a couple of code paths, pg_class_aclcheck is called in succession
with multiple different modes set.  This patch combines those modes to
have a single call of this function and reduce a bit process overhead
for permission checking.

Author: Michael Paquier <michael@otacoo.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
2014-10-22 21:41:43 -04:00
Heikki Linnakangas 98b3743779 Update comment.
The _bt_tuplecompare() function mentioned in comment hasn't existed for a
long time.

Peter Geoghegan
2014-10-22 15:44:07 +03:00
Peter Eisentraut 6f04368cfc Allow input format xxxx-xxxx-xxxx for macaddr type
Author: Herwin Weststrate <herwin@quarantainenet.nl>
Reviewed-by: Ali Akbar <the.apaan@gmail.com>
2014-10-21 16:16:39 -04:00
Andres Freund 5e5b65f359 Don't duplicate log_checkpoint messages for both of restart and checkpoints.
The duplication originated in cdd46c765, where restartpoints were
introduced.

In LogCheckpointStart's case the duplication actually lead to the
compiler's format string checking not to be effective because the
format string wasn't constant.

Arguably these messages shouldn't be elog(), but ereport() style
messages. That'd even allow to translate the messages... But as
there's more mistakes of that kind in surrounding code, it seems
better to change that separately.
2014-10-21 01:01:56 +02:00
Andres Freund 11abd6c90f Renumber CHECKPOINT_* flags.
Commit 7dbb606938 added a new CHECKPOINT_FLUSH_ALL flag. As that
commit needed to be backpatched I didn't change the numeric values of
the existing flags as that could lead to nastly problems if any
external code issued checkpoints. That's not a concern on master, so
renumber them there.

Also add a comment about CHECKPOINT_FLUSH_ALL above
CreateCheckPoint().
2014-10-21 00:20:08 +02:00
Andres Freund 7dbb606938 Flush unlogged table's buffers when copying or moving databases.
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source
database directory on the filesystem level. To ensure the on disk
state is consistent they block out users of the affected database and
force a checkpoint to flush out all data to disk. Unfortunately, up to
now, that checkpoint didn't flush out dirty buffers from unlogged
relations.

That bug means there could be leftover dirty buffers in either the
template database, or the database in its old location. Leading to
problems when accessing relations in an inconsistent state; and to
possible problems during shutdown in the SET TABLESPACE case because
buffers belonging files that don't exist anymore are flushed.

This was reported in bug #10675 by Maxim Boguk.

Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and
Fujii Masao.

Backpatch to 9.1 where unlogged tables were introduced.
2014-10-20 23:43:46 +02:00
Tom Lane f330a6d140 Fix mishandling of FieldSelect-on-whole-row-Var in nested lateral queries.
If an inline-able SQL function taking a composite argument is used in a
LATERAL subselect, and the composite argument is a lateral reference,
the planner could fail with "variable not found in subplan target list",
as seen in bug #11703 from Karl Bartel.  (The outer function call used in
the bug report and in the committed regression test is not really necessary
to provoke the bug --- you can get it if you manually expand the outer
function into "LATERAL (SELECT inner_function(outer_relation))", too.)

The cause of this is that we generate the reltargetlist for the referenced
relation before doing eval_const_expressions() on the lateral sub-select's
expressions (cf find_lateral_references()), so what's scheduled to be
emitted by the referenced relation is a whole-row Var, not the simplified
single-column Var produced by optimizing the function's FieldSelect on the
whole-row Var.  Then setrefs.c fails to match up that lateral reference to
what's available from the outer scan.

Preserving the FieldSelect optimization in such cases would require either
major planner restructuring (to recursively do expression simplification
on sub-selects much earlier) or some amazingly ugly kluge to change the
reltargetlist of a possibly-already-planned relation.  It seems better
just to skip the optimization when the Var is from an upper query level;
the case is not so common that it's likely anyone will notice a few
wasted cycles.

AFAICT this problem only occurs for uplevel LATERAL references, so
back-patch to 9.3 where LATERAL was added.
2014-10-20 12:23:42 -04:00
Robert Haas bc279c92f0 Fix typos.
David Rowley
2014-10-20 10:33:16 -04:00
Robert Haas 0f565c0743 Fix typos.
Etsuro Fujita
2014-10-20 10:23:40 -04:00
Peter Eisentraut 7feaccc217 Allow setting effective_io_concurrency even on unsupported systems
This matches the behavior of other parameters that are unsupported on
some systems (e.g., ssl).

Also document the default value.
2014-10-18 21:35:46 -04:00
Bruce Momjian b87671f1b6 Shorten warning about hash creation
Also document that PITR is also affected.
2014-10-18 10:36:09 -04:00
Bruce Momjian 417f92484d interval: tighten precision specification
interval precision can only be specified after the "interval" keyword if
no units are specified.

Previously we incorrectly checked the units to see if the precision was
legal, causing confusion.

Report by Alvaro Herrera
2014-10-18 10:31:00 -04:00
Tom Lane 5ba062ee44 Avoid core dump in _outPathInfo() for Path without a parent RelOptInfo.
Nearly all Paths have parents, but a ResultPath representing an empty FROM
clause does not.  Avoid a core dump in such cases.  I believe this is only
a hazard for debugging usage, not for production, else we'd have heard
about it before.  Nonetheless, back-patch to 9.1 where the troublesome code
was introduced.  Noted while poking at bug #11703.
2014-10-17 22:33:52 -04:00
Tom Lane b2cbced9ee Support timezone abbreviations that sometimes change.
Up to now, PG has assumed that any given timezone abbreviation (such as
"EDT") represents a constant GMT offset in the usage of any particular
region; we had a way to configure what that offset was, but not for it
to be changeable over time.  But, as with most things horological, this
view of the world is too simplistic: there are numerous regions that have
at one time or another switched to a different GMT offset but kept using
the same timezone abbreviation.  Almost the entire Russian Federation did
that a few years ago, and later this month they're going to do it again.
And there are similar examples all over the world.

To cope with this, invent the notion of a "dynamic timezone abbreviation",
which is one that is referenced to a particular underlying timezone
(as defined in the IANA timezone database) and means whatever it currently
means in that zone.  For zones that use or have used daylight-savings time,
the standard and DST abbreviations continue to have the property that you
can specify standard or DST time and get that time offset whether or not
DST was theoretically in effect at the time.  However, the abbreviations
mean what they meant at the time in question (or most recently before that
time) rather than being absolutely fixed.

The standard abbreviation-list files have been changed to use this behavior
for abbreviations that have actually varied in meaning since 1970.  The
old simple-numeric definitions are kept for abbreviations that have not
changed, since they are a bit faster to resolve.

While this is clearly a new feature, it seems necessary to back-patch it
into all active branches, because otherwise use of Russian zone
abbreviations is going to become even more problematic than it already was.
This change supersedes the changes in commit 513d06ded et al to modify the
fixed meanings of the Russian abbreviations; since we've not shipped that
yet, this will avoid an undesirably incompatible (not to mention incorrect)
change in behavior for timestamps between 2011 and 2014.

This patch makes some cosmetic changes in ecpglib to keep its usage of
datetime lookup tables as similar as possible to the backend code, but
doesn't do anything about the increasingly obsolete set of timezone
abbreviation definitions that are hard-wired into ecpglib.  Whatever we
do about that will likely not be appropriate material for back-patching.
Also, a potential free() of a garbage pointer after an out-of-memory
failure in ecpglib has been fixed.

This patch also fixes pre-existing bugs in DetermineTimeZoneOffset() that
caused it to produce unexpected results near a timezone transition, if
both the "before" and "after" states are marked as standard time.  We'd
only ever thought about or tested transitions between standard and DST
time, but that's not what's happening when a zone simply redefines their
base GMT offset.

In passing, update the SGML documentation to refer to the Olson/zoneinfo/
zic timezone database as the "IANA" database, since it's now being
maintained under the auspices of IANA.
2014-10-16 15:22:10 -04:00
Tom Lane 90063a7612 Print planning time only in EXPLAIN ANALYZE, not plain EXPLAIN.
We've gotten enough push-back on that change to make it clear that it
wasn't an especially good idea to do it like that.  Revert plain EXPLAIN
to its previous behavior, but keep the extra output in EXPLAIN ANALYZE.
Per discussion.

Internally, I set this up as a separate flag ExplainState.summary that
controls printing of planning time and execution time.  For now it's
just copied from the ANALYZE option, but we could consider exposing it
to users.
2014-10-15 18:50:13 -04:00
Heikki Linnakangas e0d97d77bf Fix deadlock with LWLockAcquireWithVar and LWLockWaitForVar.
LWLockRelease should release all backends waiting with LWLockWaitForVar,
even when another backend has already been woken up to acquire the lock,
i.e. when releaseOK is false. LWLockWaitForVar can return as soon as the
protected value changes, even if the other backend will acquire the lock.
Fix that by resetting releaseOK to true in LWLockWaitForVar, whenever
adding itself to the wait queue.

This should fix the bug reported by MauMau, where the system occasionally
hangs when there is a lot of concurrent WAL activity and a checkpoint.
Backpatch to 9.4, where this code was added.
2014-10-14 10:06:47 +03:00
Bruce Momjian fe65280fea C comments: adjust execTuples.c for new structure
Report by Peter Geoghegan
2014-10-13 16:54:38 -04:00
Bruce Momjian d6c8e8d7ac Consistently use NULL for invalid GUC unit strings
Patch by Euler Taveira
2014-10-13 16:11:43 -04:00
Kevin Grittner 30d7ae3c76 Increase number of hash join buckets for underestimate.
If we expect batching at the very beginning, we size nbuckets for
"full work_mem" (see how many tuples we can get into work_mem,
while not breaking NTUP_PER_BUCKET threshold).

If we expect to be fine without batching, we start with the 'right'
nbuckets and track the optimal nbuckets as we go (without actually
resizing the hash table). Once we hit work_mem (considering the
optimal nbuckets value), we keep the value.

At the end of the first batch, we check whether (nbuckets !=
nbuckets_optimal) and resize the hash table if needed. Also, we
keep this value for all batches (it's OK because it assumes full
work_mem, and it makes the batchno evaluation trivial). So the
resize happens only once.

There could be cases where it would improve performance to allow
the NTUP_PER_BUCKET threshold to be exceeded to keep everything in
one batch rather than spilling to a second batch, but attempts to
generate such a case have so far been unsuccessful; that issue may
be addressed with a follow-on patch after further investigation.

Tomas Vondra with minor format and comment cleanup by me
Reviewed by Robert Haas, Heikki Linnakangas, and Kevin Grittner
2014-10-13 10:16:36 -05:00
Peter Eisentraut b7a08c8028 Message improvements 2014-10-12 01:06:35 -04:00
Tom Lane 4a50de1312 Fix bogus optimization in JSONB containment tests.
When determining whether one JSONB object contains another, it's okay to
make a quick exit if the first object has fewer pairs than the second:
because we de-duplicate keys within objects, it is impossible that the
first object has all the keys the second does.  However, the code was
applying this rule to JSONB arrays as well, where it does *not* hold
because arrays can contain duplicate entries.  The test was really in
the wrong place anyway; we should do it within JsonbDeepContains, where
it can be applied to nested objects not only top-level ones.

Report and test cases by Alexander Korotkov; fix by Peter Geoghegan and
Tom Lane.
2014-10-11 14:13:51 -04:00
Alvaro Herrera 7b1c2a0f20 Split builtins.h to a new header ruleutils.h
The new header contains many prototypes for functions in ruleutils.c
that are not exposed to the SQL level.

Reviewed by Andres Freund and Michael Paquier.
2014-10-08 18:10:47 -03:00
Robert Haas 7bb0e97407 Extend shm_mq API with new functions shm_mq_sendv, shm_mq_set_handle.
shm_mq_sendv sends a message to the queue assembled from multiple
locations.  This is expected to be used by forthcoming patches to
allow frontend/backend protocol messages to be sent via shm_mq, but
might be useful for other purposes as well.

shm_mq_set_handle associates a BackgroundWorkerHandle with an
already-existing shm_mq_handle.  This solves a timing problem when
creating a shm_mq to communicate with a newly-launched background
worker: if you attach to the queue first, and the background worker
fails to start, you might block forever trying to do I/O on the queue;
but if you start the background worker first, but then die before
attaching to the queue, the background worrker might block forever
trying to do I/O on the queue.  This lets you attach before starting
the worker (so that the worker is protected) and then associate the
BackgroundWorkerHandle later (so that you are also protected).

Patch by me, reviewed by Stephen Frost.
2014-10-08 14:38:31 -04:00
Alvaro Herrera df630b0dd5 Implement SKIP LOCKED for row-level locks
This clause changes the behavior of SELECT locking clauses in the
presence of locked rows: instead of causing a process to block waiting
for the locks held by other processes (or raise an error, with NOWAIT),
SKIP LOCKED makes the new reader skip over such rows.  While this is not
appropriate behavior for general purposes, there are some cases in which
it is useful, such as queue-like tables.

Catalog version bumped because this patch changes the representation of
stored rules.

Reviewed by Craig Ringer (based on a previous attempt at an
implementation by Simon Riggs, who also provided input on the syntax
used in the current patch), David Rowley, and Álvaro Herrera.

Author: Thomas Munro
2014-10-07 17:23:34 -03:00
Robert Haas c421efd213 Fix typo in elog message. 2014-10-07 00:08:59 -04:00
Peter Eisentraut 1ec4a970fe Translation updates 2014-10-05 23:23:50 -04:00
Robert Haas d0410d6603 Eliminate one background-worker-related flag variable.
Teach sigusr1_handler() to use the same test for whether a worker
might need to be started as ServerLoop().  Aside from being perhaps
a bit simpler, this prevents a potentially-unbounded delay when
starting a background worker.  On some platforms, select() doesn't
return when interrupted by a signal, but is instead restarted,
including a reset of the timeout to the originally-requested value.
If signals arrive often enough, but no connection requests arrive,
sigusr1_handler() will be executed repeatedly, but the body of
ServerLoop() won't be reached.  This change ensures that, even in
that case, background workers will eventually get launched.

This is far from a perfect fix; really, we need select() to return
control to ServerLoop() after an interrupt, either via the self-pipe
trick or some other mechanism.  But that's going to require more
work and discussion, so let's do this for now to at least mitigate
the damage.

Per investigation of test_shm_mq failures on buildfarm member anole.
2014-10-04 21:25:41 -04:00
Tom Lane 513d06ded1 Update time zone data files to tzdata release 2014h.
Most zones in the Russian Federation are subtracting one or two hours
as of 2014-10-26.  Update the meanings of the abbreviations IRKT, KRAT,
MAGT, MSK, NOVT, OMST, SAKT, VLAT, YAKT, YEKT to match.

The IANA timezone database has adopted abbreviations of the form AxST/AxDT
for all Australian time zones, reflecting what they believe to be current
majority practice Down Under.  These names do not conflict with usage
elsewhere (other than ACST for Acre Summer Time, which has been in disuse
since 1994).  Accordingly, adopt these names into our "Default" timezone
abbreviation set.  The "Australia" abbreviation set now contains only
CST,EAST,EST,SAST,SAT,WST, all of which are thought to be mostly historical
usage.  Note that SAST has also been changed to be South Africa Standard
Time in the "Default" abbreviation set.

Add zone abbreviations SRET (Asia/Srednekolymsk) and XJT (Asia/Urumqi),
and use WSST/WSDT for western Samoa.

Also a DST law change in the Turks & Caicos Islands (America/Grand_Turk),
and numerous corrections for historical time zone data.
2014-10-04 14:18:19 -04:00
Stephen Frost 78d72563ef Fix CreatePolicy, pg_dump -v; psql and doc updates
Peter G pointed out that valgrind was, rightfully, complaining about
CreatePolicy() ending up copying beyond the end of the parsed policy
name.  Name is a fixed-size type and we need to use namein (through
DirectFunctionCall1()) to flush out the entire array before we pass
it down to heap_form_tuple.

Michael Paquier pointed out that pg_dump --verbose was missing a
newline and Fabrízio de Royes Mello further pointed out that the
schema was also missing from the messages, so fix those also.

Also, based on an off-list comment from Kevin, rework the psql \d
output to facilitate copy/pasting into a new CREATE or ALTER POLICY
command.

Lastly, improve the pg_policies view and update the documentation for
it, along with a few other minor doc corrections based on an off-list
discussion with Adam Brightwell.
2014-10-03 16:31:53 -04:00
Alvaro Herrera 1021bd6a89 Don't balance vacuum cost delay when per-table settings are in effect
When there are cost-delay-related storage options set for a table,
trying to make that table participate in the autovacuum cost-limit
balancing algorithm produces undesirable results: instead of using the
configured values, the global values are always used,
as illustrated by Mark Kirkwood in
http://www.postgresql.org/message-id/52FACF15.8020507@catalyst.net.nz

Since the mechanism is already complicated, just disable it for those
cases rather than trying to make it cope.  There are undesirable
side-effects from this too, namely that the total I/O impact on the
system will be higher whenever such tables are vacuumed.  However, this
is seen as less harmful than slowing down vacuum, because that would
cause bloat to accumulate.  Anyway, in the new system it is possible to
tweak options to get the precise behavior one wants, whereas with the
previous system one was simply hosed.

This has been broken forever, so backpatch to all supported branches.
This might affect systems where cost_limit and cost_delay have been set
for individual tables.
2014-10-03 13:01:27 -03:00
Robert Haas 017b2e9822 Fix typos in comments.
Etsuro Fujita
2014-10-03 11:47:27 -04:00
Heikki Linnakangas 7690ddea0c Check for GiST index tuples that don't fit on a page.
The page splitting code would go into infinite recursion if you try to
insert an index tuple that doesn't fit even on an empty page.

Per analysis and suggested fix by Andrew Gierth. Fixes bug #11555, reported
by Bryan Seitz (analysis happened over IRC). Backpatch to all supported
versions.
2014-10-03 14:49:52 +03:00
Robert Haas 3acc10c997 Increase the number of buffer mapping partitions to 128.
Testing by Amit Kapila, Andres Freund, and myself, with and without
other patches that also aim to improve scalability, seems to indicate
that this change is a significant win over the current value and over
smaller values such as 64.  It's not clear how high we can push this
value before it starts to have negative side-effects elsewhere, but
going this far looks OK.
2014-10-02 13:58:50 -04:00
Tom Lane 5a6c168c78 Fix some more problems with nested append relations.
As of commit a87c72915 (which later got backpatched as far as 9.1),
we're explicitly supporting the notion that append relations can be
nested; this can occur when UNION ALL constructs are nested, or when
a UNION ALL contains a table with inheritance children.

Bug #11457 from Nelson Page, as well as an earlier report from Elvis
Pranskevichus, showed that there were still nasty bugs associated with such
cases: in particular the EquivalenceClass mechanism could try to generate
"join" clauses connecting an appendrel child to some grandparent appendrel,
which would result in assertion failures or bogus plans.

Upon investigation I concluded that all current callers of
find_childrel_appendrelinfo() need to be fixed to explicitly consider
multiple levels of parent appendrels.  The most complex fix was in
processing of "broken" EquivalenceClasses, which are ECs for which we have
been unable to generate all the derived equality clauses we would like to
because of missing cross-type equality operators in the underlying btree
operator family.  That code path is more or less entirely untested by
the regression tests to date, because no standard opfamilies have such
holes in them.  So I wrote a new regression test script to try to exercise
it a bit, which turned out to be quite a worthwhile activity as it exposed
existing bugs in all supported branches.

The present patch is essentially the same as far back as 9.2, which is
where parameterized paths were introduced.  In 9.0 and 9.1, we only need
to back-patch a small fragment of commit 5b7b5518d, which fixes failure to
propagate out the original WHERE clauses when a broken EC contains constant
members.  (The regression test case results show that these older branches
are noticeably stupider than 9.2+ in terms of the quality of the plans
generated; but we don't really care about plan quality in such cases,
only that the plan not be outright wrong.  A more invasive fix in the
older branches would not be a good idea anyway from a plan-stability
standpoint.)
2014-10-01 19:31:12 -04:00
Heikki Linnakangas 5fa6c81a43 Remove num_xloginsert_locks GUC, replace with a #define
I left the GUC in place for the beta period, so that people could experiment
with different values. No-one's come up with any data that a different value
would be better under some circumstances, so rather than try to document to
users what the GUC, let's just hard-code the current value, 8.
2014-10-01 16:42:26 +03:00
Andres Freund a39e78b710 Block signals while computing the sleep time in postmaster's main loop.
DetermineSleepTime() was previously called without blocked
signals. That's not good, because it allows signal handlers to
interrupt its workings.

DetermineSleepTime() was added in 9.3 with the addition of background
workers (da07a1e856), where it only read from
BackgroundWorkerList.

Since 9.4, where dynamic background workers were added (7f7485a0cd),
the list is also manipulated in DetermineSleepTime(). That's bad
because the list now can be persistently corrupted if modified by both
a signal handler and DetermineSleepTime().

This was discovered during the investigation of hangs on buildfarm
member anole. It's unclear whether this bug is the source of these
hangs or not, but it's worth fixing either way. I have confirmed that
it can cause crashes.

It luckily looks like this only can cause problems when bgworkers are
actively used.

Discussion: 20140929193733.GB14400@awork2.anarazel.de

Backpatch to 9.3 where background workers were introduced.
2014-10-01 15:19:40 +02:00
Andres Freund 0ef3c29a4b Improve documentation about binary/textual output mode for output plugins.
Also improve related error message as it contributed to the confusion.

Discussion: CAB7nPqQrqFzjqCjxu4GZzTrD9kpj6HMn9G5aOOMwt1WZ8NfqeA@mail.gmail.com,
    CAB7nPqQXc_+g95zWnqaa=mVQ4d3BVRs6T41frcEYi2ocUrR3+A@mail.gmail.com

Per discussion between Michael Paquier, Robert Haas and Andres Freund

Backpatch to 9.4 where logical decoding was introduced.
2014-10-01 13:22:17 +02:00
Andres Freund ef8863844b Rename CACHE_LINE_SIZE to PG_CACHE_LINE_SIZE.
As noted in http://bugs.debian.org/763098 there is a conflict between
postgres' definition of CACHE_LINE_SIZE and the definition by various
*bsd platforms. It's debatable who has the right to define such a
name, but postgres' use was only introduced in 375d8526f2 (9.4), so
it seems like a good idea to rename it.

Discussion: 20140930195756.GC27407@msg.df7cb.de

Per complaint of Christoph Berg in the above email, although he's not
the original bug reporter.

Backpatch to 9.4 where the define was introduced.
2014-10-01 12:17:03 +02:00
Stephen Frost c8a026e4f1 Revert 95d737ff to add 'ignore_nulls'
Per discussion, revert the commit which added 'ignore_nulls' to
row_to_json.  This capability would be better added as an independent
function rather than being bolted on to row_to_json.  Additionally,
the implementation didn't address complex JSON objects, and so was
incomplete anyway.

Pointed out by Tom and discussed with Andrew and Robert.
2014-09-29 13:32:22 -04:00
Tom Lane def4c28cf9 Change JSONB's on-disk format for improved performance.
The original design used an array of offsets into the variable-length
portion of a JSONB container.  However, such an array is basically
uncompressible by simple compression techniques such as TOAST's LZ
compressor.  That's bad enough, but because the offset array is at the
front, it tended to trigger the give-up-after-1KB heuristic in the TOAST
code, so that the entire JSONB object was stored uncompressed; which was
the root cause of bug #11109 from Larry White.

To fix without losing the ability to extract a random array element in O(1)
time, change this scheme so that most of the JEntry array elements hold
lengths rather than offsets.  With data that's compressible at all, there
tend to be fewer distinct element lengths, so that there is scope for
compression of the JEntry array.  Every N'th entry is still an offset.
To determine the length or offset of any specific element, we might have
to examine up to N preceding JEntrys, but that's still O(1) so far as the
total container size is concerned.  Testing shows that this cost is
negligible compared to other costs of accessing a JSONB field, and that
the method does largely fix the incompressible-data problem.

While at it, rearrange the order of elements in a JSONB object so that
it's "all the keys, then all the values" not alternating keys and values.
This doesn't really make much difference right at the moment, but it will
allow providing a fast path for extracting individual object fields from
large JSONB values stored EXTERNAL (ie, uncompressed), analogously to the
existing optimization for substring extraction from large EXTERNAL text
values.

Bump catversion to denote the incompatibility in on-disk format.
We will need to fix pg_upgrade to disallow upgrading jsonb data stored
with 9.4 betas 1 and 2.

Heikki Linnakangas and Tom Lane
2014-09-29 12:29:21 -04:00
Stephen Frost ff27fcfa0a Fix relcache for policies, and doc updates
Andres pointed out that there was an extra ';' in equalPolicies, which
made me realize that my prior testing with CLOBBER_CACHE_ALWAYS was
insufficient (it didn't always catch the issue, just most of the time).
Thanks to that, a different issue was discovered, specifically in
equalRSDescs.  This change corrects eqaulRSDescs to return 'true' once
all policies have been confirmed logically identical.  After stepping
through both functions to ensure correct behavior, I ran this for
about 12 hours of CLOBBER_CACHE_ALWAYS runs of the regression tests
with no failures.

In addition, correct a few typos in the documentation which were pointed
out by Thom Brown (thanks!) and improve the policy documentation further
by adding a flushed out usage example based on a unix passwd file.

Lastly, clean up a few comments in the regression tests and pg_dump.h.
2014-09-26 12:46:26 -04:00
Peter Eisentraut d11339c099 Fix whitespace 2014-09-26 02:43:46 -04:00
Andres Freund b64d92f1a5 Add a basic atomic ops API abstracting away platform/architecture details.
Several upcoming performance/scalability improvements require atomic
operations. This new API avoids the need to splatter compiler and
architecture dependent code over all the locations employing atomic
ops.

For several of the potential usages it'd be problematic to maintain
both, a atomics using implementation and one using spinlocks or
similar. In all likelihood one of the implementations would not get
tested regularly under concurrency. To avoid that scenario the new API
provides a automatic fallback of atomic operations to spinlocks. All
properties of atomic operations are maintained. This fallback -
obviously - isn't as fast as just using atomic ops, but it's not bad
either. For one of the future users the atomics ontop spinlocks
implementation was actually slightly faster than the old purely
spinlock using implementation. That's important because it reduces the
fear of regressing older platforms when improving the scalability for
new ones.

The API, loosely modeled after the C11 atomics support, currently
provides 'atomic flags' and 32 bit unsigned integers. If the platform
efficiently supports atomic 64 bit unsigned integers those are also
provided.

To implement atomics support for a platform/architecture/compiler for
a type of atomics 32bit compare and exchange needs to be
implemented. If available and more efficient native support for flags,
32 bit atomic addition, and corresponding 64 bit operations may also
be provided. Additional useful atomic operations are implemented
generically ontop of these.

The implementation for various versions of gcc, msvc and sun studio have
been tested. Additional existing stub implementations for
* Intel icc
* HUPX acc
* IBM xlc
are included but have never been tested. These will likely require
fixes based on buildfarm and user feedback.

As atomic operations also require barriers for some operations the
existing barrier support has been moved into the atomics code.

Author: Andres Freund with contributions from Oskari Saarenmaa
Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera
Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com,
    20131015123303.GH5300@awork2.anarazel.de,
    20131028205522.GI20248@awork2.anarazel.de
2014-09-25 23:49:05 +02:00
Andrew Dunstan 9111d46351 Remove ill-conceived ban on zero length json object keys.
We removed a similar ban on this in json_object recently, but the ban in
datum_to_json was left, which generate4d sprutious errors in othee json
generators, notable json_build_object.

Along the way, add an assertion that datum_to_json is not passed a null
key. All current callers comply with this rule, but the assertion will
catch any possible future misbehaviour.
2014-09-25 15:08:42 -04:00
Robert Haas 5d7962c679 Change locking regimen around buffer replacement.
Previously, we used an lwlock that was held from the time we began
seeking a candidate buffer until the time when we found and pinned
one, which is disastrous for concurrency.  Instead, use a spinlock
which is held just long enough to pop the freelist or advance the
clock sweep hand, and then released.  If we need to advance the clock
sweep further, we reacquire the spinlock once per buffer.

This represents a significant increase in atomic operations around
buffer eviction, but it still wins on many workloads.  On others, it
may result in no gain, or even cause a regression, unless the number
of buffer mapping locks is also increased.  However, that seems like
material for a separate commit.  We may also need to consider other
methods of mitigating contention on this spinlock, such as splitting
it into multiple locks or jumping the clock sweep hand more than one
buffer at a time, but those, too, seem like separate improvements.

Patch by me, inspired by a much larger patch from Amit Kapila.
Reviewed by Andres Freund.
2014-09-25 10:43:24 -04:00
Andres Freund 56a312aac8 Fix VPATH builds of the replication parser from git for some !gcc compilers.
Some compilers don't automatically search the current directory for
included files. 9cc2c182fc fixed that for builds from tarballs by
adding an include to the source directory. But that doesn't work when
the scanner is generated in the VPATH directory. Use the same search
path as the other parsers in the tree.

One compiler that definitely was affected is solaris' sun cc.

Backpatch to 9.1 which introduced using an actual parser for
replication commands.
2014-09-25 15:22:26 +02:00
Andrew Dunstan ecacbdbcee Return NULL from json_object_agg if it gets no rows.
This makes it consistent with the docs and with all other builtin
aggregates apart from count().
2014-09-25 08:18:18 -04:00
Stephen Frost afd1d95f5b Copy-editing of row security
Address a few typos in the row security update, pointed out
off-list by Adam Brightwell.  Also include 'ALL' in the list
of commands supported, for completeness.
2014-09-24 17:45:11 -04:00
Stephen Frost 6550b901fe Code review for row security.
Buildfarm member tick identified an issue where the policies in the
relcache for a relation were were being replaced underneath a running
query, leading to segfaults while processing the policies to be added
to a query.  Similar to how TupleDesc RuleLocks are handled, add in a
equalRSDesc() function to check if the policies have actually changed
and, if not, swap back the rsdesc field (using the original instead of
the temporairly built one; the whole structure is swapped and then
specific fields swapped back).  This now passes a CLOBBER_CACHE_ALWAYS
for me and should resolve the buildfarm error.

In addition to addressing this, add a new chapter in Data Definition
under Privileges which explains row security and provides examples of
its usage, change \d to always list policies (even if row security is
disabled- but note that it is disabled, or enabled with no policies),
rework check_role_for_policy (it really didn't need the entire policy,
but it did need to be using has_privs_of_role()), and change the field
in pg_class to relrowsecurity from relhasrowsecurity, based on
Heikki's suggestion.  Also from Heikki, only issue SET ROW_SECURITY in
pg_restore when talking to a 9.5+ server, list Bypass RLS in \du, and
document --enable-row-security options for pg_dump and pg_restore.

Lastly, fix a number of minor whitespace and typo issues from Heikki,
Dimitri, add a missing #include, per Peter E, fix a few minor
variable-assigned-but-not-used and resource leak issues from Coverity
and add tab completion for role attribute bypassrls as well.
2014-09-24 16:32:22 -04:00
Tom Lane 3f6f9260e3 Fix bogus variable-mangling in security_barrier_replace_vars().
This function created new Vars with varno different from varnoold, which
is a condition that should never prevail before setrefs.c does the final
variable-renumbering pass.  The created Vars could not be seen as equal()
to normal Vars, which among other things broke equivalence-class processing
for them.  The consequences of this were indeed visible in the regression
tests, in the form of failure to propagate constants as one would expect.
I stumbled across it while poking at bug #11457 --- after intentionally
disabling join equivalence processing, the security-barrier regression
tests started falling over with fun errors like "could not find pathkey
item to sort", because of failure to match the corrupted Vars to normal
ones.
2014-09-24 15:59:34 -04:00
Tom Lane 3694b4d7e1 Fix incorrect search for "x?" style matches in creviterdissect().
When the number of allowed iterations is limited (either a "?" quantifier
or a bound expression), the last sub-match has to reach to the end of the
target string.  The previous coding here first tried the shortest possible
match (one character, usually) and then gave up and back-tracked if that
didn't work, typically leading to failure to match overall, as shown in
bug #11478 from Christoph Berg.  The minimum change to fix that would be to
not decrement k before "goto backtrack"; but that would be a pretty stupid
solution, because we'd laboriously try each possible sub-match length
before finally discovering that only ending at the end can work.  Instead,
force the sub-match endpoint limit up to the end for even the first
shortest() call if we cannot have any more sub-matches after this one.

Bug introduced in my rewrite that added the iterdissect logic, commit
173e29aa5d.  The shortest-first search code
was too closely modeled on the longest-first code, which hasn't got this
issue since it tries a match reaching to the end to start with anyway.
Back-patch to all affected branches.
2014-09-23 20:26:14 -04:00
Stephen Frost 43bed84c32 Log ALTER SYSTEM statements as DDL
Per discussion in bug #11350, log ALTER SYSTEM commands at the
log_statement=ddl level, rather than at the log_statement=all level.

Pointed out by Tomonari Katsumata.

Back-patch to 9.4 where ALTER SYSTEM was introduced.
2014-09-22 20:50:17 -04:00
Stephen Frost 6ef8c658af Process withCheckOption exprs in setrefs.c
While withCheckOption exprs had been handled in many cases by
happenstance, they need to be handled during set_plan_references and
more specifically down in set_plan_refs for ModifyTable plan nodes.
This is to ensure that the opfuncid's are set for operators referenced
in the withCheckOption exprs.

Identified as an issue by Thom Brown

Patch by Dean Rasheed

Back-patch to 9.4, where withCheckOption was introduced.
2014-09-22 20:12:51 -04:00
Andres Freund 6ba4ecbf47 Remove most volatile qualifiers from xlog.c
For the reason outlined in df4077cda2 also remove volatile qualifiers
from xlog.c. Some of these uses of volatile have been added after
noticing problems back when spinlocks didn't imply compiler
barriers. So they are a good test - in fact removing the volatiles
breaks when done without the barriers in spinlocks present.

Several uses of volatile remain where they are explicitly used to
access shared memory without locks. These locations are ok with
slightly out of date data, but removing the volatile might lead to the
variables never being reread from memory. These uses could also be
replaced by barriers, but that's a separate change of doubtful value.
2014-09-22 23:35:08 +02:00
Robert Haas df4077cda2 Remove volatile qualifiers from lwlock.c.
Now that spinlocks (hopefully!) act as compiler barriers, as of commit
0709b7ee72, this should be safe.  This
serves as a demonstration of the new coding style, and may be optimized
better on some machines as well.
2014-09-22 16:42:14 -04:00
Robert Haas e38da8d6b1 Fix compiler warning.
It is meaningless to declare a pass-by-value return type const.
2014-09-22 16:32:35 -04:00
Robert Haas 763ba1b0f2 Fix mishandling of CreateEventTrigStmt's eventname field.
It's a string, not a scalar.

Petr Jelinek
2014-09-22 16:05:51 -04:00
Andres Freund 0926ef43c1 Remove postgres --help blurb about the removed -A option.
I missed this in 3bdcf6a5a7.

Noticed by Merlin Moncure
Discussion: CAHyXU0yC7uPeeVzQROwtnrOP9dxTEUPYjB0og4qUnbipMEV57w@mail.gmail.com
2014-09-22 17:54:34 +02:00
Andres Freund 604f7956b9 Improve code around the recently added rm_identify rmgr callback.
There are four weaknesses in728f152e07f998d2cb4fe5f24ec8da2c3bda98f2:

* append_init() in heapdesc.c was ugly and required that rm_identify
  return values are only valid till the next call. Instead just add a
  couple more switch() cases for the INIT_PAGE cases. Now the returned
  value will always be valid.
* a couple rm_identify() callbacks missed masking xl_info with
  ~XLR_INFO_MASK.
* pg_xlogdump didn't map a NULL rm_identify to UNKNOWN or a similar
  string.
* append_init() was called when id=NULL - which should never actually
  happen. But it's better to be careful.
2014-09-22 17:49:34 +02:00
Robert Haas e246b3d6ea Add a fast pre-check for equality of equal-length strings.
Testing reveals that that doing a memcmp() before the strcoll() costs
practically nothing, at least on the systems we tested, and it speeds
up sorts containing many equal strings significatly.

Peter Geoghegan.  Review by myself and Heikki Linnakangas.  Comments
rewritten by me.
2014-09-19 12:39:00 -04:00
Stephen Frost 491c029dbc Row-Level Security Policies (RLS)
Building on the updatable security-barrier views work, add the
ability to define policies on tables to limit the set of rows
which are returned from a query and which are allowed to be added
to a table.  Expressions defined by the policy for filtering are
added to the security barrier quals of the query, while expressions
defined to check records being added to a table are added to the
with-check options of the query.

New top-level commands are CREATE/ALTER/DROP POLICY and are
controlled by the table owner.  Row Security is able to be enabled
and disabled by the owner on a per-table basis using
ALTER TABLE .. ENABLE/DISABLE ROW SECURITY.

Per discussion, ROW SECURITY is disabled on tables by default and
must be enabled for policies on the table to be used.  If no
policies exist on a table with ROW SECURITY enabled, a default-deny
policy is used and no records will be visible.

By default, row security is applied at all times except for the
table owner and the superuser.  A new GUC, row_security, is added
which can be set to ON, OFF, or FORCE.  When set to FORCE, row
security will be applied even for the table owner and superusers.
When set to OFF, row security will be disabled when allowed and an
error will be thrown if the user does not have rights to bypass row
security.

Per discussion, pg_dump sets row_security = OFF by default to ensure
that exports and backups will have all data in the table or will
error if there are insufficient privileges to bypass row security.
A new option has been added to pg_dump, --enable-row-security, to
ask pg_dump to export with row security enabled.

A new role capability, BYPASSRLS, which can only be set by the
superuser, is added to allow other users to be able to bypass row
security using row_security = OFF.

Many thanks to the various individuals who have helped with the
design, particularly Robert Haas for his feedback.

Authors include Craig Ringer, KaiGai Kohei, Adam Brightwell, Dean
Rasheed, with additional changes and rework by me.

Reviewers have included all of the above, Greg Smith,
Jeff McCormick, and Robert Haas.
2014-09-19 11:18:35 -04:00
Andres Freund 728f152e07 Add rmgr callback to name xlog record types for display purposes.
This is primarily useful for the upcoming pg_xlogdump --stats feature,
but also allows to remove some duplicated code in the rmgr_desc
routines.

Due to the separation and harmonization, the output of dipsplayed
records changes somewhat. But since this isn't enduser oriented
content that's ok.

It's potentially desirable to further change pg_xlogdump's display of
records. It previously wasn't possible to show the record type
separately from the description forcing it to be in the last
column. But that's better done in a separate commit.

Author: Abhijit Menon-Sen, slightly editorialized by me
Reviewed-By: Álvaro Herrera, Andres Freund, and Heikki Linnakangas
Discussion: 20140604104716.GA3989@toroid.org
2014-09-19 16:20:29 +02:00
Heikki Linnakangas 2df465e696 Fix pointer type in size passed to memset.
Pointers are all the same size, so it makes no practical difference, but
let's be tidy.

Found by Coverity, noted off-list by Tom Lane.
2014-09-14 16:48:57 +03:00
Tom Lane fe550b2ac2 Invent PGC_SU_BACKEND and mark log_connections/log_disconnections that way.
This new GUC context option allows GUC parameters to have the combined
properties of PGC_BACKEND and PGC_SUSET, ie, they don't change after
session start and non-superusers can't change them.  This is a more
appropriate choice for log_connections and log_disconnections than their
previous context of PGC_BACKEND, because we don't want non-superusers
to be able to affect whether their sessions get logged.

Note: the behavior for log_connections is still a bit odd, in that when
a superuser attempts to set it from PGOPTIONS, the setting takes effect
but it's too late to enable or suppress connection startup logging.
It's debatable whether that's worth fixing, and in any case there is
a reasonable argument for PGC_SU_BACKEND to exist.

In passing, re-pgindent the files touched by this commit.

Fujii Masao, reviewed by Joe Conway and Amit Kapila
2014-09-13 21:01:57 -04:00
Bruce Momjian 95c38a9895 Revert f68dc5d86b
Renaming will have to be more comprehensive, so I need approval.
2014-09-12 20:42:19 -04:00
Bruce Momjian f68dc5d86b More formatting.c variable renaming, for clarity 2014-09-12 20:35:07 -04:00
Robert Haas 8cce08f168 Change NTUP_PER_BUCKET to 1 to improve hash join lookup speed.
Since this makes the bucket headers use ~10x as much memory, properly
account for that memory when we figure out whether everything fits
in work_mem.  This might result in some cases that previously used
only a single batch getting split into multiple batches, but it's
unclear as yet whether we need defenses against that case, and if so,
what the shape of those defenses should be.

It's worth noting that even in these edge cases, users should still be
no worse off than they would have been last week, because commit
45f6240a8f saved a big pile of memory
on exactly the same workloads.

Tomas Vondra, reviewed and somewhat revised by me.
2014-09-12 16:18:09 -04:00
Fujii Masao 4ad2a54805 Add GUC to enable logging of replication commands.
Previously replication commands like IDENTIFY_COMMAND were not logged
even when log_statements is set to all. Some users who want to audit
all types of statements were not satisfied with this situation. To
address the problem, this commit adds new GUC log_replication_commands.
If it's enabled, all replication commands are logged in the server log.

There are many ways to allow us to enable that logging. For example,
we can extend log_statement so that replication commands are logged
when it's set to all. But per discussion in the community, we reached
the consensus to add separate GUC for that.

Reviewed by Ian Barwick, Robert Haas and Heikki Linnakangas.
2014-09-13 02:55:45 +09:00
Heikki Linnakangas 774a78ffe4 Fix GIN data page split ratio calculation.
The code that tried to split a page at 75/25 ratio, when appending to the
end of an index, was buggy in two ways. First, there was a silly typo that
caused it to just fill the left page as full as possible. But the logic as
it was intended wasn't correct either, and would actually have given a ratio
closer to 60/40 than 75/25.

Gaetano Mendola spotted the typo. Backpatch to 9.4, where this code was added.
2014-09-12 11:27:56 +03:00
Tom Lane 1d352325b8 Fix power_var_int() for large integer exponents.
The code for raising a NUMERIC value to an integer power wasn't very
careful about large powers.  It got an outright wrong answer for an
exponent of INT_MIN, due to failure to consider overflow of the Abs(exp)
operation; which is fixable by using an unsigned rather than signed
exponent value after that point.  Also, even though the number of
iterations of the power-computation loop is pretty limited, it's easy for
the repeated squarings to result in ridiculously enormous intermediate
values, which can take unreasonable amounts of time/memory to process,
or even overflow the internal "weight" field and so produce a wrong answer.
We can forestall misbehaviors of that sort by bailing out as soon as the
weight value exceeds what will fit in int16, since then the final answer
must overflow (if exp > 0) or underflow (if exp < 0) the packed numeric
format.

Per off-list report from Pavel Stehule.  Back-patch to all supported
branches.
2014-09-11 23:30:51 -04:00
Stephen Frost 95d737ff45 Add 'ignore_nulls' option to row_to_json
Provide an option to skip NULL values in a row when generating a JSON
object from that row with row_to_json.  This can reduce the size of the
JSON object in cases where columns are NULL without really reducing the
information in the JSON object.

This also makes row_to_json into a single function with default values,
rather than having multiple functions.  In passing, change array_to_json
to also be a single function with default values (we don't add an
'ignore_nulls' option yet- it's not clear that there is a sensible
use-case there, and it hasn't been asked for in any case).

Pavel Stehule
2014-09-11 21:23:51 -04:00
Heikki Linnakangas aae7af3df8 Remove dead InRecovery check.
With the new B-tree incomplete split handling in 9.4, _bt_insert_parent is
never called in recovery.
2014-09-11 22:43:56 +03:00
Bruce Momjian 849462a9fa improve hash creation warning message
This improves the wording of commit 84aa8ba128.

Report by Kevin Grittner
2014-09-11 13:40:06 -04:00
Robert Haas 68e66923ff Add missing volatile qualifier.
Yet another silly mistake in 0709b7ee72,
again found by buildfarm member castoroides.
2014-09-11 09:07:32 -04:00
Heikki Linnakangas 0ed41529f6 Silence compiler warning on Windows.
David Rowley.
2014-09-11 13:50:14 +03:00
Bruce Momjian 36ad1a87a3 Implement mxid_age() to compute multi-xid age
Report by Josh Berkus
2014-09-10 17:13:04 -04:00
Bruce Momjian 84aa8ba128 Issue a warning during the creation of hash indexes 2014-09-10 16:54:47 -04:00
Heikki Linnakangas 45f6240a8f Pack tuples in a hash join batch densely, to save memory.
Instead of palloc'ing each HashJoinTuple individually, allocate 32kB chunks
and pack the tuples densely in the chunks. This avoids the AllocChunk
header overhead, and the space wasted by standard allocator's habit of
rounding sizes up to the nearest power of two.

This doesn't contain any planner changes, because the planner's estimate of
memory usage ignores the palloc overhead. Now that the overhead is smaller,
the planner's estimates are in fact more accurate.

Tomas Vondra, reviewed by Robert Haas.
2014-09-10 21:24:52 +03:00
Tom Lane 1b4cc493d2 Preserve AND/OR flatness while extracting restriction OR clauses.
The code I added in commit f343a880d5 was
careless about preserving AND/OR flatness: it could create a structure with
an OR node directly underneath another one.  That breaks an assumption
that's fairly important for planning efficiency, not to mention triggering
various Asserts (as reported by Benjamin Smith).  Add a trifle more logic
to handle the case properly.
2014-09-09 18:35:31 -04:00
Robert Haas 0709b7ee72 Change the spinlock primitives to function as compiler barriers.
Previously, they functioned as barriers against CPU reordering but not
compiler reordering, an odd API that required extensive use of volatile
everywhere that spinlocks are used.  That's error-prone and has negative
implications for performance, so change it.

In theory, this makes it safe to remove many of the uses of volatile
that we currently have in our code base, but we may find that there are
some bugs in this effort when we do.  In the long run, though, this
should make for much more maintainable code.

Patch by me.  Review by Andres Freund.
2014-09-09 17:48:50 -04:00
Tom Lane e80252d424 Add width_bucket(anyelement, anyarray).
This provides a convenient method of classifying input values into buckets
that are not necessarily equal-width.  It works on any sortable data type.

The choice of function name is a bit debatable, perhaps, but showing that
there's a relationship to the SQL standard's width_bucket() function seems
more attractive than the other proposals.

Petr Jelinek, reviewed by Pavel Stehule
2014-09-09 15:34:14 -04:00
Peter Eisentraut 57b1085df5 Allow empty content in xml type
The xml type previously rejected "content" that is empty or consists
only of spaces.  But the SQL/XML standard allows that, so change that.
The accepted values for XML "documents" are not changed.

Reviewed-by: Ali Akbar <the.apaan@gmail.com>
2014-09-09 11:34:52 -04:00
Stephen Frost f0051c1a14 Move ALTER ... ALL IN to ProcessUtilitySlow
Now that ALTER TABLE .. ALL IN TABLESPACE has replaced the previous
ALTER TABLESPACE approach, it makes sense to move the calls down in
to ProcessUtilitySlow where the rest of ALTER TABLE is handled.

This also means that event triggers will support ALTER TABLE .. ALL
(which was the impetus for the original change, though it has other
good qualities also).

Álvaro Herrera

Back-patch to 9.4 as the original rework was.
2014-09-09 10:58:48 -04:00
Andres Freund 07968dbfaa Fix spinlock implementation for some !solaris sparc platforms.
Some Sparc CPUs can be run in various coherence models, ranging from
RMO (relaxed) over PSO (partial) to TSO (total). Solaris has always
run CPUs in TSO mode while in userland, but linux didn't use to and
the various *BSDs still don't. Unfortunately the sparc TAS/S_UNLOCK
were only correct under TSO. Fix that by adding the necessary memory
barrier instructions. On sparcv8+, which should be all relevant CPUs,
these are treated as NOPs if the current consistency model doesn't
require the barriers.

Discussion: 20140630222854.GW26930@awork2.anarazel.de

Will be backpatched to all released branches once a few buildfarm
cycles haven't shown up problems. As I've no access to sparc, this is
blindly written.
2014-09-09 00:47:32 +02:00
Bruce Momjian a9c22d1480 Rename C variables in formatting.c, for clarity
Also add C comments.  This should help future debugging of this
notorious file.
2014-09-05 09:52:31 -04:00
Peter Eisentraut 303f4d1012 Assorted message fixes and improvements 2014-09-05 01:25:27 -04:00
Fujii Masao a73c9dbab0 Fix segmentation fault that an empty prepared statement could cause.
Back-patch to all supported branches.

Per bug #11335 from Haruka Takatsuka
2014-09-05 02:17:57 +09:00
Bruce Momjian 4011f8c98b Issue clearer notice when inherited merged columns are moved
CREATE TABLE INHERIT moves user-specified columns to the location of the
inherited column.

Report by Fatal Majid
2014-09-03 11:54:31 -04:00
Heikki Linnakangas f8f4227976 Refactor per-page logic common to all redo routines to a new function.
Every redo routine uses the same idiom to determine what to do to a page:
check if there's a backup block for it, and if not read, the buffer if the
block exists, and check its LSN. Refactor that into a common function,
XLogReadBufferForRedo, making all the redo routines shorter and more
readable.

This has no user-visible effect, and makes no changes to the WAL format.

Reviewed by Andres Freund, Alvaro Herrera, Michael Paquier.
2014-09-02 15:10:28 +03:00
Fujii Masao bd3b7a9eef Support ALTER SYSTEM RESET command.
This patch allows us to execute ALTER SYSTEM RESET command to
remove the configuration entry from postgresql.auto.conf.

Vik Fearing, reviewed by Amit Kapila and me.
2014-09-02 16:06:58 +09:00
Tom Lane 01b6976c13 Fix unportable use of isspace().
Introduced in commit 11a020eb6.
2014-09-01 18:37:45 -04:00
Andres Freund 5a64cb740d Fix s/pluggins/plugins/ typo in two comments.
Michael Paquier
2014-09-01 12:01:29 +02:00
Andres Freund 9c4b55db1d Declare lwlock.c's LWLockAcquireCommon() as a static inline.
68a2e52bba has introduced LWLockAcquireCommon() containing the
previous contents of LWLockAcquire() plus added functionality. The
latter then calls it, just like LWLockAcquireWithVar(). Because the
majority of callers don't need the added functionality, declare the
common code as inline. The compiler then can optimize away the unused
code. Doing so is also useful when looking at profiles, to
differentiate the users.

Backpatch to 9.4, the first branch to contain LWLockAcquireCommon().
2014-09-01 00:16:55 +02:00
Andres Freund 5c1faa7ba7 Protect definition of SpinlockSemaArray, just like its declaration.
Found via clang's -Wmissing-variable-declarations.
2014-09-01 00:03:53 +02:00
Andres Freund 8fff977e29 Declare two variables in snapbuild.c as static.
Neither is accessed externally, I just seem to have missed the static
when writing the code.
2014-08-31 23:53:12 +02:00
Andres Freund 4b4b680c3d Make backend local tracking of buffer pins memory efficient.
Since the dawn of time (aka Postgres95) multiple pins of the same
buffer by one backend have been optimized not to modify the shared
refcount more than once. This optimization has always used a NBuffer
sized array in each backend keeping track of a backend's pins.

That array (PrivateRefCount) was one of the biggest per-backend memory
allocations, depending on the shared_buffers setting. Besides the
waste of memory it also has proven to be a performance bottleneck when
assertions are enabled as we make sure that there's no remaining pins
left at the end of transactions. Also, on servers with lots of memory
and a correspondingly high shared_buffers setting the amount of random
memory accesses can also lead to poor cpu cache efficiency.

Because of these reasons a backend's buffers pins are now kept track
of in a small statically sized array that overflows into a hash table
when necessary. Benchmarks have shown neutral to positive performance
results with considerably lower memory usage.

Patch by me, review by Robert Haas.

Discussion: 20140321182231.GA17111@alap3.anarazel.de
2014-08-30 14:03:21 +02:00
Bruce Momjian 01363beae5 pg_is_xlog_replay_paused(): remove super-user-only restriction
Also update docs to mention which function are super-user-only.

Report by sys-milan@statpro.com

Backpatch through 9.4
2014-08-29 09:06:05 -04:00
Heikki Linnakangas 88231ec578 Fix bug in compressed GIN data leaf page splitting code.
The list of posting lists it's dealing with can contain placeholders for
deleted posting lists. The placeholders are kept around so that they can
be WAL-logged, but we must be careful to not try to access them.

This fixes bug #11280, reported by Mårten Svantesson. Backpatch to 9.4,
where the compressed data leaf page code was added.
2014-08-29 14:22:25 +03:00
Peter Eisentraut 65c9dc231a Assorted message improvements 2014-08-29 00:26:17 -04:00
Tom Lane 6c40f8316e Add min and max aggregates for inet/cidr data types.
Haribabu Kommi, reviewed by Muhammad Asif Naeem
2014-08-28 22:37:58 -04:00
Fujii Masao 9df492664a Revert "Allow units to be specified in relation option setting value."
This reverts commit e23014f3d4.

As the side effect of the reverted commit, when the unit is
specified, the reloption was stored in the catalog with the unit.
This broke pg_dump (specifically, it prevented pg_dump from
outputting restorable backup regarding the reloption) and
turned the buildfarm red. Revert the commit until the fixed
version is ready.
2014-08-29 05:10:47 +09:00
Andres Freund 11a020eb6e Allow escaping of option values for options passed at connection start.
This is useful to allow to set GUCs to values that include spaces;
something that wasn't previously possible. The primary case motivating
this is the desire to set default_transaction_isolation to 'repeatable
read' on a per connection basis, but other usecases like seach_path do
also exist.

This introduces a slight backward incompatibility: Previously a \ in
an option value would have been passed on literally, now it'll be
taken as an escape.

The relevant mailing list discussion starts with
20140204125823.GJ12016@awork2.anarazel.de.
2014-08-28 13:59:29 +02:00
Fujii Masao e23014f3d4 Allow units to be specified in relation option setting value.
This introduces an infrastructure which allows us to specify the units
like ms (milliseconds) in integer relation option, like GUC parameter.
Currently only autovacuum_vacuum_cost_delay reloption can accept
the units.

Reviewed by Michael Paquier
2014-08-28 15:55:50 +09:00
Jeff Davis 8167a3883a Allow multibyte characters as escape in SIMILAR TO and SUBSTRING.
Previously, only a single-byte character was allowed as an
escape. This patch allows it to be a multi-byte character, though it
still must be a single character.

Reviewed by Heikki Linnakangas and Tom Lane.
2014-08-27 21:07:36 -07:00
Alvaro Herrera 1c9701cfe5 Fix FOR UPDATE NOWAIT on updated tuple chains
If SELECT FOR UPDATE NOWAIT tries to lock a tuple that is concurrently
being updated, it might fail to honor its NOWAIT specification and block
instead of raising an error.

Fix by adding a no-wait flag to EvalPlanQualFetch which it can pass down
to heap_lock_tuple; also use it in EvalPlanQualFetch itself to avoid
blocking while waiting for a concurrent transaction.

Authors: Craig Ringer and Thomas Munro, tweaked by Álvaro
http://www.postgresql.org/message-id/51FB6703.9090801@2ndquadrant.com

Per Thomas Munro in the course of his SKIP LOCKED feature submission,
who also provided one of the isolation test specs.

Backpatch to 9.4, because that's as far back as it applies without
conflicts (although the bug goes all the way back).  To that branch also
backpatch Thomas Munro's new NOWAIT test cases, committed in master by
Heikki as commit 9ee16b49f0 .
2014-08-27 19:15:18 -04:00
Stephen Frost e414ba93ad Fix Var handling for security barrier views
In some cases, not all Vars were being correctly marked as having been
modified for updatable security barrier views, which resulted in invalid
plans (eg: when security barrier views were created over top of
inheiritance structures).

In passing, be sure to update both varattno and varonattno, as _equalVar
won't consider the Vars identical otherwise.  This isn't known to cause
any issues with updatable security barrier views, but was noticed as
missing while working on RLS and makes sense to get fixed.

Back-patch to 9.4 where updatable security barrier views were
introduced.
2014-08-26 23:08:41 -04:00
Robert Haas 9522ec3e70 Fix typo in b34e37bfef.
Spotted by Peter Geoghegan.
2014-08-26 15:58:50 -04:00
Kevin Grittner a9d0f1cff3 Fix superuser concurrent refresh of matview owned by another.
Use SECURITY_LOCAL_USERID_CHANGE while building temporary tables;
only escalate to SECURITY_RESTRICTED_OPERATION while potentially
running user-supplied code.  The more secure mode was preventing
temp table creation.  Add regression tests to cover this problem.

This fixes Bug #11208 reported by Bruno Emanuel de Andrade Silva.

Backpatch to 9.4, where the bug was introduced.
2014-08-26 09:56:26 -05:00
Heikki Linnakangas 0076f264b6 Implement IF NOT EXISTS for CREATE SEQUENCE.
Fabrízio de Royes Mello
2014-08-26 16:18:17 +03:00
Bruce Momjian a7ae1dcf49 pg_upgrade: prevent automatic oid assignment
Prevent automatic oid assignment when in binary upgrade mode.  Also
throw an error when contrib/pg_upgrade_support functions are called when
not in binary upgrade mode.

This prevent automatically-assigned oids from conflicting with later
pre-assigned oids coming from the old cluster.  It also makes sure oids
are preserved in call important cases.
2014-08-25 22:19:05 -04:00
Bruce Momjian 73fe87503f rename macro isTempOrToastNamespace to isTempOrTempToastNamespace
Done for clarity
2014-08-25 21:28:19 -04:00
Bruce Momjian 6cb74a67e2 revert "Throw error for ALTER TABLE RESET of an invalid option"
Reverts commits 73d78e11a0 and
b0488e5c4f.  Also reverts pg_upgrade
changes.
2014-08-25 20:07:37 -04:00
Bruce Momjian 73d78e11a0 Throw error for ALTER TABLE RESET of an invalid option
Also adjust pg_upgrade to not use this method for optional TOAST table
creation.

Patch by Fabrízio de Royes Mello
2014-08-25 17:06:40 -04:00
Alvaro Herrera 3adba73662 Revert XactLockTableWait context setup in conditional multixact wait
There's no point in setting up a context error callback when doing
conditional lock acquisition, because we never actually wait and so the
user wouldn't be able to see the context message anywhere.  In fact,
this is more in line with what ConditionalXactLockTableWait is doing.

Backpatch to 9.4, where this was added.
2014-08-25 15:33:17 -04:00
Alvaro Herrera 6f822952ee Use newly added InvalidCommandId instead of 0
The symbol was added by 71901ab6d; the original code was introduced by
6868ed749.  Development of both overlapped which is why we apparently
failed to notice.

This is a (very slight) behavior change, so I'm not backpatching this to
9.4 for now, even though the symbol does exist there.
2014-08-25 15:32:30 -04:00
Alvaro Herrera 832a12f65e DefineType: return base type OID, not its array
Event triggers want to know the OID of the interesting object created,
which is the main type.  The array created as part of the operation is
just a subsidiary object which is not of much interest.
2014-08-25 15:32:26 -04:00
Alvaro Herrera 301fcf33eb Have CREATE TABLE AS and REFRESH return an OID
Other DDL commands are already returning the OID, which is required for
future additional event trigger work.  This is merely making these
commands in line with the rest of utility command support.
2014-08-25 15:32:18 -04:00
Alvaro Herrera d6d6020f1c More psprintf goodness 2014-08-25 15:32:15 -04:00
Alvaro Herrera ac41769fd9 Oops, forgot to "git add" one last change 2014-08-25 15:32:06 -04:00
Alvaro Herrera 7636c0c821 Editorial review of SET UNLOGGED
Add a succint comment explaining why it's correct to change the
persistence in this way.  Also s/loggedness/persistence/ because native
speakers didn't like the latter term.

Fabrízio and Álvaro
2014-08-25 13:50:19 -04:00
Tom Lane 73eba19aeb Fix another ancient memory-leak bug in relcache.c.
CheckConstraintFetch() leaked a cstring in the caller's context for each
CHECK constraint expression it copied into the relcache.  Ordinarily that
isn't problematic, but it can be during CLOBBER_CACHE testing because so
many reloads can happen during a single query; so complicate the code
slightly to allow freeing the cstring after use.  Per testing on buildfarm
member barnacle.

This is exactly like the leak fixed in AttrDefaultFetch() by commit
078b2ed291.  (Yes, this time I did look for
other instances of the same coding pattern :-(.)  Like that patch, no
back-patch, since it seems unlikely that there's any problem except under
very artificial test conditions.

BTW, it strikes me that both of these places would require further work
comparable to commit ab8c84db2f, if we ever
supported defaults or check constraints on system catalogs: they both
assume they are copying into an empty relcache data structure, and that
conceivably wouldn't be the case during recursive reloading of a system
catalog.  This does not seem worth worrying about for the moment, since
there is no near-term prospect of supporting any such thing.  So I'll
just note the possibility for the archives' sake.
2014-08-24 11:56:52 -04:00
Alvaro Herrera f41872d0c1 Implement ALTER TABLE .. SET LOGGED / UNLOGGED
This enables changing permanent (logged) tables to unlogged and
vice-versa.

(Docs for ALTER TABLE / SET TABLESPACE got shuffled in an order that
hopefully makes more sense than the original.)

Author: Fabrízio de Royes Mello
Reviewed by: Christoph Berg, Andres Freund, Thom Brown
Some tweaking by Álvaro Herrera
2014-08-22 14:27:00 -04:00
Alvaro Herrera 01d15a2677 Fix outdated comment 2014-08-22 14:03:11 -04:00
Tom Lane 41dd50e84d Fix corner-case behaviors in JSON/JSONB field extraction operators.
Cause the path extraction operators to return their lefthand input,
not NULL, if the path array has no elements.  This seems more consistent
since the case ought to correspond to applying the simple extraction
operator (->) zero times.

Cause other corner cases in field/element/path extraction to return NULL
rather than failing.  This behavior is arguably more useful than throwing
an error, since it allows an expression index using these operators to be
built even when not all values in the column are suitable for the
extraction being indexed.  Moreover, we already had multiple
inconsistencies between the path extraction operators and the simple
extraction operators, as well as inconsistencies between the JSON and
JSONB code paths.  Adopt a uniform rule of returning NULL rather than
throwing an error when the JSON input does not have a structure that
permits the request to be satisfied.

Back-patch to 9.4.  Update the release notes to list this as a behavior
change since 9.3.
2014-08-22 13:17:58 -04:00
Stephen Frost 3c4cf08087 Rework 'MOVE ALL' to 'ALTER .. ALL IN TABLESPACE'
As 'ALTER TABLESPACE .. MOVE ALL' really didn't change the tablespace
but instead changed objects inside tablespaces, it made sense to
rework the syntax and supporting functions to operate under the
'ALTER (TABLE|INDEX|MATERIALIZED VIEW)' syntax and to be in
tablecmds.c.

Pointed out by Alvaro, who also suggested the new syntax.

Back-patch to 9.4.
2014-08-21 19:06:17 -04:00
Tom Lane 9bac66020d Fix core dump in jsonb #> operator, and add regression test cases.
jsonb's #> operator segfaulted (dereferencing a null pointer) if the RHS
was a zero-length array, as reported in bug #11207 from Justin Van Winkle.
json's #> operator returns NULL in such cases, so for the moment let's
make jsonb act likewise.

Also add a bunch of regression test queries memorializing the -> and #>
operators' behavior for this and other corner cases.

There is a good argument for changing some of these behaviors, as they
are not very consistent with each other, and throwing an error isn't
necessarily a desirable behavior for operators that are likely to be
used in indexes.  However, everybody can agree that a core dump is the
Wrong Thing, and we need test cases even if we decide to change their
expected output later.
2014-08-20 16:48:53 -04:00
Heikki Linnakangas 02587dcddc Use comma+space as the separator in the default search_path.
While the space is optional, it seems nicer to be consistent with what
you get if you do "SET search_path=...". SET always normalizes the
separator to be comma+space.

Christoph Martin
2014-08-20 12:06:08 +03:00
Fujii Masao c476288653 Revert "Fix bug in checking of IDENTIFY_SYSTEM result."
This reverts commit 083d29c65b.

The commit changed the code so that it causes an errors when
IDENTIFY_SYSTEM returns three columns. But which prevents us
from using the replication-related utilities against the server
with older version. This is not what we want. For that
compatibility, we allow the utilities to receive three columns
as the result of IDENTIFY_SYSTEM eventhough it actually returns
four columns in 9.4 or later.

Pointed out by Andres Freund.
2014-08-19 18:30:38 +09:00
Fujii Masao 083d29c65b Fix bug in checking of IDENTIFY_SYSTEM result.
5a991ef869 added new column into
the result of IDENTIFY_SYSTEM command. But it was not reflected into
several codes checking that result. Specifically though the number of
columns in the result was increased to 4, it was still compared with 3
in some replication codes.

Back-patch to 9.4 where the number of columns in IDENTIFY_SYSTEM
result was increased.

Report from Michael Paquier
2014-08-19 17:26:07 +09:00
Noah Misch ee9569e4df Finish adding file version information to installed Windows binaries.
In support of this, have the MSVC build follow GNU make in preferring
GNUmakefile over Makefile when a directory contains both.

Michael Paquier, reviewed by MauMau.
2014-08-18 22:59:53 -04:00
Noah Misch fb2aece8ae Replace a few strncmp() calls with strlcpy().
strncmp() is a specialized API unsuited for routine copying into
fixed-size buffers.  On a system where the length of a single filename
can exceed MAXPGPATH, the pg_archivecleanup change prevents a simple
crash in the subsequent strlen().  Few filesystems support names that
long, and calling pg_archivecleanup with untrusted input is still not a
credible use case.  Therefore, no back-patch.

David Rowley
2014-08-18 22:59:31 -04:00
Heikki Linnakangas 48d50840d5 Reorganize functions in be-secure-openssl.c
Move the functions within the file so that public interface functions come
first, followed by internal functions. Previously, be_tls_write was first,
then internal stuff, and finally the rest of the public interface, which
clearly didn't make much sense.

Per Andres Freund's complaint.
2014-08-18 13:12:40 +03:00
Tom Lane e56ec50c16 Use ISO 8601 format for dates converted to JSON, too.
Commit f30015b6d7 made this happen for
timestamp and timestamptz, but it seems pretty inconsistent to not
do it for simple dates as well.

(In passing, I re-pgindent'd json.c.)
2014-08-17 22:57:46 -04:00
Tom Lane 737cdc2d14 Fix bogus return macros in range_overright_internal().
PG_RETURN_BOOL() should only be used in functions following the V1 SQL
function API.  This coding accidentally fails to fail since letting the
compiler coerce the Datum representation of bool back to plain bool
does give the right answer; but that doesn't make it a good idea.

Back-patch to older branches just to avoid unnecessary code divergence.
2014-08-16 13:48:39 -04:00
Robert Haas b34e37bfef Add sortsupport routines for text.
This provides a small but worthwhile speedup when sorting text, at least
in cases to which the sortsupport machinery applies.

Robert Haas and Peter Geoghegan
2014-08-14 12:09:52 -04:00
Peter Eisentraut 5333c72c95 Fix whitespace 2014-08-13 23:15:26 -04:00
Tom Lane a844c29966 Prevent memory leaks in parseRelOptions().
parseRelOptions() tended to leak memory in the caller's context.  Most
of the time this doesn't really matter since the caller's context is
at most query-lifespan, and the function won't be invoked very many times.
However, when testing with CLOBBER_CACHE_RECURSIVELY, the same relcache
entry can get rebuilt a *lot* of times in one query, leading to significant
intraquery memory bloat if it has any reloptions.  Noted while
investigating a related report from Tomas Vondra.

In passing, get rid of some Asserts that are redundant with the one
done by deconstruct_array().

As with other patches to avoid leaks in CLOBBER_CACHE testing, it doesn't
really seem worth back-patching this.
2014-08-13 11:35:51 -04:00
Tom Lane ab8c84db2f Prevent memory leaks in RelationGetIndexList, RelationGetIndexAttrBitmap.
When replacing rd_indexlist, rd_indexattr, etc, we neglected to pfree any
old value of these fields.  Under ordinary circumstances, the old value
would always be NULL, so this seemed reasonable enough.  However, in cases
where we're rebuilding a system catalog's relcache entry and another cache
flush occurs on that same catalog meanwhile, it's possible for the field to
not be NULL when we return to the outer level, because we already refilled
it while recovering from the inner flush.  This leads to a fairly small
session-lifespan leak in CacheMemoryContext.  In real-world usage the leak
would be too small to notice; but in testing with CLOBBER_CACHE_RECURSIVELY
the leakage can add up to the point of causing OOM failures, as reported by
Tomas Vondra.

The issue has been there a long time, but it only seems worth fixing in
HEAD, like the previous fix in this area (commit 078b2ed291).
2014-08-13 11:27:28 -04:00
Andres Freund 41d5f8ad73 Be less aggressive in asking for feedback of logical walsender clients.
When doing logical decoding using START_LOGICAL_REPLICATION in a
walsender process the walsender sometimes was sending out keepalive
messages too frequently. Asking for feedback every time.

WalSndWaitForWal() sends out keepalive messages when it's waiting for
new WAL to be generated locally when it sees that the remote side
hasn't yet flushed WAL up to the local position. That generally is
good but causes problems if the remote side only writes but doesn't
flush changes yet. So check for both remote write and flush position.

Additionally we've asked for feedback to the keepalive message which
isn't warranted when waiting for WAL in contrast to preventing
timeouts because of wal_sender_timeout.

Complaint and patch by Steve Singer.
2014-08-12 11:04:50 +02:00
Fujii Masao 3e3f65973a Change first call of ProcessConfigFile so as to process only data_directory.
When both postgresql.conf and postgresql.auto.conf have their own entry of
the same parameter, PostgreSQL uses the entry in postgresql.auto.conf because
it appears last in the configuration scan. IOW, the other entries which appear
earlier are ignored. But, previously, ProcessConfigFile() detected the invalid
settings of even those unused entries and emitted the error messages
complaining about them, at postmaster startup. Complaining about the entries
to ignore is basically useless.

This problem happened because ProcessConfigFile() was called twice at
postmaster startup and the first call read only postgresql.conf. That is, the
first call could check the entry which might be ignored eventually by
the second call which read both postgresql.conf and postgresql.auto.conf.
To work around the problem, this commit changes ProcessConfigFile so that
its first call processes only data_directory and the second one does all the
entries. It's OK to process data_directory in the first call because it's
ensured that data_directory doesn't exist in postgresql.auto.conf.

Back-patch to 9.4 where postgresql.auto.conf was added.

Patch by me. Review by Amit Kapila
2014-08-12 16:50:09 +09:00
Heikki Linnakangas 680513ab79 Break out OpenSSL-specific code to separate files.
This refactoring is in preparation for adding support for other SSL
implementations, with no user-visible effects. There are now two #defines,
USE_OPENSSL which is defined when building with OpenSSL, and USE_SSL which
is defined when building with any SSL implementation. Currently, OpenSSL is
the only implementation so the two #defines go together, but USE_SSL is
supposed to be used for implementation-independent code.

The libpq SSL code is changed to use a custom BIO, which does all the raw
I/O, like we've been doing in the backend for a long time. That makes it
possible to use MSG_NOSIGNAL to block SIGPIPE when using SSL, which avoids
a couple of syscall for each send(). Probably doesn't make much performance
difference in practice - the SSL encryption is expensive enough to mask the
effect - but it was a natural result of this refactoring.

Based on a patch by Martijn van Oosterhout from 2006. Briefly reviewed by
Alvaro Herrera, Andreas Karlsson, Jeff Janes.
2014-08-11 11:54:19 +03:00
Tom Lane 92f57c9ae9 Clean up handling of unknown-type inputs in json_build_object and friends.
There's actually no need for any special case for unknown-type literals,
since we only need to push the value through its output function and
unknownout() works fine.  The code that was here was completely bizarre
anyway, and would fail outright in cases that should work, not to mention
suffering from some copy-and-paste bugs.
2014-08-09 17:31:13 -04:00
Tom Lane 495cadda5e Further cleanup of JSON-specific error messages.
Fix an obvious typo in json_build_object()'s complaint about invalid
number of arguments, and make the errhint a bit more sensible too.

Per discussion about how to word the improved hint, change the few places
in the documentation that refer to JSON object field names as "names" to
say "keys" instead, since that's what we've said in the vast majority of
places in the docs.  Arguably "name" is more correct, since that's the
terminology used in RFC 7159; but we're stuck with "key" in view of the
naming of json_object_keys() so let's at least be self-consistent.

I adjusted a few code comments to match this as well, and failed to
resist the temptation to clean up some odd whitespace choices in the
same area, as well as a useless duplicate PG_ARGISNULL() check.  There's
still quite a bit of code that uses the phrase "field name" in non-user-
visible ways, so I left those usages alone.
2014-08-09 16:35:29 -04:00
Tom Lane 9da8675373 Reject duplicate column names in foreign key referenced-columns lists.
Such cases are disallowed by the SQL spec, and even if we wanted to allow
them, the semantics seem ambiguous: how should the FK columns be matched up
with the columns of a unique index?  (The matching could be significant in
the presence of opclasses with different notions of equality, so this issue
isn't just academic.)  However, our code did not previously reject such
cases, but instead would either fail to match to any unique index, or
generate a bizarre opclass-lookup error because of sloppy thinking in the
index-matching code.

David Rowley
2014-08-09 13:46:34 -04:00
Bruce Momjian 4c6780fd17 pg_upgrade: prevent oid conflicts with new-cluster TOAST tables
Previously, TOAST tables only required in the new cluster could cause
oid conflicts if they were auto-numbered and a later conflicting oid had
to be assigned.

Backpatch through 9.3
2014-08-07 14:56:13 -04:00
Robert Haas 1d41739e5a Don't require sort support functions to provide a comparator.
This could be useful for datatypes like text, where we might want
to optimize for some collations but not others.  However, this patch
doesn't introduce any new sortsupport functions that work this way;
it merely revises the code so that future patches may do so.

Patch by me.  Review by Peter Geoghegan.
2014-08-06 16:06:06 -04:00
Fujii Masao e3da0d4d1a Change ParseConfigFp() so that it doesn't process unused entry of each parameter.
When more than one setting entries of same parameter exist in the
configuration file, PostgreSQL uses only entry appearing last in
configuration file scan. Since the other entries are not used,
ParseConfigFp() doesn't need to process them, but previously it did
that. This problematic behavior caused the configuration file scan
to detect invalid settings of unused entries (e.g., existence of
multiple entries of PGC_POSTMASTER parameter) and log the messages
complaining about them.

This commit changes the configuration file scan so that it processes
only last entry of each parameter.

Note that when multiple entries of same parameter exist both in
postgresql.conf and postgresql.auto.conf, unused entries in
postgresql.conf are still processed only at postmaster startup.

The problem has existed since old version, but a user is more likely
to encounter it since 9.4 where ALTER SYSTEM command was introduced.
So back-patch to 9.4.

Amit Kapila, slightly modified by me. Per report from Christoph Berg.
2014-08-06 14:49:43 +09:00
Kevin Grittner 49d1e03d64 Fix typo in C comment. 2014-08-05 14:17:21 -05:00
Robert Haas 0ef99bdce3 Improve some JSON error messages.
These messages are new in 9.4, which hasn't been released yet, so
back-patch to REL9_4_STABLE.

Daniele Varrazzo
2014-08-05 12:28:57 -04:00
Heikki Linnakangas 54685338e3 Move log_newpage and log_newpage_buffer to xlog.c.
log_newpage is used by many indexams, in addition to heap, but for
historical reasons it's always been part of the heapam rmgr. Starting with
9.3, we have another WAL record type for logging an image of a page,
XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to
xlog.c, and switch to using the XLOG_FPI record type.

Bump the WAL version number because the code to replay the old HEAP_NEWPAGE
records is removed.
2014-07-31 16:48:55 +03:00
Tom Lane f51ead09df Avoid wholesale autovacuuming when autovacuum is nominally off.
When autovacuum is nominally off, we will still launch autovac workers
to vacuum tables that are at risk of XID wraparound.  But after we'd done
that, an autovac worker would proceed to autovacuum every table in the
targeted database, if they meet the usual thresholds for autovacuuming.
This is at best pretty unexpected; at worst it delays response to the
wraparound threat.  Fix it so that if autovacuum is nominally off, we
*only* do forced vacuums and not any other work.

Per gripe from Andrey Zhidenkov.  This has been like this all along,
so back-patch to all supported branches.
2014-07-30 14:41:35 -04:00
Robert Haas e280c630a8 Fix mishandling of background worker PGPROCs in EXEC_BACKEND builds.
InitProcess() relies on IsBackgroundWorker to decide whether the PGPROC
for a new backend should be taken from ProcGlobal's freeProcs or from
bgworkerFreeProcs.  In EXEC_BACKEND builds, InitProcess() is called
sooner than in non-EXEC_BACKEND builds, and IsBackgroundWorker wasn't
getting initialized soon enough.

Report by Noah Misch.  Diagnosis and fix by me.
2014-07-30 11:34:06 -04:00
Alvaro Herrera 0531549801 Avoid uselessly looking up old LOCK_ONLY multixacts
Commit 0ac5ad5134 removed an optimization in multixact.c that skipped
fetching members of MultiXactId that were older than our
OldestVisibleMXactId value.  The reason this was removed is that it is
possible for multixacts that contain updates to be older than that
value.  However, if the caller is certain that the multi does not
contain an update (because the infomask bits say so), it can pass this
info down to GetMultiXactIdMembers, enabling it to use the old
optimization.

Pointed out by Andres Freund in 20131121200517.GM7240@alap2.anarazel.de
2014-07-29 15:41:06 -04:00
Alvaro Herrera c2581794f3 Simplify multixact freezing a bit
Testing for abortedness of a multixact member that's being frozen is
unnecessary: we only need to know whether the transaction is still in
progress or committed to determine whether it must be kept or not.  This
let us simplify the code a bit and avoid a useless TransactionIdDidAbort
test.

Suggested by Andres Freund awhile back.
2014-07-29 15:40:55 -04:00
Heikki Linnakangas 60d931827b Oops, fix recoveryStopsBefore functions for regular commits.
Pointed out by Tom Lane. Backpatch to 9.4, the code was structured
differently in earlier branches and didn't have this mistake.
2014-07-29 17:19:43 +03:00
Heikki Linnakangas e74e0906fa Treat 2PC commit/abort the same as regular xacts in recovery.
There were several oversights in recovery code where COMMIT/ABORT PREPARED
records were ignored:

* pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits)
* recovery_min_apply_delay (2PC commits were applied immediately)
* recovery_target_xid (recovery would not stop if the XID used 2PC)

The first of those was reported by Sergiy Zuban in bug #11032, analyzed by
Tom Lane and Andres Freund. The bug was always there, but was masked before
commit d19bd29f07, because COMMIT PREPARED
always created an extra regular transaction that was WAL-logged.

Backpatch to all supported versions (older versions didn't have all the
features and therefore didn't have all of the above bugs).
2014-07-29 11:59:22 +03:00
Tom Lane 248fc1f107 Fix obsolete statement in smgr/README.
Since commit 2d00190495, fork numbers are
defined in relpath.h not relfilenode.h.

Fabrízio de Royes Mello
2014-07-28 16:30:14 -04:00
Noah Misch de35a97710 Handle WAIT_IO_COMPLETION return from WaitForMultipleObjectsEx().
This return code is possible wherever we pass bAlertable = TRUE; it
arises when Windows caused the current thread to run an "I/O completion
routine" or an "asynchronous procedure call".  PostgreSQL does not
provoke either of those Windows facilities, hence this bug remaining
largely unnoticed, but other local code might do so.  Due to a shortage
of complaints, no back-patch for now.

Per report from Shiv Shivaraju Gowda, this bug can cause
PGSemaphoreLock() to PANIC.  The bug can also cause select() to report
timeout expiration too early, which might confuse pgstat_init() and
CheckRADIUSAuth().
2014-07-25 18:51:48 -04:00
Robert Haas 1144ea3421 Prevent shm_mq_send from reading uninitialized memory.
shm_mq_send_bytes didn't invariably initialize *bytes_written before
returning, which would cause shm_mq_send to read from uninitialized
memory and add the value it found there to mqh->mqh_partial_bytes.
This could cause the next attempt to send a message via the queue to
fail an assertion (if the queue was detached) or copy data from a
garbage pointer value into the queue (if non-blocking mode was in use).
2014-07-24 09:23:22 -04:00
Robert Haas 250c26ba9c Fix checkpointer crash in EXEC_BACKEND builds.
Nothing in the checkpointer calls InitXLOGAccess(), so WALInsertLocks
never got initialized there.  Without EXEC_BACKEND, it works anyway
because the correct value is inherited from the postmaster, but
with EXEC_BACKEND we've got a problem.  The problem appears to have
been introduced by commit 68a2e52bba.

To fix, move the relevant initialization steps from InitXLOGAccess()
to XLOGShmemInit(), making this more parallel to what we do
elsewhere.

Amit Kapila
2014-07-24 09:12:38 -04:00