Commit Graph

14332 Commits

Author SHA1 Message Date
Andres Freund 51adcaa0df Add cluster_name GUC which is included in process titles if set.
When running several postgres clusters on one OS instance it's often
inconveniently hard to identify which "postgres" process belongs to
which postgres instance.

Add the cluster_name GUC, whose value will be included as part of the
process titles if set. With that processes can more easily identified
using tools like 'ps'.

To avoid problems with encoding mismatches between postgresql.conf,
consoles, and individual databases replace non-ASCII chars in the name
with question marks. The length is limited to NAMEDATALEN to make it
less likely to truncate important information at the end of the
status.

Thomas Munro, with some adjustments by me and review by a host of people.
2014-06-29 14:15:09 +02:00
Andres Freund a6d488cb53 Remove Alpha and Tru64 support.
Support for running postgres on Alpha hasn't been tested for a long
while. Due to Alpha's uniquely lax cache coherency model it's a hard
to develop for platform (especially blindly!) and thought to be
unlikely to currently work correctly.

As Alpha is the only supported architecture for Tru64 drop support for
it as well. Tru64's support has ended 2012 and it has been in
maintenance-only mode for much longer.

Also remove stray references to __ksr__ and ultrix defines.
2014-06-28 21:46:15 +02:00
Tom Lane d222585a9f Allow pushdown of WHERE quals into subqueries with window functions.
We can allow this even without any specific knowledge of the semantics
of the window function, so long as pushed-down quals will either accept
every row in a given window partition, or reject every such row.  Because
window functions act only within a partition, such a case can't result
in changing the window functions' outputs for any surviving row.
Eliminating entire partitions in this way obviously can reduce the cost
of the window-function computations substantially.

The fly in the ointment is that it's hard to be entirely sure whether
this is true for an arbitrary qual condition.  This patch allows pushdown
if (a) the qual references only partitioning columns, and (b) the qual
contains no volatile functions.  We are at risk of incorrect results if
the qual can produce different answers for values that the partitioning
equality operator sees as equal.  While it's not hard to invent cases
for which that can happen, it seems to seldom be a problem in practice,
since no one has complained about a similar assumption that we've had
for many years with respect to DISTINCT.  The potential performance
gains seem to be worth the risk.

David Rowley, reviewed by Vik Fearing; some credit is due also to
Thomas Mayer who did considerable preliminary investigation.
2014-06-27 23:08:08 -07:00
Alvaro Herrera f741300c90 Have multixact be truncated by checkpoint, not vacuum
Instead of truncating pg_multixact at vacuum time, do it only at
checkpoint time.  The reason for doing it this way is twofold: first, we
want it to delete only segments that we're certain will not be required
if there's a crash immediately after the removal; and second, we want to
do it relatively often so that older files are not left behind if
there's an untimely crash.

Per my proposal in
http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org
we now execute the truncation in the checkpointer process rather than as
part of vacuum.  Vacuum is in only charge of maintaining in shared
memory the value to which it's possible to truncate the files; that
value is stored as part of checkpoints also, and so upon recovery we can
reuse the same value to re-execute truncate and reset the
oldest-value-still-safe-to-use to one known to remain after truncation.

Per bug reported by Jeff Janes in the course of his tests involving
bug #8673.

While at it, update some comments that hadn't been updated since
multixacts were changed.

Backpatch to 9.3, where persistency of pg_multixact files was
introduced by commit 0ac5ad5134.
2014-06-27 14:43:53 -04:00
Alvaro Herrera b7e51d9c06 Don't allow relminmxid to go backwards during VACUUM FULL
We were allowing a table's pg_class.relminmxid value to move backwards
when heaps were swapped by VACUUM FULL or CLUSTER.  There is a
similar protection against relfrozenxid going backwards, which we
neglected to clone when the multixact stuff was rejiggered by commit
0ac5ad5134.

Backpatch to 9.3, where relminmxid was introduced.

As reported by Heikki in
http://www.postgresql.org/message-id/52401AEA.9000608@vmware.com
2014-06-27 14:43:46 -04:00
Alvaro Herrera b277057648 Fix broken Assert() introduced by 8e9a16ab8f
Don't assert MultiXactIdIsRunning if the multi came from a tuple that
had been share-locked and later copied over to the new cluster by
pg_upgrade.  Doing that causes an error to be raised unnecessarily:
MultiXactIdIsRunning is not open to the possibility that its argument
came from a pg_upgraded tuple, and all its other callers are already
checking; but such multis cannot, obviously, have transactions still
running, so the assert is pointless.

Noticed while investigating the bogus pg_multixact/offsets/0000 file
left over by pg_upgrade, as reported by Andres Freund in
http://www.postgresql.org/message-id/20140530121631.GE25431@alap3.anarazel.de

Backpatch to 9.3, as the commit that introduced the buglet.
2014-06-27 14:43:39 -04:00
Tom Lane 1147035203 Disallow pushing volatile qual expressions down into DISTINCT subqueries.
A WHERE clause applied to the output of a subquery with DISTINCT should
theoretically be applied only once per distinct row; but if we push it
into the subquery then it will be evaluated at each row before duplicate
elimination occurs.  If the qual is volatile this can give rise to
observably wrong results, so don't do that.

While at it, refactor a little bit to allow subquery_is_pushdown_safe
to report more than one kind of restrictive condition without indefinitely
expanding its argument list.

Although this is a bug fix, it seems unwise to back-patch it into released
branches, since it might de-optimize plans for queries that aren't giving
any trouble in practice.  So apply to 9.4 but not further back.
2014-06-27 11:08:48 -07:00
Tom Lane 798e235790 Rationalize error messages within jsonfuncs.c.
I noticed that the functions in jsonfuncs.c sometimes printed error
messages that claimed I'd called some other function.  Investigation showed
that this was from repurposing code into "worker" functions without taking
much care as to whether it would mention the right SQL-level function if it
threw an error.  Moreover, there was a weird mismash of messages that
contained a fixed function name, messages that used %s for a function name,
and messages that constructed a function name out of spare parts, like
"json%s_populate_record" (which, quite aside from being ugly as sin, wasn't
even sufficient to cover all the cases).  This would put an undue burden on
our long-suffering translators.  Standardize on inserting the SQL function
name with %s so as to reduce the number of translatable strings, and pass
function names around as needed to make sure we can report the right one.
Fix up some gratuitous variations in wording, too.
2014-06-25 15:25:22 -07:00
Tom Lane 8d2d7ad5ab Cosmetic improvements in jsonfuncs.c.
Re-pgindent, remove a lot of random vertical whitespace, remove useless
(if not counterproductive) inline markings, get rid of unnecessary
zero-padding of strings for hashtable searches.  No functional changes.
2014-06-25 11:22:18 -07:00
Tom Lane 57d8c1270e Fix handling of nested JSON objects in json_populate_recordset and friends.
populate_recordset_object_start() improperly created a new hash table
(overwriting the link to the existing one) if called at nest levels
greater than one.  This resulted in previous fields not appearing in
the final output, as reported by Matti Hameister in bug #10728.
In 9.4 the problem also affects json_to_recordset.

This perhaps missed detection earlier because the default behavior is to
throw an error for nested objects: you have to pass use_json_as_text = true
to see the problem.

In addition, fix query-lifespan leakage of the hashtable created by
json_populate_record().  This is pretty much the same problem recently
fixed in dblink: creating an intended-to-be-temporary context underneath
the executor's per-tuple context isn't enough to make it go away at the
end of the tuple cycle, because MemoryContextReset is not
MemoryContextResetAndDeleteChildren.

Michael Paquier and Tom Lane
2014-06-24 21:22:40 -07:00
Heikki Linnakangas a87a7dc8b6 Don't allow foreign tables with OIDs.
The syntax doesn't let you specify "WITH OIDS" for foreign tables, but it
was still possible with default_with_oids=true. But the rest of the system,
including pg_dump, isn't prepared to handle foreign tables with OIDs
properly.

Backpatch down to 9.1, where foreign tables were introduced. It's possible
that there are databases out there that already have foreign tables with
OIDs. There isn't much we can do about that, but at least we can prevent
them from being created in the future.

Patch by Etsuro Fujita, reviewed by Hadi Moshayedi.
2014-06-24 13:27:18 +03:00
Robert Haas c922353b1c Check for interrupts during tuple-insertion loops.
Normally, this won't matter too much; but if I/O is really slow, for
example because the system is overloaded, we might write many pages
before checking for interrupts.  A single toast insertion might
write up to 1GB of data, and a multi-insert could write hundreds
of tuples (and their corresponding TOAST data).
2014-06-23 21:45:21 -04:00
Heikki Linnakangas 85ba0748ed Fix bug in WAL_DEBUG.
The record header was not copied correctly to the buffer that was passed
to the rm_desc function. Broken by my rm_desc signature refactoring patch.
2014-06-23 12:22:36 +03:00
Tom Lane 8b38a538c0 Add Asserts to verify that catalog cache keys are unique and not null.
The catcache code is effectively assuming this already, so let's insist
that the catalog and index are actually declared that way.

Having done that, the comments in indexing.h about non-unique indexes
not being used for catcaches are completely redundant not just mostly so;
and we didn't have such a comment for every such index anyway.  So let's
get rid of them.

Per discussion of whether we should identify primary keys for catalogs.
We might or might not take that further step, but this change in itself
will allow quicker detection of misdeclared catcaches, so it seems worth
doing in any case.
2014-06-20 18:21:05 -04:00
Andres Freund ecac0e2b9e Do all-visible handling in lazy_vacuum_page() outside its critical section.
Since fdf9e21196 lazy_vacuum_page() rechecks the all-visible status
of pages in the second pass over the heap. It does so inside a
critical section, but both visibilitymap_test() and
heap_page_is_all_visible() perform operations that should not happen
inside one. The former potentially performs IO and both potentially do
memory allocations.

To fix, simply move all the all-visible handling outside the critical
section. Doing so means that the PD_ALL_VISIBLE on the page won't be
included in the full page image of the HEAP2_CLEAN record anymore. But
that's fine, the flag will be set by the HEAP2_VISIBLE logged later.

Backpatch to 9.3 where the problem was introduced. The bug only came
to light due to the assertion added in 4a170ee9 and isn't likely to
cause problems in production scenarios. The worst outcome is a
avoidable PANIC restart.

This also gets rid of the difference in the order of operations
between master and standby mentioned in 2a8e1ac5.

Per reports from David Leverton and Keith Fiske in bug #10533.
2014-06-20 11:09:17 +02:00
Andres Freund 3bdcf6a5a7 Don't allow to disable backend assertions via the debug_assertions GUC.
The existance of the assert_enabled variable (backing the
debug_assertions GUC) reduced the amount of knowledge some static code
checkers (like coverity and various compilers) could infer from the
existance of the assertion. That could have been solved by optionally
removing the assertion_enabled variable from the Assert() et al macros
at compile time when some special macro is defined, but the resulting
complication doesn't seem to be worth the gain from having
debug_assertions. Recompiling is fast enough.

The debug_assertions GUC is still available, but readonly, as it's
useful when diagnosing problems. The commandline/client startup option
-A, which previously also allowed to enable/disable assertions, has
been removed as it doesn't serve a purpose anymore.

While at it, reduce code duplication in bufmgr.c and localbuf.c
assertions checking for spurious buffer pins. That code had to be
reindented anyway to cope with the assert_enabled removal.
2014-06-20 11:09:17 +02:00
Tom Lane 45b0f35723 Avoid leaking memory while evaluating arguments for a table function.
ExecMakeTableFunctionResult evaluated the arguments for a function-in-FROM
in the query-lifespan memory context.  This is insignificant in simple
cases where the function relation is scanned only once; but if the function
is in a sub-SELECT or is on the inside of a nested loop, any memory
consumed during argument evaluation can add up quickly.  (The potential for
trouble here had been foreseen long ago, per existing comments; but we'd
not previously seen a complaint from the field about it.)  To fix, create
an additional temporary context just for this purpose.

Per an example from MauMau.  Back-patch to all active branches.
2014-06-19 22:14:26 -04:00
Fujii Masao 9ba78fb0b9 Don't allow data_directory to be set in postgresql.auto.conf by ALTER SYSTEM.
data_directory could be set both in postgresql.conf and postgresql.auto.conf so far.
This could cause some problematic situations like circular definition. To avoid such
situations, this commit forbids a user to set data_directory in postgresql.auto.conf.

Backpatch this to 9.4 where ALTER SYSTEM command was introduced.

Amit Kapila, reviewed by Abhijit Menon-Sen, with minor adjustments by me.
2014-06-19 20:31:20 +09:00
Tom Lane df8b7bc9ff Improve our mechanism for controlling the Linux out-of-memory killer.
Arrange for postmaster child processes to respond to two environment
variables, PG_OOM_ADJUST_FILE and PG_OOM_ADJUST_VALUE, to determine whether
they reset their OOM score adjustments and if so to what.  This is superior
to the previous design involving #ifdef's in several ways.  The behavior is
now available in a default build, and both ends of the adjustment --- the
original adjustment of the postmaster's level and the subsequent
readjustment by child processes --- can now be controlled in one place,
namely the postmaster launch script.  So it's no longer necessary for the
launch script to act on faith that the server was compiled with the
appropriate options.  In addition, if someone wants to use an OOM score
other than zero for the child processes, that doesn't take a recompile
anymore; and we no longer have to cater separately to the two different
historical kernel APIs for this adjustment.

Gurjeet Singh, somewhat revised by me
2014-06-18 20:12:51 -04:00
Andrew Dunstan 960661980b Remove unnecessary check for jbvBinary in convertJsonbValue.
The check was confusing and is a condition that should never in fact
happen.

Per gripe from Dmitry Dolgov.
2014-06-18 19:28:20 -04:00
Tom Lane 66802246e2 Fix weird spacing in error message.
Seems to have been introduced in 1a3458b6d8.
2014-06-18 15:44:35 -04:00
Tom Lane 8f889b1083 Implement UPDATE tab SET (col1,col2,...) = (SELECT ...), ...
This SQL-standard feature allows a sub-SELECT yielding multiple columns
(but only one row) to be used to compute the new values of several columns
to be updated.  While the same results can be had with an independent
sub-SELECT per column, such a workaround can require a great deal of
duplicated computation.

The standard actually says that the source for a multi-column assignment
could be any row-valued expression.  The implementation used here is
tightly tied to our existing sub-SELECT support and can't handle other
cases; the Bison grammar would have some issues with them too.  However,
I don't feel too bad about this since other cases can be converted into
sub-SELECTs.  For instance, "SET (a,b,c) = row_valued_function(x)" could
be written "SET (a,b,c) = (SELECT * FROM row_valued_function(x))".
2014-06-18 13:22:34 -04:00
Tom Lane 2146f13408 Avoid recursion when processing simple lists of AND'ed or OR'ed clauses.
Since most of the system thinks AND and OR are N-argument expressions
anyway, let's have the grammar generate a representation of that form when
dealing with input like "x AND y AND z AND ...", rather than generating
a deeply-nested binary tree that just has to be flattened later by the
planner.  This avoids stack overflow in parse analysis when dealing with
queries having more than a few thousand such clauses; and in any case it
removes some rather unsightly inconsistencies, since some parts of parse
analysis were generating N-argument ANDs/ORs already.

It's still possible to get a stack overflow with weirdly parenthesized
input, such as "x AND (y AND (z AND ( ... )))", but such cases are not
mainstream usage.  The maximum depth of parenthesization is already
limited by Bison's stack in such cases, anyway, so that the limit is
probably fairly platform-independent.

Patch originally by Gurjeet Singh, heavily revised by me
2014-06-16 15:55:30 -04:00
Heikki Linnakangas 0ef0b6784c Change the signature of rm_desc so that it's passed a XLogRecord.
Just feels more natural, and is more consistent with rm_redo.
2014-06-14 10:46:48 +03:00
Tom Lane 3f8c23c4d3 Improve predtest.c's ability to reason about operator expressions.
We have for a long time been able to prove implications and refutations
between clauses structured like "expr op const" with the same subexpression
and btree-related operators; for example that "x < 4" implies "x <= 5".
The implication machinery is needed to detect usability of partial indexes,
and the refutation machinery is needed to implement constraint exclusion.

This patch extends that machinery to make proofs for operator expressions
involving the same two immutable-but-not-necessarily-just-Const input
expressions, ie does "expr1 op1 expr2" prove or refute "expr1 op2 expr2" or
"expr2 op2 expr1"?  An important example is that we can now prove "x = y"
given "y = x", which formerly the code could not deduce unless x or y was a
constant.  We can make use of the system's knowledge of operator commutator
and negator pairs, and can also make use of btree opclass relationships,
for example "x < y" implies "x <= y" and refutes "x > y" (notice that
neither of these could be proven just from commutator or negator links).

Inspired by a gripe from Brian Dunavant.  This seems more like a new
feature than a bug fix, though, so no back-patch.
2014-06-13 00:02:56 -04:00
Tom Lane 6554656ea2 Improve tuplestore's error messages for I/O failures.
We should report the errno when we get a failure from functions like
BufFileWrite.  "ERROR: write failed" is unreasonably taciturn for a
case that's well within the realm of possibility; I've seen it a
couple times in the buildfarm recently, in situations that were
probably out-of-disk-space, but it'd be good to see the errno
to confirm it.

I think this code was originally written without assuming that
the buffile.c functions would return useful errno; but most other
callers *are* assuming that, and a quick look at the buffile code
gives no reason to suppose otherwise.

Also, a couple of the old messages were phrased on the assumption
that a short read might indicate a logic bug in tuplestore itself;
but that code's pretty well tested by now, so a filesystem-level
problem seems much more likely.
2014-06-12 18:59:06 -04:00
Tom Lane 9d4444a6fc Preserve exposed type of subquery outputs when substituting NULLs.
I thought I could get away with hardcoded int4 here, but the buildfarm
says differently.
2014-06-12 17:11:53 -04:00
Tom Lane 154146d208 Rename lo_create(oid, bytea) to lo_from_bytea().
The previous naming broke the query that libpq's lo_initialize() uses
to collect the OIDs of the server-side functions it requires, because
that query effectively assumes that there is only one function named
lo_create in the pg_catalog schema (and likewise only one lo_open, etc).

While we should certainly make libpq more robust about this, the naive
query will remain in use in the field for the foreseeable future, so it
seems the only workable choice is to use a different name for the new
function.  lo_from_bytea() won a small straw poll.

Back-patch into 9.4 where the new function was introduced.
2014-06-12 15:39:09 -04:00
Alvaro Herrera 7937910781 Fix typos 2014-06-12 14:01:01 -04:00
Tom Lane 55d5b3c082 Remove unnecessary output expressions from unflattened subqueries.
If a sub-select-in-FROM gets flattened into the upper query, then we
naturally get rid of any output columns that are defined in the sub-select
text but not actually used in the upper query.  However, this doesn't
happen when it's not possible to flatten the subquery, for example because
it contains GROUP BY, LIMIT, etc.  Allowing the subquery to compute useless
output columns is often fairly harmless, but sometimes it has significant
performance cost: the unused output might be an expensive expression,
or it might be a Var from a relation that we could remove entirely (via
the join-removal logic) if only we realized that we didn't really need
that Var.  Situations like this are common when expanding views, so it
seems worth taking the trouble to detect and remove unused outputs.

Because the upper query's Var numbering for subquery references depends on
positions in the subquery targetlist, we don't want to renumber the items
we leave behind.  Instead, we can implement "removal" by replacing the
unwanted expressions with simple NULL constants.  This wastes a few cycles
at runtime, but not enough to justify more work in the planner.
2014-06-12 13:12:53 -04:00
Andres Freund e04a9ccd2c Consistency improvements for slot and decoding code.
Change the order of checks in similar functions to be the same; remove
a parameter that's not needed anymore; rename a memory context and
expand a couple of comments.

Per review comments from Amit Kapila
2014-06-12 13:33:27 +02:00
Noah Misch d098b236f3 Fix typos in comments. 2014-06-11 19:50:29 -04:00
Fujii Masao a26ae56f51 Fix typos in comments. 2014-06-11 20:54:06 +09:00
Tom Lane fd90b5d574 Fix ancient encoding error in hungarian.stop.
When we grabbed this file off the Snowball project's website, we mistakenly
supposed that it was in LATIN1 encoding, but evidently it was actually in
LATIN2.  This resulted in ő (o-double-acute, U+0151, which is code 0xF5 in
LATIN2) being misconverted into õ (o-tilde, U+00F5), as complained of in
bug #10589 from Zoltán Sörös.  We'd have messed up u-double-acute too,
but there aren't any of those in the file.  Other characters used in the
file have the same codes in LATIN1 and LATIN2, which no doubt helped hide
the problem for so long.

The error is not only ours: the Snowball project also was confused about
which encoding is required for Hungarian.  But dealing with that will
require source-code changes that I'm not at all sure we'll wish to
back-patch.  Fixing the stopword file seems reasonably safe to back-patch
however.
2014-06-10 22:48:16 -04:00
Tom Lane c170655cc8 Fix infinite loop when splitting inner tuples in SPGiST text indexes.
Previously, the code used a node label of zero both for strings that
contain no bytes beyond the inner tuple's prefix, and for cases where an
"allTheSame" inner tuple has to be split to allow a string with a different
next byte to be inserted into it.  Failing to distinguish these cases meant
that if a string ending with the current prefix needed to be inserted into
an allTheSame tuple, we got into an infinite loop, because after splitting
the tuple we'd descend into the child allTheSame tuple and then find we
need to split again.

To fix, instead use -1 and -2 as the node labels for these two cases.
This requires widening the node label type from "char" to int2, but
fortunately SPGiST stores all pass-by-value node label types in their
Datum representation, which means that this change is transparently upward
compatible so far as the on-disk representation goes.  We continue to
recognize zero as a dummy node label for reading purposes, but will not
attempt to push new index entries down into such a label, so that the loop
won't occur even when dealing with an existing index.

Per report from Teodor Sigaev.  Back-patch to 9.2 where the faulty
code was introduced.
2014-06-09 16:31:11 -04:00
Alvaro Herrera b0b263baab Wrap multixact/members correctly during extension, take 2
In a50d976254 I already changed this, but got it wrong for the case
where the number of members is larger than the number of entries that
fit in the last page of the last segment.

As reported by Serge Negodyuck in a followup to bug #8673.
2014-06-09 15:17:23 -04:00
Andres Freund fe7337f2dc Fix off-by-one in decoding causing one-record events to be skipped.
A ReorderBufferTransaction's end_lsn, the sentPtr advocated by
walsender keepalive messages, and the end location remembered by the
decoding get_*changes* SQL functions all use the location of the last
read record + 1. I.e. the LSN points to the beginning of the next
record. That cannot realistically be changed without changing the
replication protocol because that's how keepalive messages have worked
since 9.0.
The bug is that the logic inside the snapshot builder, which decides
whether a transaction's contents should be decoded, assumed the start
location would point towards the last byte of the last record. The
reason this didn't actually cause visible problems is that currently
that decision is only made for commit records. Since interesting
transactions always have at least one additional record - containing
actual data - we'd never skip a transaction.
But if there ever were transactions, or other events, with just one
record containing important information, we'd skip them after stopping
and restarting logical decoding.
2014-06-05 18:27:11 +02:00
Tom Lane 5f93c37805 Add defenses against running with a wrong selection of LOBLKSIZE.
It's critical that the backend's idea of LOBLKSIZE match the way data has
actually been divided up in pg_largeobject.  While we don't provide any
direct way to adjust that value, doing so is a one-line source code change
and various people have expressed interest recently in changing it.  So,
just as with TOAST_MAX_CHUNK_SIZE, it seems prudent to record the value in
pg_control and cross-check that the backend's compiled-in setting matches
the on-disk data.

Also tweak the code in inv_api.c so that fetches from pg_largeobject
explicitly verify that the length of the data field is not more than
LOBLKSIZE.  Formerly we just had Asserts() for that, which is no protection
at all in production builds.  In some of the call sites an overlength data
value would translate directly to a security-relevant stack clobber, so it
seems worth one extra runtime comparison to be sure.

In the back branches, we can't change the contents of pg_control; but we
can still make the extra checks in inv_api.c, which will offer some amount
of protection against running with the wrong value of LOBLKSIZE.
2014-06-05 11:31:06 -04:00
Andres Freund f0c108560b Consistently spell a replication slot's name as slot_name.
Previously there's been a mix between 'slotname' and 'slot_name'. It's
not nice to be unneccessarily inconsistent in a new feature. As a post
beta1 initdb now is required in the wake of eeca4cd35e, fix the
inconsistencies.
Most the changes won't affect usage of replication slots because the
majority of changes is around function parameter names. The prominent
exception to that is that the recovery.conf parameter
'primary_slotname' is now named 'primary_slot_name'.
2014-06-05 16:29:20 +02:00
Heikki Linnakangas 8776faa81c Adjust SP-GiST WAL record formats to reduce alignment padding.
The way the code was written, the padding was copied from uninitialized
memory areas.. Because the structs are local variables in the code where
the WAL records are constructed, making them larger and zeroing the padding
bytes would not make the code very pretty, so rather than fixing this
directly by zeroing out the padding bytes, it seems more clear to not try to
align the tuples in the WAL records. The redo functions are taught to copy
the tuple header to a local variable to avoid unaligned access.

Stable-branches have the same problem, but we can't change the WAL format
there, so fix in master only. Reading a few random extra bytes at the stack
is harmless in practice, so it's not worth crafting a different
back-patchable fix.

Per reports from Kevin Grittner and Andres Freund, using clang static
analyzer and Valgrind, respectively.
2014-06-05 12:55:35 +03:00
Tom Lane 4c8ab1b91d Add btree and hash opclasses for pg_lsn.
This is needed to allow ORDER BY, DISTINCT, etc to work as expected for
pg_lsn values.

We had previously decided to put this off for 9.5, but in view of commit
eeca4cd35e there's no reason to avoid a
catversion bump for 9.4beta2, and this does make a pretty significant
usability difference for pg_lsn.

Michael Paquier, with fixes from Andres Freund and Tom Lane
2014-06-04 20:45:56 -04:00
Andres Freund 621a99a666 Fix longstanding bug in HeapTupleSatisfiesVacuum().
HeapTupleSatisfiesVacuum() didn't properly discern between
DELETE_IN_PROGRESS and INSERT_IN_PROGRESS for rows that have been
inserted in the current transaction and deleted in a aborted
subtransaction of the current backend. At the very least that caused
problems for CLUSTER and CREATE INDEX in transactions that had
aborting subtransactions producing rows, leading to warnings like:
WARNING:  concurrent delete in progress within table "..."
possibly in an endless, uninterruptible, loop.

Instead of treating *InProgress xmins the same as *IsCurrent ones,
treat them as being distinct like the other visibility routines. As
implemented this separatation can cause a behaviour change for rows
that have been inserted and deleted in another, still running,
transaction. HTSV will now return INSERT_IN_PROGRESS instead of
DELETE_IN_PROGRESS for those. That's both, more in line with the other
visibility routines and arguably more correct. The latter because a
INSERT_IN_PROGRESS will make callers look at/wait for xmin, instead of
xmax.
The only current caller where that's possibly worse than the old
behaviour is heap_prune_chain() which now won't mark the page as
prunable if a row has concurrently been inserted and deleted. That's
harmless enough.

As a cautionary measure also insert a interrupt check before the gotos
in IndexBuildHeapScan() that lead to the uninterruptible loop. There
are other possible causes, like a row that several sessions try to
update and all fail, for repeated loops and the cost of doing so in
the retry case is low.

As this bug goes back all the way to the introduction of
subtransactions in 573a71a5da backpatch to all supported releases.

Reported-By: Sandro Santilli
2014-06-04 21:36:19 +02:00
Fujii Masao 654e8e4447 Save pg_stat_statements statistics file into $PGDATA/pg_stat directory at shutdown.
187492b6c2 changed pgstat.c so that
the stats files were saved into $PGDATA/pg_stat directory when the server
was shutdowned. But it accidentally forgot to change the location of
pg_stat_statements permanent stats file. This commit fixes pg_stat_statements
so that its stats file is also saved into $PGDATA/pg_stat at shutdown.

Since this fix changes the file layout, we don't back-patch it to 9.3
where this oversight was introduced.
2014-06-04 12:09:45 +09:00
Andrew Dunstan ab14a73a6c Use EncodeDateTime instead of to_char to render JSON timestamps.
Per gripe from Peter Eisentraut and Tom Lane.

The output is slightly different, but still ISO 8601 compliant: to_char
doesn't output the minutes when time zone offset is an integer number of
hours, while EncodeDateTime outputs ":00".

The code is slightly adapted from code in xml.c
2014-06-03 18:26:47 -04:00
Andrew Dunstan 0ad1a81632 Do not escape a unicode sequence when escaping JSON text.
Previously, any backslash in text being escaped for JSON was doubled so
that the result was still valid JSON. However, this led to some perverse
results in the case of Unicode sequences, These are now detected and the
initial backslash is no longer escaped. All other backslashes are
still escaped. No validity check is performed, all that is looked for is
\uXXXX where X is a hexidecimal digit.

This is a change from the 9.2 and 9.3 behaviour as noted in the Release
notes.

Per complaint from Teodor Sigaev.
2014-06-03 16:11:31 -04:00
Andrew Dunstan f30015b6d7 Output timestamps in ISO 8601 format when rendering JSON.
Many JSON processors require timestamp strings in ISO 8601 format in
order to convert the strings. When converting a timestamp, with or
without timezone, to a JSON datum we therefore now use such a format
rather than the type's default text output, in functions such as
to_json().

This is a change in behaviour from 9.2 and 9.3, as noted in the release
notes.
2014-06-03 13:56:53 -04:00
Andres Freund 44445b28d2 Set the process latch when processing recovery conflict interrupts.
Because RecoveryConflictInterrupt() didn't set the process latch
anything using the latter to wait for events didn't get notified about
recovery conflicts. Most latch users are never the target of recovery
conflicts, which explains the lack of reports about this until
now.
Since 9.3 two possible affected users exist though: The sql callable
pg_sleep() now uses latches to wait and background workers are
expected to use latches in their main loop. Both would currently wait
until the end of WaitLatch's timeout.

Fix by adding a SetLatch() to RecoveryConflictInterrupt(). It'd also
be possible to fix the issue by having each latch user set
set_latch_on_sigusr1. That seems failure prone and though, as most of
these callsites won't often receive recovery conflicts and thus will
likely only be tested against normal query cancels et al. It'd also be
unnecessarily verbose.

Backpatch to 9.1 where latches were introduced. Arguably 9.3 would be
sufficient, because that's where pg_sleep() was converted to waiting
on the latch and background workers got introduced; but there could be
user level code making use of the latch pre 9.3.
2014-06-03 14:02:54 +02:00
Andrew Dunstan 1a4174a498 Improve the efficiency of certain jsonb get operations.
Instead of iterating over jsonb structures, use the inbuilt functions
findJsonbValueFromContainerLen() and getIthJsonbValueFromContainer() to
extract values directly. These functions use algorithms that are O(n log
n) and O(1) respectively, whereas iterating is O(n), so we should see
considerable speedup here.

Teodor Sigaev.
2014-06-01 19:04:02 -04:00
Tom Lane 71ed8b3ca7 Revert "Fix bogus %name-prefix option syntax in all our Bison files."
This reverts commit 45b7abe59e.

It turns out that the %name-prefix syntax without "=" does not work
at all in pre-2.4 Bison.  We are not prepared to make such a large
jump in minimum required Bison version just to suppress a warning
message in a version hardly any developers are using yet.
When 3.0 gets more popular, we'll figure out a way to deal with this.
In the meantime, BISONFLAGS=-Wno-deprecated is recommendable for
anyone using 3.0 who doesn't want to see the warning.
2014-05-28 19:21:01 -04:00
Andres Freund 21d48d66c8 Don't pay heed to wal_sender_timeout while creating a decoding slot.
Sometimes CREATE_REPLICATION_SLOT ... LOGICAL ... needs to wait for
further WAL using WalSndWaitForWal(). That used to always respect
wal_sender_timeout and kill the session when waiting long enough
because no feedback/ping messages can be sent while the slot is still
being created.
Introduce the notion that last_reply_timestamp = 0 means that the
walsender currently doesn't need timeout processing to avoid that
problem. Use that notion for CREATE_REPLICATION_SLOT ... LOGICAL.

Bugreport and initial patch by Steve Singer, revised by me.
2014-05-29 00:32:09 +02:00
Heikki Linnakangas d1d50bff24 Minor refactoring of jsonb_util.c
The only caller of compareJsonbScalarValue that needed locale-sensitive
comparison of strings was also the only caller that didn't just check for
equality. Separate the two cases for clarity: compareJsonbScalarValue now
does locale-sensitive comparison, and a new function,
equalsJsonbScalarValue, just checks for equality.
2014-05-28 23:48:02 +03:00
Heikki Linnakangas b3e5cfd5f9 Jsonb comparison bug fixes.
Fix an over-zealous assertion, which didn't take into account that sometimes
a scalar element can be compared against an array/object element.

Avoid comparing possibly-uninitialized local variables when end-of-array or
end-of-object is reached. Also fix and enhance comments a bit.

Peter Geoghegan, per reports by Pavel Stehule and me.
2014-05-28 22:47:04 +03:00
Tom Lane 45b7abe59e Fix bogus %name-prefix option syntax in all our Bison files.
%name-prefix doesn't use an "=" sign according to the Bison docs, but it
silently accepted one anyway, until Bison 3.0.  This was originally a
typo of mine in commit 012abebab1, and we
seem to have slavishly copied the error into all the other grammar files.

Per report from Vik Fearing; analysis by Peter Eisentraut.

Back-patch to all active branches, since somebody might try to build
a back branch with up-to-date tools.
2014-05-28 15:41:53 -04:00
Magnus Hagander 8232d6df4c Ensure cleanup in case of early errors in streaming base backups
Move the code that sends the initial status information as well as the
calculation of paths inside the ENSURE_ERROR_CLEANUP block. If this code
failed, we would "leak" a counter of number of concurrent backups, thereby
making the system always believe it was in backup mode. This could happen
if the sending failed (which it probably never did given that the small
amount of data to send would never cause a flush) or if the psprintf calls
ran out of memory. Both are very low risk, but all operations after
do_pg_start_backup should be protected.
2014-05-28 12:43:29 +02:00
Peter Eisentraut 0a5faaa907 Small typo and formatting fixes in postgresql.conf.sample 2014-05-25 23:21:41 -04:00
Heikki Linnakangas 8da3183780 Fix error when trying to delete page with half-dead left sibling.
The new page deletion code didn't cope with the case the target page's
right sibling was marked half-dead. It failed a sanity check which checked
that the downlinks in the parent page match the lower level, because a
half-dead page has no downlink. To cope, check for that condition, and
just give up on the deletion if it happens. The vacuum will finish the
deletion of the half-dead page when it gets there, and on the next vacuum
after that the empty can be deleted.

Reported by Jeff Janes.
2014-05-25 18:18:09 -04:00
Andres Freund 9fa93530c8 Don't allocate memory inside an Assert() iff in a critical section.
HeapTupleHeaderGetCmax() asserts that it is only used if the tuple has
been updated by the current transaction. That check is correct and
sensible but requires allocating memory if xmax is a multixact. When
wal_level is set to logical cmax needs to be included in a wal record
, generated inside a critical section, which can trigger the assertion
added in 4a170ee9e.

Reported-By: Steve Singer
2014-05-25 17:54:53 +02:00
Andres Freund 0564bbe7a1 Silence a couple of spurious valgrind warnings in inval.c.
Define padding bytes in SharedInvalidationMessage structs to be
defined. Otherwise the sinvaladt.c ringbuffer, which is accessed by
multiple processes, will cause spurious valgrind warnings about
undefined memory being used. That's because valgrind remembers the
undefined bytes from the last local process's store, not realizing
that another process has written since, filling the previously
uninitialized bytes.
2014-05-24 17:34:22 +02:00
Heikki Linnakangas 57b7e83b0d Fix misc typos in comments. 2014-05-23 08:16:21 -04:00
Robert Haas 11ad3b35c2 Remove unnecessary cleanup code.
This is all inside a block guarded by op == DSM_OP_ATTACH, so it can
never be the case that op == DSM_OP_CREATE.

Reported by Coverity.
2014-05-22 10:41:48 -04:00
Fujii Masao 19a683f69f Fix typos in comments. 2014-05-22 12:43:50 +09:00
Heikki Linnakangas 51f41e8c0a Fix typos in comments. 2014-05-21 23:19:01 -04:00
Tom Lane e416830a29 Prevent auto_explain from changing the output of a user's EXPLAIN.
Commit af7914c662, which introduced the
EXPLAIN (TIMING) option, for some reason coded explain.c to look at
planstate->instrument->need_timer rather than es->timing to decide
whether to print timing info.  However, the former flag might get set
as a result of contrib/auto_explain wanting timing information.  We
certainly don't want activation of auto_explain to change user-visible
statement behavior, so fix that.

Also fix an independent bug introduced in the same patch: in the code
path for a never-executed node with a machine-friendly output format,
if timing was selected, it would fail to print the Actual Rows and Actual
Loops items.

Per bug #10404 from Tomonari Katsumata.  Back-patch to 9.2 where the
faulty code was introduced.
2014-05-20 12:20:47 -04:00
Tom Lane a0841ecd25 Update obsolete comment.
Peter Geoghegan
2014-05-19 16:38:49 -04:00
Heikki Linnakangas c91a9b5a28 Fix backup-block numbering in redo of b-tree split.
I got the backup block numbers off-by-one in the commit that changed the
way incomplete-splits are handled. I blame the comments, which said
"backup block 1" and "backup block 2", even though the backup blocks
are numbered starting from 0, in the macros and functions used in replay.
Fix the comments and the code.

Per Jeff Janes' bug report about corruption caused by torn page writes.
The incorrect code is new in git master, but backpatch the comment change
down to 9.0, where the numbering in the redo-side macros  was changed.
2014-05-19 13:28:04 +03:00
Tom Lane 0c19aaba22 Ooops, I broke initdb with that last patch.
That's what I get for not fully retesting the final version of the patch.
The replace_allowed cross-check needs an additional special case for
bootstrapping.
2014-05-18 18:17:55 -04:00
Tom Lane 078b2ed291 Fix two ancient memory-leak bugs in relcache.c.
RelationCacheInsert() ignored the possibility that hash_search(HASH_ENTER)
might find a hashtable entry already present for the same OID.  However,
that can in fact occur during recursive relcache load scenarios.  When it
did happen, we overwrote the pointer to the pre-existing Relation, causing
a session-lifespan leakage of that entire structure.  As far as is known,
the pre-existing Relation would always have reference count zero by the
time we arrive back at the outer insertion, so add code that deletes the
pre-existing Relation if so.  If by some chance its refcount is positive,
elog a WARNING and allow the pre-existing Relation to be leaked as before.

Also, AttrDefaultFetch() was sloppy about leaking the cstring form of the
pg_attrdef.adbin value it's copying into the relcache structure.  This is
only a query-lifespan leakage, and normally not very significant, but it
adds up during CLOBBER_CACHE testing.

These bugs are of very ancient vintage, but I'll refrain from back-patching
since there's no evidence that these leaks amount to anything in ordinary
usage.
2014-05-18 16:51:46 -04:00
Tom Lane 44cd47c1d4 Make fallback implementation of pg_memory_barrier() work.
The fallback implementation involves acquiring and releasing a spinlock
variable that is otherwise unreferenced --- not even to the extent of
initializing it.  This accidentally fails to fail on platforms where
spinlocks should be initialized to zeroes, but elsewhere it results in
a "stuck spinlock" failure during startup.

I griped about this last July, and put in a hack that worked for gcc
on HPPA, but didn't get around to fixing the general case.  Per the
discussion back then, the best thing to do seems to be to initialize
dummy_spinlock in main.c.
2014-05-17 18:29:46 -04:00
Tom Lane c1907f0cc4 Fix a bunch of functions that were declared static then defined not-static.
Per testing with a compiler that whines about this.
2014-05-17 17:57:53 -04:00
Tom Lane 6c42b2b10a Fix unaligned accesses in DecodeUpdate().
The xl_heap_header_len structures in an XLOG_HEAP_UPDATE record aren't
necessarily aligned adequately.  The regular replay function for these
records is aware of that, but decode.c didn't get the memo.  I'm not
sure why the buildfarm failed to catch this; the test_decoding test
certainly blows up real good on my old HPPA box.

Also, I'm pretty sure that the address arithmetic was wrong for the
case of XLOG_HEAP_CONTAINS_OLD and not XLOG_HEAP_CONTAINS_NEW_TUPLE,
though this apparently can't happen when logical decoding is active.
2014-05-17 15:53:21 -04:00
Heikki Linnakangas a3655dd4a5 Update README, we don't do post-recovery cleanup actions anymore.
transam/README explained how B-tree incomplete splits were tracked and
fixed after recovery, as an example of handling complex actions that need
multiple WAL records, but that's not how it works anymore. Explain the new
paradigm.
2014-05-17 13:55:03 +03:00
Tom Lane 7894ac5004 Make sure chr(int) can't create invalid UTF8 sequences.
Several years ago we changed chr(int) so that if the database encoding is
UTF8, it would interpret its argument as a Unicode code point and expand it
into the appropriate multibyte sequence.  However, we weren't sufficiently
careful about checking validity of the input.  According to RFC3629, UTF8
disallows code points above U+10FFFF (note that the predecessor standard
RFC2279 was more liberal).  Also, both versions of the UTF8 spec agree
that Unicode surrogate-pair codes should never appear in UTF8.  Because
our encoding validity checks follow RFC3629, our failure to enforce these
restrictions in chr() means it could be used to produce text strings that
will be rejected when the database is dumped and reloaded.  To ensure
consistency with the input functions, let's actually apply
pg_utf8_islegal() to the proposed output of chr().

Per discussion, this seems like too much of a behavioral change to
back-patch, but it's not too late to squeeze it into 9.4.
2014-05-16 16:51:28 -04:00
Heikki Linnakangas 03e2b1017c Fix thinko in logical decoding of commit-prepared records.
The decoding of prepared transaction commits accidentally used the XID of
the transaction performing the COMMIT PREPARED, not the XID of the prepared
transaction. Before bb38fb0d43 that lead to those transactions not being
decoded, afterwards to a assertion failure.
2014-05-16 10:53:10 +03:00
Heikki Linnakangas 07a4a93a0e Initialize tsId and dbId fields in WAL record of COMMIT PREPARED.
Commit dd428c79 added dbId and tsId to the xl_xact_commit struct but missed
that prepared transaction commits reuse that struct. Fix that.

Because those fields were left unitialized, replaying a commit prepared WAL
record in a hot standby node would fail to remove the relcache init file.
That can lead to "could not open file" errors on the standby. Relcache init
file only needs to be removed when a system table/index is rewritten in the
transaction using two phase commit, so that should be rare in practice. In
HEAD, the incorrect dbId/tsId values are also used for filtering in logical
replication code, causing the transaction to always be filtered out.

Analysis and fix by Andres Freund. Backpatch to 9.0 where hot standby was
introduced.
2014-05-16 10:10:38 +03:00
Tom Lane f62d417825 Fix unportable setvbuf() usage in initdb.
In yesterday's commit 2dc4f011fd, I tried
to force buffering of stdout/stderr in initdb to be what it is by
default when the program is run interactively on Unix (since that's how
most manual testing is done).  This tripped over the fact that Windows
doesn't support _IOLBF mode.  We dealt with that a long time ago in
syslogger.c by falling back to unbuffered mode on Windows.  Export that
solution in port.h and use it in initdb.

Back-patch to 8.4, like the previous commit.
2014-05-15 15:57:54 -04:00
Heikki Linnakangas 8f9b9590d7 Handle duplicate XIDs in txid_snapshot.
The proc array can contain duplicate XIDs, when a transaction is just being
prepared for two-phase commit. To cope, remove any duplicates in
txid_current_snapshot(). Also ignore duplicates in the input functions, so
that if e.g. you have an old pg_dump file that already contains duplicates,
it will be accepted.

Report and fix by Jan Wieck. Backpatch to all supported versions.
2014-05-15 18:29:20 +03:00
Heikki Linnakangas bb38fb0d43 Fix race condition in preparing a transaction for two-phase commit.
To lock a prepared transaction's shared memory entry, we used to mark it
with the XID of the backend. When the XID was no longer active according
to the proc array, the entry was implicitly considered as not locked
anymore. However, when preparing a transaction, the backend's proc array
entry was cleared before transfering the locks (and some other state) to
the prepared transaction's dummy PGPROC entry, so there was a window where
another backend could finish the transaction before it was in fact fully
prepared.

To fix, rewrite the locking mechanism of global transaction entries. Instead
of an XID, just have simple locked-or-not flag in each entry (we store the
locking backend's backend id rather than a simple boolean, but that's just
for debugging purposes). The backend is responsible for explicitly unlocking
the entry, and to make sure that that happens, install a callback to unlock
it on abort or process exit.

Backpatch to all supported versions.
2014-05-15 16:37:50 +03:00
Heikki Linnakangas ff810b4928 Misc message style and doc fixes.
Euler Taveira
2014-05-15 14:49:11 +03:00
Tom Lane b23b0f5588 Code review for recent changes in relcache.c.
rd_replidindex should be managed the same as rd_oidindex, and rd_keyattr
and rd_idattr should be managed like rd_indexattr.  Omissions in this area
meant that the bitmapsets computed for rd_keyattr and rd_idattr would be
leaked during any relcache flush, resulting in a slow but permanent leak in
CacheMemoryContext.  There was also a tiny probability of relcache entry
corruption if we ran out of memory at just the wrong point in
RelationGetIndexAttrBitmap.  Otherwise, the fields were not zeroed where
expected, which would not bother the code any AFAICS but could greatly
confuse anyone examining the relcache entry while debugging.

Also, create an API function RelationGetReplicaIndex rather than letting
non-relcache code be intimate with the mechanisms underlying caching of
that value (we won't even mention the memory leak there).

Also, fix a relcache flush hazard identified by Andres Freund:
RelationGetIndexAttrBitmap must not assume that rd_replidindex stays valid
across index_open.

The aspects of this involving rd_keyattr date back to 9.3, so back-patch
those changes.
2014-05-14 14:56:08 -04:00
Heikki Linnakangas f35aef415a Fix harmless access to uninitialized memory.
When cache invalidations arrive while ri_LoadConstraintInfo() is busy
filling a new cache entry, InvalidateConstraintCacheCallBack() compares
the - not yet initialized - oidHashValue field with the to-be-invalidated
hash value. To fix, check whether the entry is already marked as invalid.

Andres Freund
2014-05-13 19:18:28 +03:00
Tom Lane 195e81aff5 Find postgresql.auto.conf in PGDATA even when postgresql.conf is elsewhere.
The original coding for ALTER SYSTEM made a fundamentally bogus assumption
that postgresql.auto.conf could be sought relative to the main config file
if we hadn't yet determined the value of data_directory.  This fails for
common arrangements with the config file elsewhere, as reported by
Christoph Berg.

The simplest fix is to not try to read postgresql.auto.conf until after
SelectConfigFiles has chosen (and locked down) the data_directory setting.

Because of the logic in ProcessConfigFile for handling resetting of GUCs
that've been removed from the config file, we cannot easily read the main
and auto config files separately; so this patch adopts a brute force
approach of reading the main config file twice during postmaster startup.
That's a tad ugly, but the actual time cost is likely to be negligible,
and there's no time for a more invasive redesign before beta.

With this patch, any attempt to set data_directory via ALTER SYSTEM
will be silently ignored.  It would probably be better to throw an
error, but that can be dealt with later.  This bug, however, would
prevent any testing of ALTER SYSTEM by a significant fraction of the
userbase, so it seems important to get it fixed before beta.
2014-05-11 15:13:30 -04:00
Tom Lane 12e611d43e Rename jsonb_hash_ops to jsonb_path_ops.
There's no longer much pressure to switch the default GIN opclass for
jsonb, but there was still some unhappiness with the name "jsonb_hash_ops",
since hashing is no longer a distinguishing property of that opclass,
and anyway it seems like a relatively minor detail.  At the suggestion of
Heikki Linnakangas, we'll use "jsonb_path_ops" instead; that captures the
important characteristic that each index entry depends on the entire path
from the document root to the indexed value.

Also add a user-facing explanation of the implementation properties of
these two opclasses.
2014-05-11 12:06:04 -04:00
Peter Eisentraut e136271a94 Translation updates 2014-05-10 22:16:59 -04:00
Tom Lane 0d0b2bf175 Rename min_recovery_apply_delay to recovery_min_apply_delay.
Per discussion, this seems like a more consistent choice of name.

Fabrízio de Royes Mello, after a suggestion by Peter Eisentraut;
some additional documentation wordsmithing by me
2014-05-10 19:46:19 -04:00
Heikki Linnakangas 866e6e1d04 Fix bug in lossy-page handling in GIN
When returning rows from a bitmap, as done with partial match queries, we
would get stuck in an infinite loop if the bitmap contained a lossy page
reference.

This bug is new in master, it was introduced by the patch to allow skipping
items refuted by other entries in GIN scans.

Report and fix by Alexander Korotkov
2014-05-10 23:28:26 +03:00
Tom Lane 3d8c2b496f Fix broken allocation logic in recently-rewritten jsonb_util.c.
reserveFromBuffer() failed to consider the possibility that it needs to
more-than-double the current buffer size.  Beyond that, it seems likely
that we'd someday need to worry about integer overflow of the buffer
length variable.  Rather than reinvent the logic that's already been
debugged in stringinfo.c, let's go back to using that logic.  We can
still have the same targeted API, but we'll rely on stringinfo.c to
manage reallocation.

Per report from Alexander Korotkov.
2014-05-09 18:24:17 -04:00
Tom Lane 0ca6bda8e7 Get rid of bogus dependency on typcategory in to_json() and friends.
These functions were relying on typcategory to identify arrays and
composites, which is not reliable and not the normal way to do it.
Using typcategory to identify boolean, numeric types, and json itself is
also pretty questionable, though the code in those cases didn't seem to be
at risk of anything worse than wrong output.  Instead, use the standard
lsyscache functions to identify arrays and composites, and rely on a direct
check of the type OID for the other cases.

In HEAD, also be sure to look through domains so that a domain is treated
the same as its base type for conversions to JSON.  However, this is a
small behavioral change; given the lack of field complaints, we won't
back-patch it.

In passing, refactor so that there's only one copy of the code that decides
which conversion strategy to apply, not multiple copies that could (and
have) gotten out of sync.
2014-05-09 12:55:31 -04:00
Robert Haas f1d8dd3647 Code review for logical decoding patch.
Post-commit review identified a number of places where addition was
used instead of multiplication or memory wasn't zeroed where it should
have been.  This commit also fixes one case where a structure member
was mis-initialized, and moves another memory allocation closer to
the place where the allocated storage is used for clarity.

Andres Freund
2014-05-09 10:44:04 -04:00
Robert Haas b2dada8f5f Remove overeager assertion in logical_heap_begin_rewrite.
It's legal to configure wal_level=logical and max_replication_slots=0
simultaneously.

Andres Freund
2014-05-09 10:36:12 -04:00
Tom Lane 62e57ff040 Teach add_json() that jsonb is of TYPCATEGORY_JSON.
This code really needs to be refactored so that there aren't so many copies
that can diverge.  Not to mention that this whole approach is probably
wrong.  But for the moment I'll just stick my finger in the dike.
Per report from Michael Paquier.
2014-05-09 09:44:11 -04:00
Heikki Linnakangas d9daff0e0c More jsonb cleanup.
Fix JSONB_MAX_ELEMS and JSONB_MAX_PAIRS macros to use CB_MASK in the
calculation. JENTRY_POSMASK happens to have the same value at the moment,
but that's just coincidental.

Refactor jsonb iterator functions, for readability.

Get rid of the JENTRY_ISFIRST flag. Whenever we handle JEntrys, we have
access to the whole array and have enough context information to know
which entry is the first. This frees up one bit in the JEntry header for
future use. While we're at it, shuffle the JEntry bits so that boolean
true and false go together, for aesthetic reasons.

Bump catalog version as this changes the on-disk format slightly.
2014-05-09 15:55:56 +03:00
Tom Lane 46dddf7673 Improve key representation for GIN jsonb_ops, and fix existence-search bug.
Change the key representation so that values that would exceed 127 bytes
are hashed into short strings, and so that the original JSON datatype of
each value is recorded in the index.  The hashing rule eliminates the major
objection to having this opclass be the default for jsonb, namely that it
could fail for plausible input data (due to GIN's restrictions on maximum
key length).  Preserving datatype information doesn't really buy us much
right now, but it requires no extra space compared to the previous way,
and it might be useful later.

Also, change the consistency-checking functions to request recheck for
exists (jsonb ? text) and related operators.  The original analysis that
this is an exactly checkable query was incorrect, since the index does
not preserve information about whether a key appears at top level in
the indexed JSON object.  Add a test case demonstrating the problem.

Make some other, mostly cosmetic improvements to the code in jsonb_gin.c
as well.

catversion bump due to on-disk data format change in jsonb_ops indexes.
2014-05-09 08:41:26 -04:00
Heikki Linnakangas ff7bbb0176 Minor cleanup of jsonb_util.c
Move the functions around to group related functions together. Remove
binequal argument from lengthCompareJsonbStringValue, moving that
responsibility to lengthCompareJsonbPair. Fix typo in comment.
2014-05-09 13:09:59 +03:00
Heikki Linnakangas d3c72e23df Avoid some pnstrdup()s when constructing jsonb
This speeds up text to jsonb parsing and hstore to jsonb conversions
somewhat.
2014-05-09 12:46:21 +03:00
Tom Lane b910d7ea35 Increase the default value of effective_cache_size to 4GB.
Per discussion, the old value of 128MB is ridiculously small on modern
machines; in fact, it's not even any larger than the default value of
shared_buffers, which it certainly should be.  Increase to 4GB, which
is unlikely to be any worse than the old default for anyone, and should
be noticeably better for most.  Eventually we might have an autotuning
scheme for this setting, but the recent attempt crashed and burned,
so for now just do this.
2014-05-08 21:11:47 -04:00
Tom Lane a16d421ca4 Revert "Auto-tune effective_cache size to be 4x shared buffers"
This reverts commit ee1e5662d8, as well as
a remarkably large number of followup commits, which were mostly concerned
with the fact that the implementation didn't work terribly well.  It still
doesn't: we probably need some rather basic work in the GUC infrastructure
if we want to fully support GUCs whose default varies depending on the
value of another GUC.  Meanwhile, it also emerged that there wasn't really
consensus in favor of the definition the patch tried to implement (ie,
effective_cache_size should default to 4 times shared_buffers).  So whack
it all back to where it was.  In a followup commit, I'll do what was
recently agreed to, which is to simply change the default to a higher
value.
2014-05-08 20:49:38 -04:00
Heikki Linnakangas 4f7bb4b2a3 Protect against torn pages when deleting GIN list pages.
To-be-deleted list pages contain no useful information, as they are being
deleted, but we must still protect the writes from being torn by a crash
after a partial write. To do that, re-initialize the pages on WAL replay.

Jeff Janes caught this with a test program to test partial writes.
Backpatch to all supported versions.
2014-05-08 14:50:22 +03:00
Robert Haas be7558162a When a background worker exists with code 0, unregister it.
The previous behavior was to restart immediately, which was generally
viewed as less useful.

Petr Jelinek, with some adjustments by me.
2014-05-07 17:44:42 -04:00
Robert Haas eee6cf1f33 When a bgworker exits, always call ReleasePostmasterChildSlot.
Commit e2ce9aa27b was insufficiently
well thought out.  Repair.
2014-05-07 16:30:23 -04:00
Robert Haas 970d1f76d1 Restart bgworkers immediately after a crash-and-restart cycle.
Just as we would start bgworkers immediately after an initial startup
of the server, we should restart them immediately when reinitializing.

Petr Jelinek and Robert Haas
2014-05-07 16:19:35 -04:00
Heikki Linnakangas 364ddc3e5c Clean up jsonb code.
The main target of this cleanup is the convertJsonb() function, but I also
touched a lot of other things that I spotted into in the process.

The new convertToJsonb() function uses an output buffer that's resized on
demand, so the code to estimate of the size of JsonbValue is removed.

The on-disk format was not changed, even though I refactored the structs
used to handle it. The term "superheader" is replaced with "container".

The jsonb_exists_any and jsonb_exists_all functions no longer sort the input
array. That was a premature optimization, the idea being that if there are
duplicates in the input array, you only need to check them once. Also,
sorting the array saves some effort in the binary search used to find a key
within an object. But there were drawbacks too: the sorting and
deduplicating obviously isn't free, and in the typical case there are no
duplicates to remove, and the gain in the binary search was minimal. Remove
all that, which makes the code simpler too.

This includes a bug-fix; the total length of the elements in a jsonb array
or object mustn't exceed 2^28. That is now checked.
2014-05-07 23:16:19 +03:00
Robert Haas 4d155d8b08 Detach shared memory from bgworkers without shmem access.
Since the postmaster won't perform a crash-and-restart sequence
for background workers which don't request shared memory access,
we'd better make sure that they can't corrupt shared memory.

Patch by me, review by Tom Lane.
2014-05-07 14:56:49 -04:00
Tom Lane 04e5025be8 Fix failure to set ActiveSnapshot while rewinding a cursor.
ActiveSnapshot needs to be set when we call ExecutorRewind because some
plan node types may execute user-defined functions during their ReScan
calls (nodeLimit.c does so, at least).  The wisdom of that is somewhat
debatable, perhaps, but for now the simplest fix is to make sure the
required context is valid.  Failure to do this typically led to a
null-pointer-dereference core dump, though it's possible that in more
complex cases a function could be executed with the wrong snapshot
leading to very subtle misbehavior.

Per report from Leif Jensen.  It's been broken for a long time, so
back-patch to all active branches.
2014-05-07 14:25:11 -04:00
Robert Haas e2ce9aa27b Never crash-and-restart for bgworkers without shared memory access.
The motivation for a crash and restart cycle when a backend dies is
that it might have corrupted shared memory on the way down; and we
can't recover reliably except by reinitializing everything.  But that
doesn't apply to processes that don't touch shared memory.  Currently,
there's nothing to prevent a background worker that doesn't request
shared memory access from touching shared memory anyway, but that's a
separate bug.

Previous to this commit, the coding in postmaster.c was inconsistent:
an exit status other than 0 or 1 didn't provoke a crash-and-restart,
but failure to release the postmaster child slot did.  This change
makes those cases consistent.
2014-05-07 13:19:02 -04:00
Tom Lane 1891b415f0 Fix some more confusion between uint32 and Datum. 2014-05-06 23:52:30 -04:00
Tom Lane 2c22afaa4e hash_any returns Datum, not uint32 (and definitely not "int").
The coding in JsonbHashScalarValue might have accidentally failed to fail
given current representational choices, but the key word there would be
"accidental".  Insert the appropriate datatype conversion macro.  And
use the right conversion macro for hash_numeric's result, too.

In passing make the code a bit cleaner and less repetitive by factoring
out the xor step from the switch.
2014-05-06 22:49:40 -04:00
Jeff Davis 35c0cd3b05 Improve comment for tricky aspect of index-only scans.
Index-only scans avoid taking a lock on the VM buffer, which would
cause a lot of contention. To be correct, that requires some intricate
assumptions that weren't completely documented in the previous
comment.

Reviewed by Robert Haas.
2014-05-06 19:27:43 -07:00
Bruce Momjian 84288a86ac With ecpg exclusion removed, re-run pgindent for 9.4
Report by Tom Lane
2014-05-06 20:39:28 -04:00
Robert Haas e0124230ba Fix logic bug in dsm_attach().
The previous coding would potentially cause attaching to segment A to
fail if segment B was at the same time in the process of going away.

Andres Freund, with a comment tweak by me
2014-05-06 13:40:34 -04:00
Bruce Momjian 0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not effect backpatching.
2014-05-06 12:12:18 -04:00
Simon Riggs 2e54d88af1 Correct comment in Hot Standby nbtree handling
Logic is correct, matching handling of LP_DEAD elsewhere.
2014-05-06 14:44:18 +01:00
Heikki Linnakangas 3a8e9e977f Fix use of free in walsender error handling after a sysid mismatch.
Found via valgrind. The bug exists since the introduction of the walsender,
so backpatch to 9.0.

Andres Freund
2014-05-06 15:17:41 +03:00
Tom Lane 0f928a85ec Fix possible cache invalidation failure in ReceiveSharedInvalidMessages.
Commit fad153ec45 modified sinval.c to reduce
the number of calls into sinvaladt.c (which require taking a shared lock)
by keeping a local buffer of collected-but-not-yet-processed messages.
However, if processing of the last message in a batch resulted in a
recursive call to ReceiveSharedInvalidMessages, we could overwrite that
message with a new one while the outer invalidation function was still
working on it.  This would be likely to lead to invalidation of the wrong
cache entry, allowing subsequent processing to use stale cache data.
The fix is just to make a local copy of each message while we're processing
it.

Spotted by Andres Freund.  Back-patch to 8.4 where the bug was introduced.
2014-05-05 14:43:39 -04:00
Heikki Linnakangas 377790fbd7 Pass sensible value to memset() when randomizing reorderbuffer's tuple slab.
This is entirely harmless, but still wrong. Noticed by coverity.

Andres Freund
2014-05-05 16:22:15 +03:00
Heikki Linnakangas c834576839 Use Size instead of uint32 to store result of sizeof()
Silences coverity and is more consistent with other functions in the
same file.

Andres Freund
2014-05-05 16:17:16 +03:00
Heikki Linnakangas 1460b199e6 Assert that pre/post-fix updated tuples are on the same page during replay.
If they were not 'oldtup.t_data' would be dereferenced while set to NULL
in case of a full page image for block 0.

Do so primarily to silence coverity; but also to make sure this prerequisite
isn't changed without adapting the replay routine as that would appear to
work in many cases.

Andres Freund
2014-05-05 16:15:25 +03:00
Tom Lane 91e16b9806 Fix yet another corner case in dumping rules/views with USING clauses.
ruleutils.c tries to cope with additions/deletions/renamings of columns in
tables referenced by views, by means of adding machine-generated aliases to
the printed form of a view when needed to preserve the original semantics.
A recent blog post by Marko Tiikkaja pointed out a case I'd missed though:
if one input of a join with USING is itself a join, there is nothing to
stop the user from adding a column of the same name as the USING column to
whichever side of the sub-join didn't provide the USING column.  And then
there'll be an error when the view is re-parsed, since now the sub-join
exposes two columns matching the USING specification.  We were catching a
lot of related cases, but not this one, so add some logic to cope with it.

Back-patch to 9.3, which is the first release that makes any serious
attempt to cope with such cases (cf commit 2ffa740be and follow-ons).
2014-05-01 20:22:37 -04:00
Tom Lane 3f8c8e3c61 Fix failure to detoast fields in composite elements of structured types.
If we have an array of records stored on disk, the individual record fields
cannot contain out-of-line TOAST pointers: the tuptoaster.c mechanisms are
only prepared to deal with TOAST pointers appearing in top-level fields of
a stored row.  The same applies for ranges over composite types, nested
composites, etc.  However, the existing code only took care of expanding
sub-field TOAST pointers for the case of nested composites, not for other
structured types containing composites.  For example, given a command such
as

UPDATE tab SET arraycol = ARRAY[(ROW(x,42)::mycompositetype] ...

where x is a direct reference to a field of an on-disk tuple, if that field
is long enough to be toasted out-of-line then the TOAST pointer would be
inserted as-is into the array column.  If the source record for x is later
deleted, the array field value would become a dangling pointer, leading
to errors along the line of "missing chunk number 0 for toast value ..."
when the value is referenced.  A reproducible test case for this was
provided by Jan Pecek, but it seems likely that some of the "missing chunk
number" reports we've heard in the past were caused by similar issues.

Code-wise, the problem is that PG_DETOAST_DATUM() is not adequate to
produce a self-contained Datum value if the Datum is of composite type.
Seen in this light, the problem is not just confined to arrays and ranges,
but could also affect some other places where detoasting is done in that
way, for example form_index_tuple().

I tried teaching the array code to apply toast_flatten_tuple_attribute()
along with PG_DETOAST_DATUM() when the array element type is composite,
but this was messy and imposed extra cache lookup costs whether or not any
TOAST pointers were present, indeed sometimes when the array element type
isn't even composite (since sometimes it takes a typcache lookup to find
that out).  The idea of extending that approach to all the places that
currently use PG_DETOAST_DATUM() wasn't attractive at all.

This patch instead solves the problem by decreeing that composite Datum
values must not contain any out-of-line TOAST pointers in the first place;
that is, we expand out-of-line fields at the point of constructing a
composite Datum, not at the point where we're about to insert it into a
larger tuple.  This rule is applied only to true composite Datums, not
to tuples that are being passed around the system as tuples, so it's not
as invasive as it might sound at first.  With this approach, the amount
of code that has to be touched for a full solution is greatly reduced,
and added cache lookup costs are avoided except when there actually is
a TOAST pointer that needs to be inlined.

The main drawback of this approach is that we might sometimes dereference
a TOAST pointer that will never actually be used by the query, imposing a
rather large cost that wasn't there before.  On the other side of the coin,
if the field value is used multiple times then we'll come out ahead by
avoiding repeat detoastings.  Experimentation suggests that common SQL
coding patterns are unaffected either way, though.  Applications that are
very negatively affected could be advised to modify their code to not fetch
columns they won't be using.

In future, we might consider reverting this solution in favor of detoasting
only at the point where data is about to be stored to disk, using some
method that can drill down into multiple levels of nested structured types.
That will require defining new APIs for structured types, though, so it
doesn't seem feasible as a back-patchable fix.

Note that this patch changes HeapTupleGetDatum() from a macro to a function
call; this means that any third-party code using that macro will not get
protection against creating TOAST-pointer-containing Datums until it's
recompiled.  The same applies to any uses of PG_RETURN_HEAPTUPLEHEADER().
It seems likely that this is not a big problem in practice: most of the
tuple-returning functions in core and contrib produce outputs that could
not possibly be toasted anyway, and the same probably holds for third-party
extensions.

This bug has existed since TOAST was invented, so back-patch to all
supported branches.
2014-05-01 15:19:06 -04:00
Tom Lane 203b0d132f Improve error messages in reorderbuffer.c.
Be more clear about failure cases in relfilenode->relation lookup,
and fix some other places that were inconsistent or not per our
message style guidelines.

Andres Freund and Tom Lane
2014-04-30 18:16:53 -04:00
Robert Haas 5ec45bb7fa Consistently allow reading of messages from a detached shm_mq.
This was intended to work always, but the previous code only allowed
it if at least one message was successfully read by the receiver
before the sender detached the queue.

Report by Petr Jelinek.  Patch by me.
2014-04-30 17:38:18 -04:00
Tom Lane 2d00190495 Rationalize common/relpath.[hc].
Commit a730183926 created rather a mess by
putting dependencies on backend-only include files into include/common.
We really shouldn't do that.  To clean it up:

* Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
catalog/catalog.h.  We won't consider this symbol part of the FE/BE API.

* Push enum ForkNumber from relfilenode.h into relpath.h.  We'll consider
relpath.h as the source of truth for fork numbers, since relpath.c was
already partially serving that function, and anyway relfilenode.h was
kind of a random place for that enum.

* So, relfilenode.h now includes relpath.h rather than vice-versa.  This
direction of dependency is fine.  (That allows most, but not quite all,
of the existing explicit #includes of relpath.h to go away again.)

* Push forkname_to_number from catalog.c to relpath.c, just to centralize
fork number stuff a bit better.

* Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
that the previous commit didn't keep this together with relpath().

* To avoid needing relfilenode.h in common/, redefine the underlying
function (now called GetRelationPath) as taking separate OID arguments,
and make the APIs using RelFileNode or RelFileNodeBackend into macro
wrappers.  (The macros have a potential multiple-eval risk, but none of
the existing call sites have an issue with that; one of them had such a
risk already anyway.)

* Fix failure to follow the directions when "init" fork type was added;
specifically, the errhint in forkname_to_number wasn't updated, and neither
was the SGML documentation for pg_relation_size().

* Fix tablespace-path-too-long check in CreateTableSpace() to account for
fork-name component of maximum-length pathnames.  This requires putting
FORKNAMECHARS into a header file, but it was rather useless (and
actually unreferenced) where it was.

The last couple of items are potentially back-patchable bug fixes,
if anyone is sufficiently excited about them; but personally I'm not.

Per a gripe from Christoph Berg about how include/common wasn't
self-contained.
2014-04-30 17:30:50 -04:00
Tom Lane 0bff398761 Check for interrupts and stack overflow during rule/view dumps.
Since ruleutils.c recurses, it could be driven to stack overflow by
deeply nested constructs.  Very large queries might also take long
enough to deparse that a check for interrupts seems like a good idea.
Stick appropriate tests into a couple of key places.

Noted by Greg Stark.  Back-patch to all supported branches.
2014-04-30 13:46:13 -04:00
Tom Lane 41de93c53a Reduce indentation/parenthesization of set operations in rule/view dumps.
A query such as "SELECT x UNION SELECT y UNION SELECT z UNION ..."
produces a left-deep nested parse tree, which we formerly showed in its
full nested glory and with all the possible parentheses.  This does little
for readability, though, and long UNION lists resulting in excessive
indentation are common.  Instead, let's omit parentheses and indent all
the subqueries at the same level in such cases.

This patch skips indentation/parenthesization whenever the lefthand input
of a SetOperationStmt is another SetOperationStmt of the same kind and
ALL/DISTINCT property.  We could teach the code the exact syntactic
precedence of set operations and thereby avoid parenthesization in some
more cases, but it's not clear that that'd be a readability win: it seems
better to parenthesize if the set operation changes.  (As an example,
if there's one UNION in a long list of UNION ALL, it now stands out like
a sore thumb, which seems like a good thing.)

Back-patch to 9.3.  This completes our response to a complaint from Greg
Stark that since commit 62e666400d there's a performance problem in pg_dump
for views containing long UNION sequences (or other types of deeply nested
constructs).  The previous commit 0601cb54da
handles the general problem, but this one makes the specific case of UNION
lists look a lot nicer.
2014-04-30 13:26:26 -04:00
Tom Lane 0601cb54da Limit overall indentation in rule/view dumps.
Continuing to indent no matter how deeply nested we get doesn't really
do anything for readability; what's worse, it results in O(N^2) total
whitespace, which can become a performance and memory-consumption issue.

To address this, once we get past 40 characters of indentation, reduce
the indentation step distance 4x, and also limit the maximum indentation
by reducing it modulo 40.  This latter choice is a bit weird at first
glance, but it seems to preserve readability better than a simple cap
would do.

Back-patch to 9.3, because since commit 62e666400d the performance issue
is a hazard for pg_dump.

Greg Stark and Tom Lane
2014-04-30 12:48:12 -04:00
Tom Lane d166eed302 Fix indentation of JOIN clauses in rule/view dumps.
The code attempted to outdent JOIN clauses further left than the parent
FROM keyword, which was odd in any case, and led to inconsistent formatting
since in simple cases the clauses couldn't be moved any further left than
that.  And it left a permanent decrement of the indentation level, causing
subsequent lines to be much further left than they should be (again, this
couldn't be seen in simple cases for lack of indentation to give up).

After a little experimentation I chose to make it indent JOIN keywords
two spaces from the parent FROM, which is one space more than the join's
lefthand input in cases where that appears on a different line from FROM.

Back-patch to 9.3.  This is a purely cosmetic change, and the bug is quite
old, so that may seem arbitrary; but we are going to be making some other
changes to the indentation behavior in both HEAD and 9.3, so it seems
reasonable to include this in 9.3 too.  I committed this one first because
its effects are more visible in the regression test results as they
currently stand than they will be later.
2014-04-30 12:01:19 -04:00
Tom Lane 95811032d7 Improve planner to drop constant-NULL inputs of AND/OR where it's legal.
In general we can't discard constant-NULL inputs, since they could change
the result of the AND/OR to be NULL.  But at top level of WHERE, we do not
need to distinguish a NULL result from a FALSE result, so it's okay to
treat NULL as FALSE and then simplify AND/OR accordingly.

This is a very ancient oversight, but in 9.2 and later it can lead to
failure to optimize queries that previous releases did optimize, as a
result of more aggressive parameter substitution rules making it possible
to reduce more subexpressions to NULL constants.  This is the root cause of
bug #10171 from Arnold Scheffler.  We could alternatively have fixed that
by teaching orclauses.c to ignore constant-NULL OR arms, but it seems
better to get rid of them globally.

I resisted the temptation to back-patch this change into all active
branches, but it seems appropriate to back-patch as far as 9.2 so that
there will not be performance regressions of the kind shown in this bug.
2014-04-29 13:12:46 -04:00
Heikki Linnakangas d2722443d9 Fix two bugs in WAL-logging of GIN pending-list pages.
In writeListPage, never take a full-page image of the page, because we
have all the information required to re-initialize in the WAL record
anyway. Before this fix, a full-page image was always generated, unless
full_page_writes=off, because when the page is initialized its LSN is
always 0. In stable-branches, keep the code to restore the backup blocks
if they exist, in case that the WAL is generated with an older minor
version, but in master Assert that there are no full-page images.

In the redo routine, add missing "off++". Otherwise the tuples are added
to the page in reverse order. That happens to be harmless because we
always scan and remove all the tuples together, but it was clearly wrong.
Also, it was masked by the first bug unless full_page_writes=off, because
the page was always restored from a full-page image.

Backpatch to all supported versions.
2014-04-28 17:31:01 +03:00
Tom Lane 5035701e07 Improve generation algorithm for database system identifier.
As noted some time ago, the original coding had a typo ("|" for "^")
that made the result less unique than intended.  Even the intended
behavior is obsolete since it was based on wanting to produce a
usable value even if we didn't have int64 arithmetic --- a limitation
we stopped supporting years ago.  Instead, let's redefine the system
identifier as tv_sec in the upper 32 bits (same as before), tv_usec
in the next 20 bits, and the low 12 bits of getpid() in the remaining
bits.  This is still hardly guaranteed-universally-unique, but it's
noticeably better than before.  Per my proposal at
<29019.1374535940@sss.pgh.pa.us>
2014-04-26 15:11:10 -04:00
Tom Lane 39b0c7681e Record the proper typmod for an index expression column.
We should use exprTypmod() to extract the typmod of the expression,
instead of just blindly storing -1.  This seems to have been an aboriginal
oversight in commit fc8d970cbc which
introduced general-expression indexes.  The consequences are only cosmetic
at present, since the index machinery doesn't really look at typmod for
index columns; but still it seems best to describe the column type as
precisely as we can.  Per off-list complaint from Thomas Fanghaenel.
2014-04-26 12:22:09 -04:00
Tom Lane 4bfc5f1396 Fix off-by-one bug in LWLockRegisterTranche().
Original coding failed to enlarge the array as required if
the requested tranche_id was equal to LWLockTranchesAllocated.

In passing, fix poor style of not casting the result of (re)palloc.
2014-04-25 15:59:57 -04:00
Alvaro Herrera 1a917ae861 Fix race when updating a tuple concurrently locked by another process
If a tuple is locked, and this lock is later upgraded either to an
update or to a stronger lock, and in the meantime some other process
tries to lock, update or delete the same tuple, it (the tuple) could end
up being updated twice, or having conflicting locks held.

The reason for this is that the second updater checks for a change in
Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing
the first lock; and if there's a change, it restarts and re-evaluates
its ability to update the tuple.  But it neglected to check for changes
in lock strength or in lock-vs-update status when those two properties
stayed the same.  This would lead it to take the wrong decision and
continue with its own update, when in reality it shouldn't do so but
instead restart from the top.

This could lead to either an assertion failure much later (when a
multixact containing multiple updates is detected), or duplicate copies
of tuples.

To fix, make sure to compare the other relevant infomask bits alongside
the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if
necessary.

Also, in the belt-and-suspenders spirit, add a check to
MultiXactCreateFromMembers that a multixact being created does not have
two or more members that are claimed to be updates.  This should protect
against other bugs that might cause similar bogus situations.

Backpatch to 9.3, where the possibility of multixacts containing updates
was introduced.  (In prior versions it was possible to have the tuple
lock upgraded from shared to exclusive, and an update would not restart
from the top; yet we're protected against a bug there because there's
always a sleep to wait for the locking transaction to complete before
continuing to do anything.  Really, the fact that tuple locks always
conflicted with concurrent updates is what protected against bugs here.)

Per report from Andrew Dunstan and Josh Berkus in thread at
http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com

Bug analysis by Andres Freund.
2014-04-24 15:41:55 -03:00
Tom Lane d19bd29f07 Reset pg_stat_activity.xact_start during PREPARE TRANSACTION.
Once we've completed a PREPARE, our session is not running a transaction,
so its entry in pg_stat_activity should show xact_start as null, rather
than leaving the value as the start time of the now-prepared transaction.

I think possibly this oversight was triggered by faulty extrapolation
from the adjacent comment that says PrepareTransaction should not call
AtEOXact_PgStat, so tweak the wording of that comment.

Noted by Andres Freund while considering bug #10123 from Maxim Boguk,
although this error doesn't seem to explain that report.

Back-patch to all active branches.
2014-04-24 13:29:48 -04:00
Tom Lane f0fedfe82c Allow polymorphic aggregates to have non-polymorphic state data types.
Before 9.4, such an aggregate couldn't be declared, because its final
function would have to have polymorphic result type but no polymorphic
argument, which CREATE FUNCTION would quite properly reject.  The
ordered-set-aggregate patch found a workaround: allow the final function
to be declared as accepting additional dummy arguments that have types
matching the aggregate's regular input arguments.  However, we failed
to notice that this problem applies just as much to regular aggregates,
despite the fact that we had a built-in regular aggregate array_agg()
that was known to be undeclarable in SQL because its final function
had an illegal signature.  So what we should have done, and what this
patch does, is to decouple the extra-dummy-arguments behavior from
ordered-set aggregates and make it generally available for all aggregate
declarations.  We have to put this into 9.4 rather than waiting till
later because it slightly alters the rules for declaring ordered-set
aggregates.

The patch turned out a bit bigger than I'd hoped because it proved
necessary to record the extra-arguments option in a new pg_aggregate
column.  I'd thought we could just look at the final function's pronargs
at runtime, but that didn't work well for variadic final functions.
It's probably just as well though, because it simplifies life for pg_dump
to record the option explicitly.

While at it, fix array_agg() to have a valid final-function signature,
and add an opr_sanity test to notice future deviations from polymorphic
consistency.  I also marked the percentile_cont() aggregates as not
needing extra arguments, since they don't.
2014-04-23 19:17:41 -04:00
Heikki Linnakangas a4ad9afec2 Update obsolete comments.
We no longer have a TLI field in the page header.
2014-04-23 14:41:51 +03:00
Heikki Linnakangas 8fbfbf1472 Fix typos in comment. 2014-04-23 12:56:41 +03:00
Heikki Linnakangas 4fafc4ecd9 Cleanup of new b-tree page deletion code.
When marking a branch as half-dead, a pointer to the top of the branch is
stored in the leaf block's hi-key. During normal operation, the high key
was left in place, and the block number was just stored in the ctid field
of the high key tuple, but in WAL replay, the high key was recreated as a
truncated tuple with zero columns. For the sake of easier debugging, also
truncate the tuple in normal operation, so that the page is identical
after WAL replay. Also, rename the 'downlink' field in the WAL record to
'topparent', as that seems like a more descriptive name. And make sure
it's set to invalid when unlinking the leaf page.
2014-04-23 10:19:54 +03:00
Tom Lane d26b042ce5 Fix documentation of FmgrInfo.fn_nargs.
Some ancient comments claimed that fn_nargs could be -1 to indicate a
variable number of input arguments; but this was never implemented, and
is at variance with what we ultimately did with "variadic" functions.
Update the comments.
2014-04-22 23:22:12 -04:00
Tom Lane c6a4ace5bf Fix broken logic in logical_heap_rewrite_flush_mappings().
It's blatantly obvious that commit 4d0d607a45
wasn't tested.  The leak's real enough, though.
2014-04-22 22:33:35 -04:00
Bruce Momjian cee850c403 revert 4d0d607a45
Revert due to contrib/test_decoding regression failure
2014-04-22 22:21:54 -04:00
Bruce Momjian 4d0d607a45 release memory used while flushing logical mappings
Patch by Ants Aasma
2014-04-22 18:05:44 -04:00
Heikki Linnakangas 4a5d55ec2b Fix bug in the new B-tree incomplete-split code.
Forgot to update LSN of left sibling's page, when creating a new root.
I fixed this for regular insertions and page splits earlier, but missed
new root creation.
2014-04-22 22:40:44 +03:00
Heikki Linnakangas 45e67a2ad7 Fix Gin README.
The README incorrectly claimed that GIN posting tree pages contain an array
of uncompressed items in addition to compressed posting lists. Earlier
versions of the GIN posting list compression patch worked that way, but not
the one that was committed.
2014-04-22 22:39:50 +03:00
Heikki Linnakangas 77fe2b6d79 Fix bug in new B-tree page deletion code.
When modifying a page, must hold an exclusive lock. A shared lock is
obviously not good enough.
2014-04-22 15:34:54 +03:00
Heikki Linnakangas 7e30c186da Retain original physical order of tuples in redo of b-tree splits.
It makes no difference to the system, but minimizing the differences
between a master and standby makes debugging simpler.
2014-04-22 13:03:37 +03:00
Heikki Linnakangas 7d98054f0d Fix rm_desc routine of b-tree page delete records.
A couple of typos from my refactoring of the page deletion patch.
2014-04-22 13:02:52 +03:00
Heikki Linnakangas 8d34f68628 Avoid transient bogus page contents when creating a sequence.
Don't use simple_heap_insert to insert the tuple to a sequence relation.
simple_heap_insert creates a heap insertion WAL record, and replaying that
will create a regular heap page without the special area containing the
sequence magic constant, which is wrong for a sequence. That was not a bug
because we always created a sequence WAL record after that, and replaying
that overwrote the bogus heap page, and the transient state could never be
seen by another backend because it was only done when creating a new
sequence relation. But it's simpler and cleaner to avoid that in the first
place.
2014-04-22 10:40:23 +03:00
Robert Haas fab6170cab Fix typo.
Etsuro Fujita
2014-04-20 16:30:55 +02:00
Magnus Hagander 66b1084e2c Fix typo
Amit Langote
2014-04-18 12:49:54 +02:00
Bruce Momjian 83defef8c7 report stat() error in trigger file check
Permissions might prevent the existence of the trigger file from being
checked.

Per report from Andres Freund
2014-04-17 11:55:57 -04:00
Heikki Linnakangas 2a8e1ac598 Set the all-visible flag on heap page before writing WAL record, not after.
If we set the all-visible flag after writing WAL record, and XLogInsert
takes a full-page image of the page, the image would not include the flag.
We will then proceed to set the VM bit, which would then be set without the
corresponding all-visible flag on the heap page.

Found by comparing page images on master and standby, after writing/replaying
each WAL record. (There is still a discrepancy: the all-visible flag won't
be set after replaying the HEAP_CLEAN record, even though it is set in the
master. However, it will be set when replaying the HEAP2_VISIBLE record and
setting the VM bit, so the all-visible flag and VM bit are always consistent
on the standby, even though they are momentarily out-of-sync with master)

Backpatch to 9.3 where this code was introduced.
2014-04-17 17:47:50 +03:00
Tom Lane 5f86cbd714 Rename EXPLAIN ANALYZE's "total runtime" output to "execution time".
Now that EXPLAIN also outputs a "planning time" measurement, the use of
"total" here seems rather confusing: it sounds like it might include the
planning time which of course it doesn't.  Majority opinion was that
"execution time" is a better label, so we'll call it that.

This should be noted as a backwards incompatibility for tools that examine
EXPLAIN ANALYZE output.

In passing, I failed to resist the temptation to do a little editing on the
materialized-view example affected by this change.
2014-04-16 20:48:59 -04:00
Alvaro Herrera 83ab8e32f2 Fix object identities for text search objects
We were neglecting to schema-qualify them.

Backpatch to 9.3, where object identities were introduced as a concept
by commit f8348ea32e.
2014-04-16 18:25:44 -03:00
Tom Lane cad4fe6455 Use AF_UNSPEC not PF_UNSPEC in getaddrinfo calls.
According to the Single Unix Spec and assorted man pages, you're supposed
to use the constants named AF_xxx when setting ai_family for a getaddrinfo
call.  In a few places we were using PF_xxx instead.  Use of PF_xxx
appears to be an ancient BSD convention that was not adopted by later
standardization.  On BSD and most later Unixen, it doesn't matter much
because those constants have equivalent values anyway; but nonetheless
this code is not per spec.

In the same vein, replace PF_INET by AF_INET in one socket() call, which
wasn't even consistent with the other socket() call in the same function
let alone the remainder of our code.

Per investigation of a Cygwin trouble report from Marco Atzeri.  It's
probably a long shot that this will fix his issue, but it's wrong in
any case.
2014-04-16 13:21:20 -04:00
Robert Haas dfc0219f64 Add to_regprocedure() and to_regoperator().
These are natural complements to the functions added by commit
0886fc6a5c, but they weren't included
in the original patch for some reason.  Add them.

Patch by me, per a complaint by Tom Lane.  Review by Tatsuo
Ishii.
2014-04-16 12:21:43 -04:00
Robert Haas 1a81daab8b Try to fix spurious DSM failures on Windows.
Apparently, Windows can sometimes return an error code even when the
operation actually worked just fine.  Rearrange the order of checks
according to what appear to be the best practices in this area.

Amit Kapila
2014-04-16 12:04:44 -04:00
Bruce Momjian 4180934651 check socket creation errors against PGINVALID_SOCKET
Previously, in some places, socket creation errors were checked for
negative values, which is not true for Windows because sockets are
unsigned.  This masked socket creation errors on Windows.

Backpatch through 9.0.  8.4 doesn't have the infrastructure to fix this.
2014-04-16 10:45:48 -04:00
Heikki Linnakangas 848b9f05ab Use correctly-sized buffer when zero-filling a WAL file.
I mixed up BLCKSZ and XLOG_BLCKSZ when I changed the way the buffer is
allocated a couple of weeks ago. With the default settings, they are both
8k, but they can be changed at compile-time.
2014-04-16 10:26:36 +03:00
Heikki Linnakangas f1dadd34fa Set pd_lower on internal GIN posting tree pages.
This allows squeezing out the unused space in full-page writes. And more
importantly, it can be a useful debugging aid.

In hindsight we should've done this back when GIN was added - we wouldn't
need the 'maxoff' field in the page opaque struct if we had used pd_lower
and pd_upper like on normal pages. But as long as there can be pages in the
index that have been binary-upgraded from pre-9.4 versions, we can't rely
on that, and have to continue using 'maxoff'.

Most of the code churn comes from renaming some macros, now that they're
used on internal pages, too.

This change is completely backwards-compatible, no effect on pg_upgrade.
2014-04-14 21:13:19 +03:00
Tom Lane 4dfb065b3a Fix bogus handling of bad strategy number in GIST consistent() functions.
Make sure we throw an error instead of silently doing the wrong thing when
fed a strategy number we don't recognize.  Also, in the places that did
already throw an error, spell the error message in a way more consistent
with our message style guidelines.

Per report from Paul Jones.  Although this is a bug, it won't occur unless
a superuser tries to do something he shouldn't, so it doesn't seem worth
back-patching.
2014-04-14 11:18:47 -04:00
Heikki Linnakangas e3e6e3af56 Remove dead checks for invalid left page in ginDeletePage.
In some places, the function assumes the left page is valid, and in others,
it checks if it is valid. Remove all the checks.
2014-04-14 15:27:32 +03:00
Heikki Linnakangas 1bd3842163 GIN entry pages follow the standard page layout - tell XLogInsert.
The entry B-tree pages all follow the standard page layout. The 9.3 code has
this right. I inadvertently changed this at some point during the big
refactorings in git master.
2014-04-14 14:51:28 +03:00
Tom Lane e0c91a7ff0 Improve some O(N^2) behavior in window function evaluation.
Repositioning the tuplestore seek pointer in window_gettupleslot() turns
out to be a very significant expense when the window frame is sizable and
the frame end can move.  To fix, introduce a tuplestore function for
skipping an arbitrary number of tuples in one call, parallel to the one we
introduced for tuplesort objects in commit 8d65da1f.  This reduces the cost
of window_gettupleslot() to O(1) if the tuplestore has not spilled to disk.
As in the previous commit, I didn't try to do any real optimization of
tuplestore_skiptuples for the case where the tuplestore has spilled to
disk.  There is probably no practical way to get the cost to less than O(N)
anyway, but perhaps someone can think of something later.

Also fix PersistHoldablePortal() to make use of this API now that we have
it.

Based on a suggestion by Dean Rasheed, though this turns out not to look
much like his patch.
2014-04-13 13:59:17 -04:00
Stephen Frost 5f508b6dea Make a dedicated AlterTblSpcStmt production
Given that ALTER TABLESPACE has moved on from just existing for
general purpose rename/owner changes, it deserves its own top-level
production in the grammar.  This also cleans up the RenameStmt to
only ever be used for actual RENAMEs again- it really wasn't
appropriate to hide non-RENAME productions under there.

Noted by Alvaro.
2014-04-13 01:02:44 -04:00
Tom Lane d95425c8b9 Provide moving-aggregate support for boolean aggregates.
David Rowley and Florian Pflug, reviewed by Dean Rasheed
2014-04-13 00:01:46 -04:00
Stephen Frost 842faa714c Make security barrier views automatically updatable
Views which are marked as security_barrier must have their quals
applied before any user-defined quals are called, to prevent
user-defined functions from being able to see rows which the
security barrier view is intended to prevent them from seeing.

Remove the restriction on security barrier views being automatically
updatable by adding a new securityQuals list to the RTE structure
which keeps track of the quals from security barrier views at each
level, independently of the user-supplied quals.  When RTEs are
later discovered which have securityQuals populated, they are turned
into subquery RTEs which are marked as security_barrier to prevent
any user-supplied quals being pushed down (modulo LEAKPROOF quals).

Dean Rasheed, reviewed by Craig Ringer, Simon Riggs, KaiGai Kohei
2014-04-12 21:04:58 -04:00
Tom Lane 9d229f399e Provide moving-aggregate support for a bunch of numerical aggregates.
First installment of the promised moving-aggregate support in built-in
aggregates: count(), sum(), avg(), stddev() and variance() for
assorted datatypes, though not for float4/float8.

In passing, remove a 2001-vintage kluge in interval_accum(): interval
array elements have been properly aligned since around 2003, but
nobody remembered to take out this workaround.  Also, fix a thinko
in the opr_sanity tests for moving-aggregate catalog entries.

David Rowley and Florian Pflug, reviewed by Dean Rasheed
2014-04-12 20:33:09 -04:00
Tom Lane a9d9acbf21 Create infrastructure for moving-aggregate optimization.
Until now, when executing an aggregate function as a window function
within a window with moving frame start (that is, any frame start mode
except UNBOUNDED PRECEDING), we had to recalculate the aggregate from
scratch each time the frame head moved.  This patch allows an aggregate
definition to include an alternate "moving aggregate" implementation
that includes an inverse transition function for removing rows from
the aggregate's running state.  As long as this can be done successfully,
runtime is proportional to the total number of input rows, rather than
to the number of input rows times the average frame length.

This commit includes the core infrastructure, documentation, and regression
tests using user-defined aggregates.  Follow-on commits will update some
of the built-in aggregates to use this feature.

David Rowley and Florian Pflug, reviewed by Dean Rasheed; additional
hacking by me
2014-04-12 12:03:30 -04:00
Heikki Linnakangas 614167c6d7 Fix bugs in GIN "fast scan" with partial match.
There were a couple of bugs here. First, if the fuzzy limit was exceeded,
the loop in entryGetItem might drop out too soon if a whole block needs to
be skipped because it's < advancePast ("continue" in a while-loop checks the
loop condition too). Secondly, the loop checked when stepping to a new page
that there is at least one offset on the page < advancePast, but we cannot
rely on that on subsequent calls of entryGetItem, because advancePast might
change in between. That caused the skipping loop to read bogus items in the
TbmIterateResult's offset array.

First item and fix by Alexander Korotkov, second bug pointed out by Fabrízio
de Royes Mello, by a small variation of Alexander's test query.
2014-04-10 23:42:04 +03:00
Bruce Momjian 8fcccadfea C comment: track_activity_query_size doesn't support memory units
And explain why.

Per report from Pavel Stehule
2014-04-10 09:57:04 -04:00
Heikki Linnakangas 787064cd00 Fix typo in comment.
Tomonari Katsumata
2014-04-10 13:11:49 +03:00
Heikki Linnakangas 150a9df528 Fix a few more misc typos in comments. 2014-04-10 00:53:55 +03:00
Heikki Linnakangas 5b075ae893 Fix misc typos in comments. 2014-04-09 23:16:35 +03:00
Robert Haas b082732061 Add missing include.
This is more cleanup from commit 11a65eed16.

Amit Kapila
2014-04-09 11:46:49 -04:00
Robert Haas 0c4ea7a309 Fix silly oversight in patch to remove dsm state file.
I'm not sure if this is what's causing the Windows buildfarm members
to get unhappy, but I don't think it can be helping anything...
2014-04-08 16:22:50 -04:00
Tom Lane f23a5630eb Add an in-core GiST index opclass for inet/cidr types.
This operator class can accelerate subnet/supernet tests as well as
btree-equivalent ordered comparisons.  It also handles a new network
operator inet && inet (overlaps, a/k/a "is supernet or subnet of"),
which is expected to be useful in exclusion constraints.

Ideally this opclass would be the default for GiST with inet/cidr data,
but we can't mark it that way until we figure out how to do a more or
less graceful transition from the current situation, in which the
really-completely-bogus inet/cidr opclasses in contrib/btree_gist are
marked as default.  Having the opclass in core and not default is better
than not having it at all, though.

While at it, add new documentation sections to allow us to officially
document GiST/GIN/SP-GiST opclasses, something there was never a clear
place to do before.  I filled these in with some simple tables listing
the existing opclasses and the operators they support, but there's
certainly scope to put more information there.

Emre Hasegeli, reviewed by Andreas Karlsson, further hacking by me
2014-04-08 15:46:43 -04:00
Robert Haas 11a65eed16 Get rid of the dynamic shared memory state file.
Instead of storing the ID of the dynamic shared memory control
segment in a file within the data directory, store it in the main
control segment.  This avoids a number of nasty corner cases,
most seriously that doing an online backup and then using it on
the same machine (e.g. to fire up a standby) would result in the
standby clobbering all of the master's dynamic shared memory
segments.

Per complaints from Heikki Linnakangas, Fujii Masao, and Tom
Lane.
2014-04-08 11:39:55 -04:00
Robert Haas 0886fc6a5c Add new to_reg* functions for error-free OID lookups.
These functions won't throw an error if the object doesn't exist,
or if (for functions and operators) there's more than one matching
object.

Yugo Nagata and Nozomi Anzai, reviewed by Amit Khandekar, Marti
Raudsepp, Amit Kapila, and me.
2014-04-08 10:27:56 -04:00
Heikki Linnakangas 7ca32e255b Fix hot standby bug with GiST scans.
Don't reset the rightlink of a page when replaying a page update record.
This was a leftover from pre-hot standby days, when it was not possible to
have scans concurrent with WAL replay. Resetting the right-link was not
necessary back then either, but it was done for the sake of tidiness. But
with hot standby, it's wrong, because a concurrent scan might still need it.

Backpatch all versions with hot standby, 9.0 and above.
2014-04-08 14:51:40 +03:00
Heikki Linnakangas 38a2b95c34 Zero padding byte at end of GIN posting list.
This isn't strictly necessary, but helps debugging.
2014-04-07 19:49:03 +03:00
Robert Haas f235db03ff Remove 'make clean' support for ipc_test.
I missed this in the previous commit; Tom Lane spotted my error.
2014-04-07 11:45:27 -04:00
Robert Haas 315772e4ec Assert that strong-lock count is >0 everywhere it's decremented.
The one existing assertion of this type has tripped a few times in the
buildfarm lately, but it's not clear whether the problem is really
originating there or whether it's leftovers from a trip through one
of the other two paths that lack a matching assertion.  So add one.

Since the same bug(s) most likely exist(s) in the back-branches also,
back-patch to 9.2, where the fast-path lock mechanism was added.
2014-04-07 10:59:42 -04:00
Robert Haas b8a721149b Remove ipc_test.
This doesn't seem to be useful any more, and it's not really worth the
effort to keep updating it every time relevant dependencies or calling
signatures in the shared memory or semaphore code change.
2014-04-07 10:40:47 -04:00
Heikki Linnakangas 594bac4272 Fix WAL replay bug in the new GIN incomplete-split code.
Forgot to set the incomplete-split flag on the left page half, in redo of a
page split.

Spotted this by comparing the page contents on master and standby, after
inserting/applying each WAL record.
2014-04-07 14:37:30 +03:00
Simon Riggs e5550d5fec Reduce lock levels of some ALTER TABLE cmds
VALIDATE CONSTRAINT

CLUSTER ON
SET WITHOUT CLUSTER

ALTER COLUMN SET STATISTICS
ALTER COLUMN SET ()
ALTER COLUMN RESET ()

All other sub-commands use AccessExclusiveLock

Simon Riggs and Noah Misch

Reviews by Robert Haas and Andres Freund
2014-04-06 11:13:43 -04:00
Tom Lane 5d8117e1f3 Block signals earlier during postmaster startup.
Formerly, we set up the postmaster's signal handling only when we were
about to start launching subprocesses.  This is a bad idea though, as
it means that for example a SIGINT arriving before that will kill the
postmaster instantly, perhaps leaving lockfiles, socket files, shared
memory, etc laying about.  We'd rather that such a signal caused orderly
postmaster termination including releasing of those resources.  A simple
fix is to move the PostmasterMain stanza that initializes signal handling
to an earlier point, before we've created any such resources.  Then, an
early-arriving signal will be blocked until we're ready to deal with it
in the usual way.  (The only part that really needs to be moved up is
blocking of signals, but it seems best to keep the signal handler
installation calls together with that; for one thing this ensures the
kernel won't drop any signals we wished to get.  The handlers won't get
invoked in any case until we unblock signals in ServerLoop.)

Per a report from MauMau.  He proposed changing the way "pg_ctl stop"
works to deal with this, but that'd just be masking one symptom not
fixing the core issue.

It's been like this since forever, so back-patch to all supported branches.
2014-04-05 18:16:08 -04:00
Heikki Linnakangas ffbba6ee12 Fix another palloc in critical section.
Also add a regression test for a GIN index with enough items with the same
key, so that a GIN posting tree gets created. Apparently none of the
existing GIN tests were large enough for that.

This code is new, no backpatching required.
2014-04-05 22:15:58 +03:00
Tom Lane 6862ca6970 Fix processing of PGC_BACKEND GUC parameters on Windows.
EXEC_BACKEND builds (i.e., Windows) failed to absorb values of PGC_BACKEND
parameters if they'd been changed post-startup via the config file.  This
for example prevented log_connections from working if it were turned on
post-startup.  The mechanism for handling this case has always been a bit
of a kluge, and it wasn't revisited when we implemented EXEC_BACKEND.
While in a normal forking environment new backends will inherit the
postmaster's value of such settings, EXEC_BACKEND backends have to read
the settings from the CONFIG_EXEC_PARAMS file, and they were mistakenly
rejecting them.  So this case has always been broken in the Windows port;
so back-patch to all supported branches.

Amit Kapila
2014-04-05 12:41:25 -04:00
Tom Lane abe075dfff Fix tablespace creation WAL replay to work on Windows.
The code segment that removes the old symlink (if present) wasn't clued
into the fact that on Windows, symlinks are junction points which have
to be removed with rmdir().

Backpatch to 9.0, where the failing code was introduced.

MauMau, reviewed by Muhammad Asif Naeem and Amit Kapila
2014-04-04 23:09:35 -04:00
Tom Lane b203c57bb7 Allow "-C variable" and "--describe-config" even to root users.
There's no really compelling reason to refuse to do these read-only,
non-server-starting options as root, and there's at least one good
reason to allow -C: pg_ctl uses -C to find out the true data directory
location when pointed at a config-only directory.  On Windows, this is
done before dropping administrator privileges, which means that pg_ctl
fails for administrators if and only if a config-only layout is used.

Since the root-privilege check is done so early in startup, it's a bit
awkward to check for these switches.  Make the somewhat arbitrary
decision that we'll only skip the root check if -C is the first switch.
This is not just to make the code a bit simpler: it also guarantees that
we can't misinterpret a --boot mode switch.  (While AuxiliaryProcessMain
doesn't currently recognize any such switch, it might have one in the
future.)  This is no particular problem for pg_ctl, and since the whole
behavior is undocumented anyhow, it's not a documentation issue either.
(--describe-config only works as the first switch anyway, so this is
no restriction for that case either.)

Back-patch to 9.2 where pg_ctl first began to use -C.

MauMau, heavily edited by me
2014-04-04 22:03:35 -04:00
Tom Lane 9aca512506 Make sure -D is an absolute path when starting server on Windows.
This is needed because Windows services may get started with a different
current directory than where pg_ctl is executed.  We want relative -D
paths to be interpreted relative to pg_ctl's CWD, similarly to what
happens on other platforms.

In support of this, move the backend's make_absolute_path() function
into src/port/path.c (where it probably should have been long since)
and get rid of the rather inferior version in pg_regress.

Kumar Rajeev Rastogi, reviewed by MauMau
2014-04-04 18:42:13 -04:00
Tom Lane 8120c7452a Fix bogus time printout in walreceiver's debug log messages.
The displayed sendtime and receipttime were always exactly equal, because
somebody forgot that timestamptz_to_str returns a static buffer (thereby
simplifying life for most callers, at the cost of complicating it for those
who need two results concurrently).  Apply the same pstrdup solution used
by the other call sites with this issue.  Back-patch to 9.2 where the
faulty code was introduced.  Per bug #9849 from Haruka Takatsuka, though
this is not exactly his patch.

Possibly we should change timestamptz_to_str's API, but I wouldn't want
to do so in the back branches.
2014-04-04 11:44:04 -04:00
Robert Haas 59202fae04 Fix some compiler warnings that clang emits with -pedantic.
Andres Freund
2014-04-04 11:29:50 -04:00
Heikki Linnakangas b1236f4b7b Move multixid allocation out of critical section.
It can fail if you run out of memory.

This call was added in 9.3, so backpatch to 9.3 only.
2014-04-04 18:20:22 +03:00
Heikki Linnakangas d9e7873bbb In checkpoint, move the check for in-progress xacts out of critical section.
GetVirtualXIDsDelayingChkpt calls palloc, which isn't safe in a critical
section. I thought I covered this case with the exemption for the
checkpointer, but CreateCheckPoint is also called from the startup process.
2014-04-04 17:31:22 +03:00
Heikki Linnakangas 4a170ee9e0 Add an Assertion that you don't palloc within a critical section.
This caught a bunch of cases doing that already, which I just fixed in
previous commit. This is the assertion itself.

Per Tom Lane's idea.
2014-04-04 14:28:54 +03:00
Heikki Linnakangas 877b088785 Avoid allocations in critical sections.
If a palloc in a critical section fails, it becomes a PANIC.
2014-04-04 13:35:44 +03:00
Tom Lane c7b3539599 Fix non-equivalence of VARIADIC and non-VARIADIC function call formats.
For variadic functions (other than VARIADIC ANY), the syntaxes foo(x,y,...)
and foo(VARIADIC ARRAY[x,y,...]) should be considered equivalent, since the
former is converted to the latter at parse time.  They have indeed been
equivalent, in all releases before 9.3.  However, commit 75b39e790 made an
ill-considered decision to record which syntax had been used in FuncExpr
nodes, and then to make equal() test that in checking node equality ---
which caused the syntaxes to not be seen as equivalent by the planner.
This is the underlying cause of bug #9817 from Dmitry Ryabov.

It might seem that a quick fix would be to make equal() disregard
FuncExpr.funcvariadic, but the same commit made that untenable, because
the field actually *is* semantically significant for some VARIADIC ANY
functions.  This patch instead adopts the approach of redefining
funcvariadic (and aggvariadic, in HEAD) as meaning that the last argument
is a variadic array, whether it got that way by parser intervention or was
supplied explicitly by the user.  Therefore the value will always be true
for non-ANY variadic functions, restoring the principle of equivalence.
(However, the planner will continue to consider use of VARIADIC as a
meaningful difference for VARIADIC ANY functions, even though some such
functions might disregard it.)

In HEAD, this change lets us simplify the decompilation logic in
ruleutils.c, since the funcvariadic/aggvariadic flag tells directly whether
to print VARIADIC.  However, in 9.3 we have to continue to cope with
existing stored rules/views that might contain the previous definition.
Fortunately, this just means no change in ruleutils.c, since its existing
behavior effectively ignores funcvariadic for all cases other than VARIADIC
ANY functions.

In HEAD, bump catversion to reflect the fact that FuncExpr.funcvariadic
changed meanings; this is sort of pro forma, since I don't believe any
built-in views are affected.

Unfortunately, this patch doesn't magically fix everything for affected
9.3 users.  After installing 9.3.5, they might need to recreate their
rules/views/indexes containing variadic function calls in order to get
everything consistent with the new definition.  As in the cited bug,
the symptom of a problem would be failure to use a nominally matching
index that has a variadic function call in its definition.  We'll need
to mention this in the 9.3.5 release notes.
2014-04-03 22:02:24 -04:00
Tom Lane 741364bf5c Code review for commit d26888bc4d.
Mostly, copy-edit the comments; but also fix it to not reject domains over
arrays.
2014-04-03 16:57:45 -04:00
Heikki Linnakangas 04e298b826 Avoid palloc in critical section in GiST WAL-logging.
Memory allocation can fail if you run out of memory, and inside a critical
section that will lead to a PANIC. Use conservatively-sized arrays in stack
instead.

There was previously no explicit limit on the number of pages a GiST split
can produce, it was only limited by the number of LWLocks that can be held
simultaneously (100 at the moment). This patch adds an explicit limit of 75
pages. That should be plenty, a typical split shouldn't produce more than
2-3 page halves.

The bug has been there forever, but only backpatch down to 9.1. The code
was changed significantly in 9.1, and it doesn't seem worth the risk or
trouble to adapt this for 9.0 and 8.4.
2014-04-03 15:43:50 +03:00
Tom Lane fc752505a9 Fix assorted issues in client host name lookup.
The code for matching clients to pg_hba.conf lines that specify host names
(instead of IP address ranges) failed to complain if reverse DNS lookup
failed; instead it silently didn't match, so that you might end up getting
a surprising "no pg_hba.conf entry for ..." error, as seen in bug #9518
from Mike Blackwell.  Since we don't want to make this a fatal error in
situations where pg_hba.conf contains a mixture of host names and IP
addresses (clients matching one of the numeric entries should not have to
have rDNS data), remember the lookup failure and mention it as DETAIL if
we get to "no pg_hba.conf entry".  Apply the same approach to forward-DNS
lookup failures, too, rather than treating them as immediate hard errors.

Along the way, fix a couple of bugs that prevented us from detecting an
rDNS lookup error reliably, and make sure that we make only one rDNS lookup
attempt; formerly, if the lookup attempt failed, the code would try again
for each host name entry in pg_hba.conf.  Since more or less the whole
point of this design is to ensure there's only one lookup attempt not one
per entry, the latter point represents a performance bug that seems
sufficient justification for back-patching.

Also, adjust src/port/getaddrinfo.c so that it plays as well as it can
with this code.  Which is not all that well, since it does not have actual
support for rDNS lookup, but at least it should return the expected (and
required by spec) error codes so that the main code correctly perceives the
lack of functionality as a lookup failure.  It's unlikely that PG is still
being used in production on any machines that require our getaddrinfo.c,
so I'm not excited about working harder than this.

To keep the code in the various branches similar, this includes
back-patching commits c424d0d105 and
1997f34db4 into 9.2 and earlier.

Back-patch to 9.1 where the facility for hostnames in pg_hba.conf was
introduced.
2014-04-02 17:11:24 -04:00