Commit Graph

14736 Commits

Author SHA1 Message Date
Tom Lane f2874feb7c Some more FLEXIBLE_ARRAY_MEMBER fixes. 2015-02-21 01:46:43 -05:00
Tom Lane 33b2a2c97f Fix statically allocated struct with FLEXIBLE_ARRAY_MEMBER member.
clang complains about this, not unreasonably, so define another struct
that's explicitly for a WordEntryPos with exactly one element.

While at it, get rid of pretty dubious use of a static variable for
more than one purpose --- if it were being treated as const maybe
I'd be okay with this, but it isn't.
2015-02-20 17:50:18 -05:00
Tom Lane 33a3b03d63 Use FLEXIBLE_ARRAY_MEMBER in some more places.
Fix a batch of structs that are only visible within individual .c files.

Michael Paquier
2015-02-20 17:32:01 -05:00
Tom Lane c110eff132 Use FLEXIBLE_ARRAY_MEMBER in struct RecordIOData.
I (tgl) fixed this last night in rowtypes.c, but I missed that the
code had been copied into a couple of other places.

Michael Paquier
2015-02-20 17:03:12 -05:00
Tom Lane e38b1eb098 Use FLEXIBLE_ARRAY_MEMBER in struct varlena.
This forces some minor coding adjustments in tuptoaster.c and inv_api.c,
but the new coding there is cleaner anyway.

Michael Paquier
2015-02-20 16:51:53 -05:00
Alvaro Herrera d42358efb1 Have TRUNCATE update pgstat tuple counters
This works by keeping a per-subtransaction record of the ins/upd/del
counters before the truncate, and then resetting them; this record is
useful to return to the previous state in case the truncate is rolled
back, either in a subtransaction or whole transaction.  The state is
propagated upwards as subtransactions commit.

When the per-table data is sent to the stats collector, a flag indicates
to reset the live/dead counters to zero as well.

Catalog version bumped due to the change in pgstat format.

Author: Alexander Shulgin
Discussion: 1007.1207238291@sss.pgh.pa.us
Discussion: 548F7D38.2000401@BlueTreble.com
Reviewed-by: Álvaro Herrera, Jim Nasby
2015-02-20 12:10:01 -03:00
Tom Lane 5740be6d6e Some more FLEXIBLE_ARRAY_MEMBER hacking. 2015-02-20 02:28:03 -05:00
Tom Lane 9aa53bbd15 Remove unused variable.
Per buildfarm.
2015-02-20 00:47:28 -05:00
Tom Lane 09d8d110a6 Use FLEXIBLE_ARRAY_MEMBER in a bunch more places.
Replace some bogus "x[1]" declarations with "x[FLEXIBLE_ARRAY_MEMBER]".
Aside from being more self-documenting, this should help prevent bogus
warnings from static code analyzers and perhaps compiler misoptimizations.

This patch is just a down payment on eliminating the whole problem, but
it gets rid of a lot of easy-to-fix cases.

Note that the main problem with doing this is that one must no longer rely
on computing sizeof(the containing struct), since the result would be
compiler-dependent.  Instead use offsetof(struct, lastfield).  Autoconf
also warns against spelling that offsetof(struct, lastfield[0]).

Michael Paquier, review and additional fixes by me.
2015-02-20 00:11:42 -05:00
Tom Lane 2fb7a75f37 Add pg_stat_get_snapshot_timestamp() to show statistics snapshot timestamp.
Per discussion, this could be useful for purposes such as programmatically
detecting a nonresponding stats collector.  We already have the timestamp
anyway, it's just a matter of providing a SQL-accessible function to fetch
it.

Matt Kelly, reviewed by Jim Nasby
2015-02-19 21:36:50 -05:00
Tom Lane 56a79a869b Split array_push into separate array_append and array_prepend functions.
There wasn't any good reason for a single C function to implement both
these SQL functions: it saved very little code overall, and it required
significant pushups to re-determine at runtime which case applied.  Redoing
it as two functions ends up with just slightly more lines of code, but it's
simpler to understand, and faster too because we need not repeat syscache
lookups on every call.

An important side benefit is that this eliminates the only case in which
different aliases of the same C function had both anyarray and anyelement
arguments at the same position, which would almost always be a mistake.
The opr_sanity regression test will now notice such mistakes since there's
no longer a valid case where it happens.
2015-02-18 20:53:33 -05:00
Alvaro Herrera 9c7dd35019 Fix opclass/opfamily identity strings
The original representation uses "opcname for amname", which is good
enough; but if we replace "for" with "using", we can apply the returned
identity directly in a DROP command, as in

DROP OPERATOR CLASS opcname USING amname

This slightly simplifies code using object identities to programatically
execute commands on these kinds of objects.

Note backwards-incompatible change:
The previous representation dates back to 9.3 when object identities
were introduced by commit f8348ea3, but we don't want to change the
behavior on released branches unnecessarily and so this is not
backpatched.
2015-02-18 14:44:27 -03:00
Alvaro Herrera 0d906798f6 Fix object identities for pg_conversion objects
We were neglecting to schema-qualify them.

Backpatch to 9.3, where object identities were introduced as a concept
by commit f8348ea32e.
2015-02-18 14:28:11 -03:00
Tom Lane abe45a9b31 Fix EXPLAIN output for cases where parent table is excluded by constraints.
The previous coding in EXPLAIN always labeled a ModifyTable node with the
name of the target table affected by its first child plan.  When originally
written, this was necessarily the parent table of the inheritance tree,
so everything was unconfusing.  But when we added NO INHERIT constraints,
it became possible for the parent table to be deleted from the plan by
constraint exclusion while still leaving child tables present.  This led to
the ModifyTable plan node being labeled with the first surviving child,
which was deemed confusing.  Fix it by retaining the parent table's RT
index in a new field in ModifyTable.

Etsuro Fujita, reviewed by Ashutosh Bapat and myself
2015-02-17 18:04:11 -05:00
Heikki Linnakangas 931bf3eb9b Fix a bug in pairing heap removal code.
After removal, the next_sibling pointer of a node was sometimes incorrectly
left to point to another node in the heap, which meant that a node was
sometimes linked twice into the heap. Surprisingly that didn't cause any
crashes in my testing, but it was clearly wrong and could easily segfault
in other scenarios.

Also always keep the prev_or_parent pointer as NULL on the root node. That
was not a correctness issue AFAICS, but let's be tidy.

Add a debugging function, to dump the contents of a pairing heap as a
string. It's #ifdef'd out, as it's not used for anything in any normal
code, but it was highly useful in debugging this. Let's keep it handy for
further reference.
2015-02-17 22:55:53 +02:00
Heikki Linnakangas d17b6df239 Fix knn-GiST queue comparison function to return heap tuples first.
The part of the comparison function that was supposed to keep heap tuples
ahead of index items was backwards. It would not lead to incorrect results,
but it is more efficient to return heap tuples first, before scanning more
index pages, when both have the same distance.

Alexander Korotkov
2015-02-17 22:33:38 +02:00
Tom Lane 2e105def09 Remove code to match IPv4 pg_hba.conf entries to IPv4-in-IPv6 addresses.
In investigating yesterday's crash report from Hugo Osvaldo Barrera, I only
looked back as far as commit f3aec2c7f5 where the breakage occurred
(which is why I thought the IPv4-in-IPv6 business was undocumented).  But
actually the logic dates back to commit 3c9bb8886d and was simply
broken by erroneous refactoring in the later commit.  A bit of archives
excavation shows that we added the whole business in response to a report
that some 2003-era Linux kernels would report IPv4 connections as having
IPv4-in-IPv6 addresses.  The fact that we've had no complaints since 9.0
seems to be sufficient confirmation that no modern kernels do that, so
let's just rip it all out rather than trying to fix it.

Do this in the back branches too, thus essentially deciding that our
effective behavior since 9.0 is correct.  If there are any platforms on
which the kernel reports IPv4-in-IPv6 addresses as such, yesterday's fix
would have made for a subtle and potentially security-sensitive change in
the effective meaning of IPv4 pg_hba.conf entries, which does not seem like
a good thing to do in minor releases.  So let's let the post-9.0 behavior
stand, and change the documentation to match it.

In passing, I failed to resist the temptation to wordsmith the description
of pg_hba.conf IPv4 and IPv6 address entries a bit.  A lot of this text
hasn't been touched since we were IPv4-only.
2015-02-17 12:49:18 -05:00
Tom Lane cb66f495f5 Fix misuse of memcpy() in check_ip().
The previous coding copied garbage into a local variable, pretty much
ensuring that the intended test of an IPv6 connection address against a
promoted IPv4 address from pg_hba.conf would never match.  The lack of
field complaints likely indicates that nobody realized this was supposed
to work, which is unsurprising considering that no user-facing docs suggest
it should work.

In principle this could have led to a SIGSEGV due to reading off the end of
memory, but since the source address would have pointed to somewhere in the
function's stack frame, that's quite unlikely.  What led to discovery of
the bug is Hugo Osvaldo Barrera's report of a crash after an OS upgrade,
which is probably because he is now running a system in which memcpy raises
abort() upon detecting overlapping source and destination areas.  (You'd
have to additionally suppose some things about the stack frame layout to
arrive at this conclusion, but it seems plausible.)

This has been broken since the code was added, in commit f3aec2c7f5,
so back-patch to all supported branches.
2015-02-16 16:18:31 -05:00
Heikki Linnakangas 1c2b7c0879 Restore the SSL_set_session_id_context() call to OpenSSL renegotiation.
This reverts the removal of the call in commit (272923a0). It turns out it
wasn't superfluous after all: without it, renegotiation fails if a client
certificate was used. The rest of the changes in that commit are still OK
and not reverted.

Per investigation of bug #12769 by Arne Scheffer, although this doesn't fix
the reported bug yet.
2015-02-16 22:34:32 +02:00
Tom Lane e983c4d1aa Rationalize the APIs of array element/slice access functions.
The four functions array_ref, array_set, array_get_slice, array_set_slice
have traditionally declared their array inputs and results as being of type
"ArrayType *".  This is a lie, and has been since Berkeley days, because
they actually also support "fixed-length array" types such as "name" and
"point"; not to mention that the inputs could be toasted.  These values
should be declared Datum instead to avoid confusion.  The current coding
already risks possible misoptimization by compilers, and it'll get worse
when "expanded" array representations become a valid alternative.

However, there's a fair amount of code using array_ref and array_set with
arrays that *are* known to be ArrayType structures, and there might be more
such places in third-party code.  Rather than cluttering those call sites
with PointerGetDatum/DatumGetArrayTypeP cruft, what I did was to rename the
existing functions to array_get_element/array_set_element, fix their
signatures, then reincarnate array_ref/array_set as backwards compatibility
wrappers.

array_get_slice/array_set_slice have no such constituency in the core code,
and probably not in third-party code either, so I just changed their APIs.
2015-02-16 12:23:58 -05:00
Tom Lane 08361cea2b Fix null-pointer-deref crash while doing COPY IN with check constraints.
In commit bf7ca15875 I introduced an
assumption that an RTE referenced by a whole-row Var must have a valid eref
field.  This is false for RTEs constructed by DoCopy, and there are other
places taking similar shortcuts.  Perhaps we should make all those places
go through addRangeTableEntryForRelation or its siblings instead of having
ad-hoc logic, but the most reliable fix seems to be to make the new code in
ExecEvalWholeRowVar cope if there's no eref.  We can reasonably assume that
there's no need to insert column aliases if no aliases were provided.

Add a regression test case covering this, and also verifying that a sane
column name is in fact available in this situation.

Although the known case only crashes in 9.4 and HEAD, it seems prudent to
back-patch the code change to 9.2, since all the ingredients for a similar
failure exist in the variant patch applied to 9.3 and 9.2.

Per report from Jean-Pierre Pelletier.
2015-02-15 23:26:45 -05:00
Heikki Linnakangas 80788a431e Simplify waiting logic in reading from / writing to client.
The client socket is always in non-blocking mode, and if we actually want
blocking behaviour, we emulate it by sleeping and retrying. But we have
retry loops at different layers for reads and writes, which was confusing.
To simplify, remove all the sleeping and retrying code from the lower
levels, from be_tls_read and secure_raw_read and secure_raw_write, and put
all the logic in secure_read() and secure_write().
2015-02-13 21:46:14 +02:00
Heikki Linnakangas 272923a0a6 Simplify the way OpenSSL renegotiation is initiated in server.
At least in all modern versions of OpenSSL, it is enough to call
SSL_renegotiate() once, and then forget about it. Subsequent SSL_write()
and SSL_read() calls will finish the handshake.

The SSL_set_session_id_context() call is unnecessary too. We only have
one SSL context, and the SSL session was created with that to begin with.
2015-02-13 21:46:08 +02:00
Tom Lane 4f38a281a3 Fix missing PQclear() in libpqrcv_endstreaming().
This omission leaked one PGresult per WAL streaming cycle, which possibly
would never be enough to notice in the real world, but it's still a leak.

Per Coverity.  Back-patch to 9.3 where the error was introduced.
2015-02-11 19:20:49 -05:00
Tom Lane 58146d35de Fix minor memory leak in ident_inet().
We'd leak the ident_serv data structure if the second pg_getaddrinfo_all
(the one for the local address) failed.  This is not of great consequence
because a failure return here just leads directly to backend exit(), but
if this function is going to try to clean up after itself at all, it should
not have such holes in the logic.  Try to fix it in a future-proof way by
having all the failure exits go through the same cleanup path, rather than
"optimizing" some of them.

Per Coverity.  Back-patch to 9.2, which is as far back as this patch
applies cleanly.
2015-02-11 19:09:54 -05:00
Tom Lane 1a179f36f7 Fix GEQO to not assume its join order heuristic always works.
Back in commit 400e2c9344 I rewrote GEQO's
gimme_tree function to improve its heuristic for modifying the given tour
into a legal join order.  In what can only be called a fit of hubris,
I supposed that this new heuristic would *always* find a legal join order,
and ripped out the old logic that allowed gimme_tree to sometimes fail.

The folly of this is exposed by bug #12760, in which the "greedy" clumping
behavior of merge_clump() can lead it into a dead end which could only be
recovered from by un-clumping.  We have no code for that and wouldn't know
exactly what to do with it if we did.  Rather than try to improve the
heuristic rules still further, let's just recognize that it *is* a
heuristic and probably must always have failure cases.  So, put back the
code removed in the previous commit to allow for failure (but comment it
a bit better this time).

It's possible that this code was actually fully correct at the time and
has only been broken by the introduction of LATERAL.  But having seen this
example I no longer have much faith in that proposition, so back-patch to
all supported branches.
2015-02-10 20:37:19 -05:00
Tom Lane bc4de01db3 Minor cleanup/code review for "indirect toast" stuff.
Fix some issues I noticed while fooling with an extension to allow an
additional kind of toast pointer.  Much of this is just comment
improvement, but there are a couple of actual bugs, which might or might
not be reachable today depending on what can happen during logical
decoding.  An example is that toast_flatten_tuple() failed to cover the
possibility of an indirection pointer in its input.  Back-patch to 9.4
just in case that is reachable now.

In HEAD, also correct some really minor issues with recent compression
reorganization, such as dangerously underparenthesized macros.
2015-02-09 12:30:52 -05:00
Heikki Linnakangas c619c2351f Move pg_crc.c to src/common, and remove pg_crc_tables.h
To get CRC functionality in a client program, you now need to link with
libpgcommon instead of libpgport. The CRC code has nothing to do with
portability, so libpgcommon is a better home. (libpgcommon didn't exist
when pg_crc.c was originally moved to src/port.)

Remove the possibility to get CRC functionality by just #including
pg_crc_tables.h. I'm not aware of any extensions that actually did that and
couldn't simply link with libpgcommon.

This also moves the pg_crc.h header file from src/include/utils to
src/include/common, which will require changes to any external programs
that currently does #include "utils/pg_crc.h". That seems acceptable, as
include/common is clearly the right home for it now, and the change needed
to any such programs is trivial.
2015-02-09 11:17:56 +02:00
Fujii Masao 40bede5477 Move pg_lzcompress.c to src/common.
The meta data of PGLZ symbolized by PGLZ_Header is removed, to make
the compression and decompression code independent on the backend-only
varlena facility. PGLZ_Header is being used to store some meta data
related to the data being compressed like the raw length of the uncompressed
record or some varlena-related data, making it unpluggable once PGLZ is
stored in src/common as it contains some backend-only code paths with
the management of varlena structures. The APIs of PGLZ are reworked
at the same time to do only compression and decompression of buffers
without the meta-data layer, simplifying its use for a more general usage.

On-disk format is preserved as well, so there is no incompatibility with
previous major versions of PostgreSQL for TOAST entries.

Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. Especially this commit is required
for upcoming WAL compression feature so that the WAL reader facility can
decompress the WAL data by using pglz_decompress.

Michael Paquier, reviewed by me.
2015-02-09 15:15:24 +09:00
Noah Misch 237795a7b4 Check DCH_MAX_ITEM_SIZ limits with <=, not <.
We reserve space for the full amount, not one less.  The affected checks
deal with localized month and day names.  Today's DCH_MAX_ITEM_SIZ value
would suffice for a 60-byte day name, while the longest known is the
49-byte mn_CN.utf-8 word for "Saturday."  Thus, the upshot of this
change is merely to avoid misdirecting future readers of the code; users
are not expected to see errors either way.
2015-02-06 23:39:52 -05:00
Noah Misch a7a4adcf8d Assert(PqCommReadingMsg) in pq_peekbyte().
Interrupting pq_recvbuf() can break protocol sync, so its callers all
deserve this assertion.  The one pq_peekbyte() caller suffices already.
2015-02-06 23:14:27 -05:00
Heikki Linnakangas ff16b40f8c Report WAL flush, not insert, position in replication IDENTIFY_SYSTEM
When beginning streaming replication, the client usually issues the
IDENTIFY_SYSTEM command, which used to return the current WAL insert
position. That's not suitable for the intended purpose of that field,
however. pg_receivexlog uses it to start replication from the reported
point, but if it hasn't been flushed to disk yet, it will fail. Change
IDENTIFY_SYSTEM to report the flush position instead.

Backpatch to 9.1 and above. 9.0 doesn't report any WAL position.
2015-02-06 11:26:50 +02:00
Heikki Linnakangas d88976cfa1 Use a separate memory context for GIN scan keys.
It was getting tedious to track and release all the different things that
form a scan key. We were leaking at least the queryCategories array, and
possibly more, on a rescan. That was visible if a GIN index was used in a
nested loop join. This also protects from leaks in extractQuery method.

No backpatching, given the lack of complaints from the field. Maybe later,
after this has received more field testing.
2015-02-04 17:40:25 +02:00
Heikki Linnakangas 57fe246890 Fix reference-after-free when waiting for another xact due to constraint.
If an insertion or update had to wait for another transaction to finish,
because there was another insertion with conflicting key in progress,
we would pass a just-free'd item pointer to XactLockTableWait().

All calls to XactLockTableWait() and MultiXactIdWait() had similar issues.
Some passed a pointer to a buffer in the buffer cache, after already
releasing the lock. The call in EvalPlanQualFetch had already released the
pin too. All but the call in execUtils.c would merely lead to reporting a
bogus ctid, however (or an assertion failure, if enabled).

All the callers that passed HeapTuple->t_data->t_ctid were slightly bogus
anyway: if the tuple was updated (again) in the same transaction, its ctid
field would point to the next tuple in the chain, not the tuple itself.

Backpatch to 9.4, where the 'ctid' argument to XactLockTableWait was added
(in commit f88d4cfc)
2015-02-04 16:00:34 +02:00
Andres Freund 3a54f4a494 Remove ill-conceived Assertion in ProcessClientWriteInterrupt().
It's perfectly fine to have blocked interrupts when
ProcessClientWriteInterrupt() is called. In fact it's commonly the
case when emitting error reports. And we deal with that correctly.

Even if that'd not be the case, it'd be a bad location for such a
assertion. Because ProcessClientWriteInterrupt() is only called when
the socket is blocked it's hard to hit.

Per Heikki and buildfarm animals nightjar and dunlin.
2015-02-03 23:52:15 +01:00
Andres Freund 2505ce0be0 Remove remnants of ImmediateInterruptOK handling.
Now that nothing sets ImmediateInterruptOK to true anymore, we can
remove all the supporting code.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:25:47 +01:00
Andres Freund d06995710b Remove the option to service interrupts during PGSemaphoreLock().
The remaining caller (lwlocks) doesn't need that facility, and we plan
to remove ImmedidateInterruptOK entirely. That means that interrupts
can't be serviced race-free and portably anyway, so there's little
reason for keeping the feature.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:25:00 +01:00
Andres Freund 6753333f55 Move deadlock and other interrupt handling in proc.c out of signal handlers.
Deadlock checking was performed inside signal handlers up to
now. While it's a remarkable feat to have made this work reliably,
it's quite complex to understand why that is the case. Partially it
worked due to the assumption that semaphores are signal safe - which
is not actually documented to be the case for sysv semaphores.

The reason we had to rely on performing this work inside signal
handlers is that semaphores aren't guaranteed to be interruptable by
signals on all platforms. But now that latches provide a somewhat
similar API, which actually has the guarantee of being interruptible,
we can avoid doing so.

Signalling between ProcSleep, ProcWakeup, ProcWaitForSignal and
ProcSendSignal is now done using latches. This increases the
likelihood of spurious wakeups. As spurious wakeup already were
possible and aren't likely to be frequent enough to be an actual
problem, this seems acceptable.

This change would allow for further simplification of the deadlock
checking, now that it doesn't have to run in a signal handler. But
even if I were motivated to do so right now, it would still be better
to do that separately. Such a cleanup shouldn't have to be reviewed a
the same time as the more fundamental changes in this commit.

There is one possible usability regression due to this commit. Namely
it is more likely than before that log_lock_waits messages are output
more than once.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:24:38 +01:00
Andres Freund 6647248e37 Don't allow immediate interrupts during authentication anymore.
We used to handle authentication_timeout by setting
ImmediateInterruptOK to true during large parts of the authentication
phase of a new connection.  While that happens to work acceptably in
practice, it's not particularly nice and has ugly corner cases.

Previous commits converted the FE/BE communication to use latches and
implemented support for interrupt handling during both
send/recv. Building on top of that work we can get rid of
ImmediateInterruptOK during authentication, by immediately treating
timeouts during authentication as a reason to die. As die interrupts
are handled immediately during client communication that provides a
sensibly quick reaction time to authentication timeout.

Additionally add a few CHECK_FOR_INTERRUPTS() to some more complex
authentication methods. More could be added, but this already should
provides a reasonable coverage.

While it this overall increases the maximum time till a timeout is
reacted to, it greatly reduces complexity and increases
reliability. That seems like a overall win. If the increase proves to
be noticeable we can deal with those cases by moving to nonblocking
network code and add interrupt checking there.

Reviewed-By: Heikki Linnakangas
2015-02-03 22:54:48 +01:00
Tom Lane cec916f35b Remove unused "m" field in LSEG.
This field has been unreferenced since 1998, and does not appear in lseg
values stored on disk (since sizeof(lseg) is only 32 bytes according to
pg_type).  There was apparently some idea of maintaining it just in values
appearing in memory, but the bookkeeping required to make that work would
surely far outweigh the cost of recalculating the line's slope when needed.
Remove it to (a) simplify matters and (b) suppress some uninitialized-field
whining from Coverity.
2015-02-03 16:53:32 -05:00
Andres Freund 4fe384bd85 Process 'die' interrupts while reading/writing from the client socket.
Up to now it was impossible to terminate a backend that was trying to
send/recv data to/from the client when the socket's buffer was already
full/empty. While the send/recv calls itself might have gotten
interrupted by signals on some platforms, we just immediately retried.

That could lead to situations where a backend couldn't be terminated ,
after a client died without the connection being closed, because it
was blocked in send/recv.

The problem was far more likely to be hit when sending data than when
reading. That's because while reading a command from the client, and
during authentication, we processed interrupts immediately . That
primarily left COPY FROM STDIN as being problematic for recv.

Change things so that that we process 'die' events immediately when
the appropriate signal arrives. We can't sensibly react to query
cancels at that point, because we might loose sync with the client as
we could be in the middle of writing a message.

We don't interrupt writes if the write buffer isn't full, as indicated
by write() returning EWOULDBLOCK, as that would lead to fewer error
messages reaching clients.

Per discussion with Kyotaro HORIGUCHI and Heikki Linnakangas

Discussion: 20140927191243.GD5423@alap3.anarazel.de
2015-02-03 22:45:45 +01:00
Andres Freund 4f85fde8eb Introduce and use infrastructure for interrupt processing during client reads.
Up to now large swathes of backend code ran inside signal handlers
while reading commands from the client, to allow for speedy reaction to
asynchronous events. Most prominently shared invalidation and NOTIFY
handling. That means that complex code like the starting/stopping of
transactions is run in signal handlers...  The required code was
fragile and verbose, and is likely to contain bugs.

That approach also severely limited what could be done while
communicating with the client. As the read might be from within
openssl it wasn't safely possible to trigger an error, e.g. to cancel
a backend in idle-in-transaction state. We did that in some cases,
namely fatal errors, nonetheless.

Now that FE/BE communication in the backend employs non-blocking
sockets and latches to block, we can quite simply interrupt reads from
signal handlers by setting the latch. That allows us to signal an
interrupted read, which is supposed to be retried after returning from
within the ssl library.

As signal handlers now only need to set the latch to guarantee timely
interrupt processing, remove a fair amount of complicated & fragile
code from async.c and sinval.c.

We could now actually start to process some kinds of interrupts, like
sinval ones, more often that before, but that seems better done
separately.

This work will hopefully allow to handle cases like being blocked by
sending data, interrupting idle transactions and similar to be
implemented without too much effort.  In addition to allowing getting
rid of ImmediateInterruptOK, that is.

Author: Andres Freund
Reviewed-By: Heikki Linnakangas
2015-02-03 22:25:20 +01:00
Andres Freund 387da18874 Use a nonblocking socket for FE/BE communication and block using latches.
This allows to introduce more elaborate handling of interrupts while
reading from a socket.  Currently some interrupt handlers have to do
significant work from inside signal handlers, and it's very hard to
correctly write code to do so.  Generic signal handler limitations,
combined with the fact that we can't safely jump out of a signal
handler while reading from the client have prohibited implementation
of features like timeouts for idle-in-transaction.

Additionally we use the latch code to wait in a couple places where we
previously only had waiting code on windows as other platforms just
busy looped.

This can increase the number of systemcalls happening during FE/BE
communication. Benchmarks so far indicate that the impact isn't very
high, and there's room for optimization in the latch code. The chance
of cleaning up the usage of latches gives us, seem to outweigh the
risk of small performance regressions.

This commit theoretically can't used without the next patch in the
series, as WaitLatchOrSocket is not defined to be fully signal
safe. As we already do that in some cases though, it seems better to
keep the commits separate, so they're easier to understand.

Author: Andres Freund
Reviewed-By: Heikki Linnakangas
2015-02-03 22:03:48 +01:00
Tom Lane 778d498c7d Fix breakage in GEODEBUG debug code.
LINE doesn't have an "m" field (anymore anyway).  Also fix unportable
assumption that %x can print the result of pointer subtraction.

In passing, improve single_decode() in minor ways:
* Remove unnecessary leading-whitespace skip (strtod does that already).
* Make GEODEBUG message more intelligible.
* Remove entirely-useless test to see if strtod returned a silly pointer.
* Don't bother computing trailing-whitespace skip unless caller wants
  an ending pointer.

This has been broken since 261c7d4b65.
Although it's only debug code, might as well fix the 9.4 branch too.
2015-02-03 15:20:45 -05:00
Heikki Linnakangas 809d9a260b Refactor page compactifying code.
The logic to compact away removed tuples from page was duplicated with
small differences in PageRepairFragmentation, PageIndexMultiDelete, and
PageIndexDeleteNoCompact. Put it into a common function.

Reviewed by Peter Geoghegan.
2015-02-03 14:09:29 +02:00
Heikki Linnakangas efba7a542f Fix typo in comment.
Amit Langote
2015-02-03 09:49:07 +02:00
Robert Haas 5d2f957f3f Add new function BackgroundWorkerInitializeConnectionByOid.
Sometimes it's useful for a background worker to be able to initialize
its database connection by OID rather than by name, so provide a way
to do that.
2015-02-02 16:23:59 -05:00
Heikki Linnakangas 2b3a8b20c2 Be more careful to not lose sync in the FE/BE protocol.
If any error occurred while we were in the middle of reading a protocol
message from the client, we could lose sync, and incorrectly try to
interpret a part of another message as a new protocol message. That will
usually lead to an "invalid frontend message" error that terminates the
connection. However, this is a security issue because an attacker might
be able to deliberately cause an error, inject a Query message in what's
supposed to be just user data, and have the server execute it.

We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
operations that could ereport(ERROR) in the middle of processing a message,
but a query cancel interrupt or statement timeout could nevertheless cause
it to happen. Also, the V2 fastpath and COPY handling were not so careful.
It's very difficult to recover in the V2 COPY protocol, so we will just
terminate the connection on error. In practice, that's what happened
previously anyway, as we lost protocol sync.

To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
whenever we're in the middle of reading a message. When it's set, we cannot
safely ERROR out and continue running, because we might've read only part
of a message. PqCommReadingMsg acts somewhat similarly to critical sections
in that if an error occurs while it's set, the error handler will force the
connection to be terminated, as if the error was FATAL. It's not
implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
to PANIC in critical sections, because we want to be able to use
PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
advantage of that to prevent an OOM error from terminating the connection.

To prevent unnecessary connection terminations, add a holdoff mechanism
similar to HOLD/RESUME_INTERRUPTS() that can be used hold off query cancel
interrupts, but still allow die interrupts. The rules on which interrupts
are processed when are now a bit more complicated, so refactor
ProcessInterrupts() and the calls to it in signal handlers so that the
signal handlers always call it if ImmediateInterruptOK is set, and
ProcessInterrupts() can decide to not do anything if the other conditions
are not met.

Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
Backpatch to all supported versions.

Security: CVE-2015-0244
2015-02-02 17:09:53 +02:00
Bruce Momjian 9241c84cbc to_char(): prevent writing beyond the allocated buffer
Previously very long localized month and weekday strings could
overflow the allocated buffers, causing a server crash.

Reported and patch reviewed by Noah Misch.  Backpatch to all
supported versions.

Security: CVE-2015-0241
2015-02-02 10:00:45 -05:00
Bruce Momjian 0150ab567b to_char(): prevent accesses beyond the allocated buffer
Previously very long field masks for floats could access memory
beyond the existing buffer allocated to hold the result.

Reported by Andres Freund and Peter Geoghegan.	Backpatch to all
supported versions.

Security: CVE-2015-0241
2015-02-02 10:00:44 -05:00
Peter Eisentraut f8948616c9 Translation updates
Source-Git-URL: git://git.postgresql.org/git/pgtranslation/messages.git
Source-Git-Hash: 19c72ea8d856d7b1d4f5d759a766c8206bf9ce53
2015-02-01 23:23:40 -05:00
Tom Lane 451d280815 Fix jsonb Unicode escape processing, and in consequence disallow \u0000.
We've been trying to support \u0000 in JSON values since commit
78ed8e03c6, and have introduced increasingly worse hacks to try to
make it work, such as commit 0ad1a81632.  However, it fundamentally
can't work in the way envisioned, because the stored representation looks
the same as for \\u0000 which is not the same thing at all.  It's also
entirely bogus to output \u0000 when de-escaped output is called for.

The right way to do this would be to store an actual 0x00 byte, and then
throw error only if asked to produce de-escaped textual output.  However,
getting to that point seems likely to take considerable work and may well
never be practical in the 9.4.x series.

To preserve our options for better behavior while getting rid of the nasty
side-effects of 0ad1a81632, revert that commit in toto and instead
throw error if \u0000 is used in a context where it needs to be de-escaped.
(These are the same contexts where non-ASCII Unicode escapes throw error
if the database encoding isn't UTF8, so this behavior is by no means
without precedent.)

In passing, make both the \u0000 case and the non-ASCII Unicode case report
ERRCODE_UNTRANSLATABLE_CHARACTER / "unsupported Unicode escape sequence"
rather than claiming there's something wrong with the input syntax.

Back-patch to 9.4, where we have to do something because 0ad1a81632
broke things for many cases having nothing to do with \u0000.  9.3 also has
bogus behavior, but only for that specific escape value, so given the lack
of field complaints it seems better to leave 9.3 alone.
2015-01-30 14:44:56 -05:00
Robert Haas bd4e2fd97d Provide a way to supress the "out of memory" error when allocating.
Using the new interface MemoryContextAllocExtended, callers can
specify MCXT_ALLOC_NO_OOM if they are prepared to handle a NULL
return value.

Michael Paquier, reviewed and somewhat revised by me.
2015-01-30 12:56:48 -05:00
Tom Lane 3d660d33aa Fix assorted oversights in range selectivity estimation.
calc_rangesel() failed outright when comparing range variables to empty
constant ranges with < or >=, as a result of missing cases in a switch.
It also produced a bogus estimate for > comparison to an empty range.

On top of that, the >= and > cases were mislabeled throughout.  For
nonempty constant ranges, they managed to produce the right answers
anyway as a result of counterbalancing typos.

Also, default_range_selectivity() omitted cases for elem <@ range,
range &< range, and range &> range, so that rather dubious defaults
were applied for these operators.

In passing, rearrange the code in rangesel() so that the elem <@ range
case is handled in a less opaque fashion.

Report and patch by Emre Hasegeli, some additional work by me
2015-01-30 12:30:59 -05:00
Heikki Linnakangas 68fa75f318 Fix query-duration memory leak with GIN rescans.
The requiredEntries / additionalEntries arrays were not freed in
freeScanKeys() like other per-key stuff.

It's not obvious, but startScanKey() was only ever called after the keys
have been initialized with ginNewScanKey(). That's why it doesn't need to
worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap
was never true, because ginrescan free's the existing keys, and it's not OK
to call gingetbitmap twice in a row without calling ginrescan in between.
To make that clear, remove the unnecessary ginIsNewKey(). And just to be
extra sure that nothing funny happens if there is an existing key after all,
call freeScanKeys() to free it if it exists. This makes the code more
straightforward.

(I'm seeing other similar leaks in testing a query that rescans an GIN index
scan, but that's a different issue. This just fixes the obvious leak with
those two arrays.)

Backpatch to 9.4, where GIN fast scan was added.
2015-01-30 17:58:23 +01:00
Stephen Frost 32bf6ee6ab Fix BuildIndexValueDescription for expressions
In 804b6b6db4 we modified
BuildIndexValueDescription to pay attention to which columns are visible
to the user, but unfortunatley that commit neglected to consider indexes
which are built on expressions.

Handle error-reporting of violations of constraint indexes based on
expressions by not returning any detail when the user does not have
table-level SELECT rights.

Backpatch to 9.0, as the prior commit was.

Pointed out by Tom.
2015-01-29 21:59:34 -05:00
Andres Freund 17792bfc5b Properly terminate the array returned by GetLockConflicts().
GetLockConflicts() has for a long time not properly terminated the
returned array. During normal processing the returned array is zero
initialized which, while not pretty, is sufficient to be recognized as
a invalid virtual transaction id. But the HotStandby case is more than
aesthetically broken: The allocated (and reused) array is neither
zeroed upon allocation, nor reinitialized, nor terminated.

Not having a terminating element means that the end of the array will
not be recognized and that recovery conflict handling will thus read
ahead into adjacent memory. Only terminating when hitting memory
content that looks like a invalid virtual transaction id.  Luckily
this seems so far not have caused significant problems, besides making
recovery conflict more expensive.

Discussion: 20150127142713.GD29457@awork2.anarazel.de

Backpatch into all supported branches.
2015-01-29 22:48:45 +01:00
Andres Freund ed127002d8 Align buffer descriptors to cache line boundaries.
Benchmarks has shown that aligning the buffer descriptor array to
cache lines is important for scalability; especially on bigger,
multi-socket, machines.

Currently the array sometimes already happens to be aligned by
happenstance, depending how large previous shared memory allocations
were. That can lead to wildly varying performance results after minor
configuration changes.

In addition to aligning the start of descriptor array, also force the
size of individual descriptors to be of a common cache line size (64
bytes). That happens to already be the case on 64bit platforms, but
this way we can change the struct BufferDesc more easily.

As the alignment primarily matters in highly concurrent workloads
which probably all are 64bit these days, and the space wastage of
element alignment would be a bit more noticeable on 32bit systems, we
don't force the stride to be cacheline sized on 32bit platforms for
now. If somebody does actual performance testing, we can reevaluate
that decision by changing the definition of BUFFERDESC_PADDED_SIZE.

Discussion: 20140202151319.GD32123@awork2.anarazel.de

Per discussion with Bruce Momjan, Tom Lane, Robert Haas, and Peter
Geoghegan.
2015-01-29 22:48:45 +01:00
Andres Freund 7142bfbbd3 Fix #ifdefed'ed out code to compile again. 2015-01-29 22:48:45 +01:00
Heikki Linnakangas 31ed42b9a3 Fix bug where GIN scan keys were not initialized with gin_fuzzy_search_limit.
When gin_fuzzy_search_limit was used, we could jump out of startScan()
without calling startScanKey(). That was harmless in 9.3 and below, because
startScanKey()() didn't do anything interesting, but in 9.4 it initializes
information needed for skipping entries (aka GIN fast scans), and you
readily get a segfault if it's not done. Nevertheless, it was clearly wrong
all along, so backpatch all the way to 9.1 where the early return was
introduced.

(AFAICS startScanKey() did nothing useful in 9.3 and below, because the
fields it initialized were already initialized in ginFillScanKey(), but I
don't dare to change that in a minor release. ginFillScanKey() is always
called in gingetbitmap() even though there's a check there to see if the
scan keys have already been initialized, because they never are; ginrescan()
free's them.)

In the passing, remove unnecessary if-check from the second inner loop in
startScan(). We already check in the first loop that the condition is true
for all entries.

Reported by Olaf Gawenda, bug #12694, Backpatch to 9.1 and above, although
AFAICS it causes a live bug only in 9.4.
2015-01-29 19:35:55 +02:00
Robert Haas 3d6d1b5855 Move out-of-memory error checks from aset.c to mcxt.c
This potentially allows us to add mcxt.c interfaces that do something
other than throw an error when memory cannot be allocated.  We'll
handle adding those interfaces in a separate commit.

Michael Paquier, with minor changes by me
2015-01-29 10:23:38 -05:00
Stephen Frost c7cf9a2433 Add usebypassrls to pg_user and pg_shadow
The row level security patches didn't add the 'usebypassrls' columns to
the pg_user and pg_shadow views on the belief that they were deprecated,
but we havn't actually said they are and therefore we should include it.

This patch corrects that, adds missing documentation for rolbypassrls
into the system catalog page for pg_authid, along with the entries for
pg_user and pg_shadow, and cleans up a few other uses of 'row-level'
cases to be 'row level' in the docs.

Pointed out by Amit Kapila.

Catalog version bump due to system view changes.
2015-01-28 21:47:15 -05:00
Stephen Frost f8519a6a46 Clean up range-table building in copy.c
Commit 804b6b6db4 added the build of a
range table in copy.c to initialize the EState es_range_table since it
can be needed in error paths.  Unfortunately, that commit didn't
appreciate that some code paths might end up not initializing the rte
which is used to build the range table.

Fix that and clean up a couple others things along the way- build it
only once and don't explicitly set it on the !is_from path as it
doesn't make any sense there (cstate is palloc0'd, so this isn't an
issue from an initializing standpoint either).

The prior commit went back to 9.0, but this only goes back to 9.1 as
prior to that the range table build happens immediately after building
the RTE and therefore doesn't suffer from this issue.

Pointed out by Robert.
2015-01-28 17:42:28 -05:00
Stephen Frost 804b6b6db4 Fix column-privilege leak in error-message paths
While building error messages to return to the user,
BuildIndexValueDescription, ExecBuildSlotValueDescription and
ri_ReportViolation would happily include the entire key or entire row in
the result returned to the user, even if the user didn't have access to
view all of the columns being included.

Instead, include only those columns which the user is providing or which
the user has select rights on.  If the user does not have any rights
to view the table or any of the columns involved then no detail is
provided and a NULL value is returned from BuildIndexValueDescription
and ExecBuildSlotValueDescription.  Note that, for key cases, the user
must have access to all of the columns for the key to be shown; a
partial key will not be returned.

Further, in master only, do not return any data for cases where row
security is enabled on the relation and row security should be applied
for the user.  This required a bit of refactoring and moving of things
around related to RLS- note the addition of utils/misc/rls.c.

Back-patch all the way, as column-level privileges are now in all
supported versions.

This has been assigned CVE-2014-8161, but since the issue and the patch
have already been publicized on pgsql-hackers, there's no point in trying
to hide this commit.
2015-01-28 12:31:30 -05:00
Heikki Linnakangas acc2b1e843 Fix typo in comment. 2015-01-28 10:26:30 +02:00
Heikki Linnakangas 670bf71f65 Remove dead NULL-pointer checks in GiST code.
gist_poly_compress() and gist_circle_compress() checked for a NULL-pointer
key argument, but that was dead code; the gist code never passes a
NULL-pointer to the "compress" method.

This commit also removes a documentation note added in commit a0a3883,
about doing NULL-pointer checks in the "compress" method. It was added
based on the fact that some implementations were doing NULL-pointer
checks, but those checks were unnecessary in the first place.

The NULL-pointer check in gbt_var_same() function was also unnecessary.
The arguments to the "same" method come from the "compress", "union", or
"picksplit" methods, but none of them return a NULL pointer.

None of this is to be confused with SQL NULL values. Those are dealt with
by the gist machinery, and are never passed to the GiST opclass methods.

Michael Paquier
2015-01-28 10:03:58 +02:00
Tom Lane 1a2b2034d4 Fix NUMERIC field access macros to treat NaNs consistently.
Commit 145343534c arranged to store numeric
NaN values as short-header numerics, but the field access macros did not
get the memo: they thought only "SHORT" numerics have short headers.

Most of the time this makes no difference because we don't access the
weight or dscale of a NaN; but numeric_send does that.  As pointed out
by Andrew Gierth, this led to fetching uninitialized bytes.

AFAICS this could not have any worse consequences than that; in particular,
an unaligned stored numeric would have been detoasted by PG_GETARG_NUMERIC,
so that there's no risk of a fetch off the end of memory.  Still, the code
is wrong on its own terms, and it's not hard to foresee future changes that
might expose us to real risks.  So back-patch to all affected branches.
2015-01-27 12:06:31 -05:00
Robert Haas 168a809d4b Re-enable abbreviated keys on Windows.
Commit 1be4eb1b2d disabled this, but I
think the real problem here was fixed by commit
b181a91981 and commit
d060e07fa9.  So let's try re-enabling
it now and see what happens.
2015-01-26 14:28:14 -05:00
Tom Lane c58accd70b Fix volatile-safety issue in asyncQueueReadAllNotifications().
The "pos" variable is modified within PG_TRY and then referenced
within PG_CATCH, so for strict POSIX conformance it must be marked
volatile.  Superficially the code looked safe because pos's address
was taken, which was sufficient to force it into memory ... but it's
not sufficient to ensure that the compiler applies updates exactly
where the program text says to.  The volatility marking has to extend
into a couple of subroutines too, but I think that's probably a good
thing because the risk of out-of-order updates is mostly in those
subroutines not asyncQueueReadAllNotifications() itself.  In principle
the compiler could have re-ordered operations such that an error could
be thrown while "pos" had an incorrect value.

It's unclear how real the risk is here, but for safety back-patch
to all active branches.
2015-01-26 11:57:33 -05:00
Tom Lane c70f9e8988 Further cleanup of ReorderBufferCommit().
On closer inspection, we can remove the "volatile" qualifier on
"using_subtxn" so long as we initialize that before the PG_TRY block,
which there's no particularly good reason not to do.
Also, push the "change" variable inside the PG_TRY so as to remove
all question of whether it needs "volatile", and remove useless
early initializations of "snapshow_now" and "using_subtxn".
2015-01-25 22:49:56 -05:00
Tom Lane bf007a27ac Clean up assorted issues in ALTER SYSTEM coding.
Fix unsafe use of a non-volatile variable in PG_TRY/PG_CATCH in
AlterSystemSetConfigFile().  While at it, clean up a bundle of other
infelicities and outright bugs, including corner-case-incorrect linked list
manipulation, a poorly designed and worse documented parse-and-validate
function (which even included some randomly chosen hard-wired substitutes
for the specified elevel in one code path ... wtf?), direct use of open()
instead of fd.c's facilities, inadequate checking of write()'s return
value, and generally poorly written commentary.
2015-01-25 20:19:04 -05:00
Tom Lane fd496129d1 Clean up some mess in row-security patches.
Fix unsafe coding around PG_TRY in RelationBuildRowSecurity: can't change
a variable inside PG_TRY and then use it in PG_CATCH without marking it
"volatile".  In this case though it seems saner to avoid that by doing
a single assignment before entering the TRY block.

I started out just intending to fix that, but the more I looked at the
row-security code the more distressed I got.  This patch also fixes
incorrect construction of the RowSecurityPolicy cache entries (there was
not sufficient care taken to copy pass-by-ref data into the cache memory
context) and a whole bunch of sloppiness around the definition and use of
pg_policy.polcmd.  You can't use nulls in that column because initdb will
mark it NOT NULL --- and I see no particular reason why a null entry would
be a good idea anyway, so changing initdb's behavior is not the right
answer.  The internal value of '\0' wouldn't be suitable in a "char" column
either, so after a bit of thought I settled on using '*' to represent ALL.
Chasing those changes down also revealed that somebody wasn't paying
attention to what the underlying values of ACL_UPDATE_CHR etc really were,
and there was a great deal of lackadaiscalness in the catalogs.sgml
documentation for pg_policy and pg_policies too.

This doesn't pretend to be a complete code review for the row-security
stuff, it just fixes the things that were in my face while dealing with
the bugs in RelationBuildRowSecurity.
2015-01-24 16:16:22 -05:00
Tom Lane f8a4dd2e14 Fix unsafe coding in ReorderBufferCommit().
"iterstate" must be marked volatile since it's changed inside the PG_TRY
block and then used in the PG_CATCH stanza.  Noted by Mark Wilding of
Salesforce.  (We really need to see if we can't get the C compiler to warn
about this.)

Also, reset iterstate to NULL after the mainline ReorderBufferIterTXNFinish
call, to ensure the PG_CATCH block doesn't try to do that a second time.
2015-01-24 13:25:19 -05:00
Tom Lane 586dd5d6a5 Replace a bunch more uses of strncpy() with safer coding.
strncpy() has a well-deserved reputation for being unsafe, so make an
effort to get rid of nearly all occurrences in HEAD.

A large fraction of the remaining uses were passing length less than or
equal to the known strlen() of the source, in which case no null-padding
can occur and the behavior is equivalent to memcpy(), though doubtless
slower and certainly harder to reason about.  So just use memcpy() in
these cases.

In other cases, use either StrNCpy() or strlcpy() as appropriate (depending
on whether padding to the full length of the destination buffer seems
useful).

I left a few strncpy() calls alone in the src/timezone/ code, to keep it
in sync with upstream (the IANA tzcode distribution).  There are also a
few such calls in ecpg that could possibly do with more analysis.

AFAICT, none of these changes are more than cosmetic, except for the four
occurrences in fe-secure-openssl.c, which are in fact buggy: an overlength
source leads to a non-null-terminated destination buffer and ensuing
misbehavior.  These don't seem like security issues, first because no stack
clobber is possible and second because if your values of sslcert etc are
coming from untrusted sources then you've got problems way worse than this.
Still, it's undesirable to have unpredictable behavior for overlength
inputs, so back-patch those four changes to all active branches.
2015-01-24 13:05:42 -05:00
Robert Haas d1747571b6 Fix typos, update README.
Peter Geoghegan
2015-01-23 15:06:53 -05:00
Robert Haas 5cefbf5a6c Don't use abbreviated keys for the final merge pass.
When we write tuples out to disk and read them back in, the abbreviated
keys become non-abbreviated, because the readtup routines don't know
anything about abbreviation.  But without this fix, the rest of the
code still thinks the abbreviation-aware compartor should be used,
so chaos ensues.

Report by Andrew Gierth; patch by Peter Geoghegan.
2015-01-23 11:58:31 -05:00
Robert Haas 6a3c6ba0ba Add an explicit cast to Size to hyperloglog.c
MSVC generates a warning here; we hope this will make it happy.

Report by Michael Paquier.  Patch by David Rowley.
2015-01-23 11:44:51 -05:00
Tom Lane eb213acfe2 Prevent duplicate escape-string warnings when using pg_stat_statements.
contrib/pg_stat_statements will sometimes run the core lexer a second time
on submitted statements.  Formerly, if you had standard_conforming_strings
turned off, this led to sometimes getting two copies of any warnings
enabled by escape_string_warning.  While this is probably no longer a big
deal in the field, it's a pain for regression testing.

To fix, change the lexer so it doesn't consult the escape_string_warning
GUC variable directly, but looks at a copy in the core_yy_extra_type state
struct.  Then, pg_stat_statements can change that copy to disable warnings
while it's redoing the lexing.

It seemed like a good idea to make this happen for all three of the GUCs
consulted by the lexer, not just escape_string_warning.  There's not an
immediate use-case for callers to adjust the other two AFAIK, but making
it possible is easy enough and seems like good future-proofing.

Arguably this is a bug fix, but there doesn't seem to be enough interest to
justify a back-patch.  We'd not be able to back-patch exactly as-is anyway,
for fear of breaking ABI compatibility of the struct.  (We could perhaps
back-patch the addition of only escape_string_warning by adding it at the
end of the struct, where there's currently alignment padding space.)
2015-01-22 18:11:00 -05:00
Peter Eisentraut f5f2c2de16 Fix whitespace 2015-01-22 16:57:16 -05:00
Alvaro Herrera 972bf7d6f1 Tweak BRIN minmax operator class
In the union support proc, we were not checking the hasnulls flag of
value A early enough, so it could be skipped if the "allnulls" flag in
value B is set.  Also, a check on the allnulls flag of value "B" was
redundant, so remove it.

Also change inet_minmax_ops to not be the default opclass for type inet,
as a future inclusion operator class would be more useful and it's
pretty difficult to change default opclass for a datatype later on.
(There is no catversion bump for this catalog change; this shouldn't be
a problem.)

Extracted from a larger patch to add an "inclusion" operator class.

Author: Emre Hasegeli
2015-01-22 17:01:09 -03:00
Robert Haas d060e07fa9 Repair brain fade in commit b181a91981.
The split between which things need to happen in the C-locale case and
which needed to happen in the locale-aware case was a few bricks short
of a load.  Try to fix that.
2015-01-22 12:51:20 -05:00
Bruce Momjian 59367fdf97 adjust ACL owners for REASSIGN and ALTER OWNER TO
When REASSIGN and ALTER OWNER TO are used, both the object owner and ACL
list should be changed from the old owner to the new owner. This patch
fixes types, foreign data wrappers, and foreign servers to change their
ACL list properly;  they already changed owners properly.

BACKWARD INCOMPATIBILITY?

Report by Alexey Bashtanov
2015-01-22 12:36:55 -05:00
Robert Haas b181a91981 More fixes for abbreviated keys infrastructure.
First, when LC_COLLATE = C, bttext_abbrev_convert should use memcpy()
rather than strxfrm() to construct the abbreviated key, because the
authoritative comparator uses memcpy().  If we do anything else here,
we might get inconsistent answers, and the buildfarm says this risk
is not theoretical.  It should be faster this way, too.

Second, while I'm looking at bttext_abbrev_convert, convert a needless
use of goto into the loop it's trying to implement into an actual
loop.

Both of the above problems date to the original commit of abbreviated
keys, commit 4ea51cdfe8.

Third, fix a bogus assignment to tss->locale before tss is set up.
That's a new goof in commit b529b65d1b.
2015-01-22 11:58:58 -05:00
Robert Haas b529b65d1b Heavily refactor btsortsupport_worker.
Prior to commit 4ea51cdfe8, this function
only had one job, which was to decide whether we could avoid trampolining
through the fmgr layer when performing sort comparisons.  As of that
commit, it has a second job, which is to decide whether we can use
abbreviated keys.  Unfortunately, those two tasks are somewhat intertwined
in the existing coding, which is likely why neither Peter Geoghegan nor
I noticed prior to commit that this calls pg_newlocale_from_collation() in
cases where it didn't previously.  The buildfarm noticed, though.

To fix, rewrite the logic so that the decision as to which comparator to
use is more cleanly separated from the decision about abbreviation.
2015-01-22 10:54:16 -05:00
Robert Haas 1be4eb1b2d Disable abbreviated keys on Windows.
Most of the Windows buildfarm members (bowerbird, hamerkop, currawong,
jacana, brolga) are unhappy with yesterday's abbreviated keys patch,
although there are some (narwhal, frogmouth) that seem OK with it.
Since there's no obvious pattern to explain why some are working and
others are failing, just disable this across-the-board on Windows for
now.  This is a bit unfortunate since the optimization will be a big
win in some cases, but we can't leave the buildfarm broken.
2015-01-20 20:32:21 -05:00
Tom Lane 75b48e1fff Adjust "pgstat wait timeout" message to be a translatable LOG message.
Per discussion, change the log level of this message to be LOG not WARNING.
The main point of this change is to avoid causing buildfarm run failures
when the stats collector is exceptionally slow to respond, which it not
infrequently is on some of the smaller/slower buildfarm members.

This change does lose notice to an interactive user when his stats query
is looking at out-of-date stats, but the majority opinion (not necessarily
that of yours truly) is that WARNING messages would probably not get
noticed anyway on heavily loaded production systems.  A LOG message at
least ensures that the problem is recorded somewhere where bulk auditing
for the issue is possible.

Also, instead of an untranslated "pgstat wait timeout" message, provide
a translatable and hopefully more understandable message "using stale
statistics instead of current ones because stats collector is not
responding".  The original text was written hastily under the assumption
that it would never really happen in practice, which we now know to be
unduly optimistic.

Back-patch to all active branches, since we've seen the buildfarm issue
in all branches.
2015-01-19 23:01:33 -05:00
Andres Freund 2d115e47c8 Fix various shortcomings of the new PrivateRefCount infrastructure.
As noted by Tom Lane the improvements in 4b4b680c3d had the problem
that in some situations we searched, entered and modified entries in
the private refcount hash while holding a spinlock. I had tried to
keep the logic entirely local to PinBuffer_Locked(), but that's not
really possible given it's called with a spinlock held...

Besides being disadvantageous from a performance point of view, this
also has problems with error handling safety. If we failed inserting
an entry into the hashtable due to an out of memory error, we'd error
out with a held spinlock. Not good.

Change the way private refcounts are manipulated: Before a buffer can
be tracked an entry has to be reserved using
ReservePrivateRefCountEntry(); then, if a entry is not found using
GetPrivateRefCountEntry(), it can be entered with
NewPrivateRefCountEntry().

Also take advantage of the fact that PinBuffer_Locked() currently is
never called for buffers that already have been pinned by the current
backend and don't search the private refcount entries for preexisting
local pins. That results in a small, but measurable, performance
improvement.

Additionally make ReleaseBuffer() always call UnpinBuffer() for shared
buffers. That avoids duplicating work in an eventual UnpinBuffer()
call that already has been done in ReleaseBuffer() and also saves some
code.

Per discussion with Tom Lane.

Discussion: 15028.1418772313@sss.pgh.pa.us
2015-01-19 23:59:41 +01:00
Robert Haas 4ea51cdfe8 Use abbreviated keys for faster sorting of text datums.
This commit extends the SortSupport infrastructure to allow operator
classes the option to provide abbreviated representations of Datums;
in the case of text, we abbreviate by taking the first few characters
of the strxfrm() blob.  If the abbreviated comparison is insufficent
to resolve the comparison, we fall back on the normal comparator.
This can be much faster than the old way of doing sorting if the
first few bytes of the string are usually sufficient to resolve the
comparison.

There is the potential for a performance regression if all of the
strings to be sorted are identical for the first 8+ characters and
differ only in later positions; therefore, the SortSupport machinery
now provides an infrastructure to abort the use of abbreviation if
it appears that abbreviation is producing comparatively few distinct
keys.  HyperLogLog, a streaming cardinality estimator, is included in
this commit and used to make that determination for text.

Peter Geoghegan, reviewed by me.
2015-01-19 15:28:27 -05:00
Robert Haas 1605291b6c Typo fix.
Etsuro Fujita
2015-01-19 11:36:48 -05:00
Robert Haas 9d54b93239 BRIN typo fix.
Amit Langote
2015-01-19 08:34:29 -05:00
Tom Lane 75df6dc083 Fix ancient thinko in default table rowcount estimation.
The code used sizeof(ItemPointerData) where sizeof(ItemIdData) is correct,
since we're trying to account for a tuple's line pointer.  Spotted by
Tomonari Katsumata (bug #12584).

Although this mistake is of very long standing, no back-patch, since it's
a relatively harmless error and changing it would risk changing default
planner behavior in stable branches.  (I don't see any change in regression
test outputs here, but the buildfarm may think differently.)
2015-01-18 17:04:11 -05:00
Andres Freund ff44fba46c Replace walsender's latch with the general shared latch.
Relying on the normal shared latch simplifies interrupt/signal
handling because we can rely on all signal handlers setting the proc
latch. That in turn allows us to avoid the use of
ImmediateInterruptOK, which arguably isn't correct because
WaitLatchOrSocket isn't declared to be immediately interruptible.

Also change sections that wait on the walsender's latch to notice
interrupts quicker/more reliably and make them more consistent with
each other.

This is part of a larger "get rid of ImmediateInterruptOK" series.

Discussion: 20150115020335.GZ5245@awork2.anarazel.de
2015-01-17 13:00:42 +01:00
Tom Lane 20af53d719 Show sort ordering options in EXPLAIN output.
Up to now, EXPLAIN has contented itself with printing the sort expressions
in a Sort or Merge Append plan node.  This patch improves that by
annotating the sort keys with COLLATE, DESC, USING, and/or NULLS FIRST/LAST
whenever nondefault sort ordering options are used.  The output is now a
reasonably close approximation of an ORDER BY clause equivalent to the
plan's ordering.

Marius Timmer, Lukas Kreft, and Arne Scheffer; reviewed by Mike Blackwell.
Some additional hacking by me.
2015-01-16 18:19:00 -05:00
Heikki Linnakangas 9402869160 Advance backend's advertised xmin more aggressively.
Currently, a backend will reset it's PGXACT->xmin value when it doesn't
have any registered snapshots left. That covered the common case that a
transaction in read committed mode runs several queries, one after each
other, as there would be no snapshots active between those queries.
However, if you hold cursors across each of the query, we didn't get a
chance to reset xmin.

To make that better, keep all the registered snapshots in a pairing heap,
ordered by xmin so that it's always quick to find the snapshot with the
smallest xmin. That allows us to advance PGXACT->xmin whenever the oldest
snapshot is deregistered, even if there are others still active.

Per discussion originally started by Jeff Davis back in 2009 and more
recently by Robert Haas.
2015-01-17 01:15:23 +02:00
Tom Lane 779fdcdeee Improve new caching logic in tbm_add_tuples().
For no significant extra complexity, we can cache knowledge that the
target page is lossy, and save a hash_search per iteration in that
case as well.  This probably makes little difference, since the extra
rechecks that must occur when pages are lossy are way more expensive
than anything we can save here ... but we might as well do it if we're
going to cache anything.
2015-01-16 13:28:30 -05:00
Andres Freund f5ae3ba482 Make tbm_add_tuples more efficient by caching the last acccessed page.
When adding a large number of tuples to a TID bitmap using
tbm_add_tuples() sometimes a lot of time was spent looking up a page's
entry in the bitmap's internal hashtable.

Improve efficiency by caching the last accessed page, while iterating
over the passed in tuples, hoping consecutive tuples will often be on
the same page.  In many cases that's a good bet, and in the rest the
added overhead isn't big.

Discussion: 54479A85.8060309@sigaev.ru

Author: Teodor Sigaev
Reviewed-By: David Rowley
2015-01-16 17:47:59 +01:00
Tom Lane c480cb9d24 Fix use-of-already-freed-memory problem in EvalPlanQual processing.
Up to now, the "child" executor state trees generated for EvalPlanQual
rechecks have simply shared the ResultRelInfo arrays used for the original
execution tree.  However, this leads to dangling-pointer problems, because
ExecInitModifyTable() is all too willing to scribble on some fields of the
ResultRelInfo(s) even when it's being run in one of those child trees.
This trashes those fields from the perspective of the parent tree, because
even if the generated subtree is logically identical to what was in use in
the parent, it's in a memory context that will go away when we're done
with the child state tree.

We do however want to share information in the direction from the parent
down to the children; in particular, fields such as es_instrument *must*
be shared or we'll lose the stats arising from execution of the children.
So the simplest fix is to make a copy of the parent's ResultRelInfo array,
but not copy any fields back at end of child execution.

Per report from Manuel Kniep.  The added isolation test is based on his
example.  In an unpatched memory-clobber-enabled build it will reliably
fail with "ctid is NULL" errors in all branches back to 9.1, as a
consequence of junkfilter->jf_junkAttNo being overwritten with $7f7f.
This test cannot be run as-is before that for lack of WITH syntax; but
I have no doubt that some variant of this problem can arise in older
branches, so apply the code change all the way back.
2015-01-15 18:52:58 -05:00
Heikki Linnakangas 49b04188f8 Fix thinko in re-setting wal_log_hints flag from a parameter-change record.
The flag is supposed to be copied from the record. Same issue with
track_commit_timestamps, but that's master-only.

Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was
added.
2015-01-15 20:52:41 +02:00
Tom Lane 8e166e164c Rearrange explain.c's API so callers need not embed sizeof(ExplainState).
The folly of the previous arrangement was just demonstrated: there's no
convenient way to add fields to ExplainState without breaking ABI, even
if callers have no need to touch those fields.  Since we might well need
to do that again someday in back branches, let's change things so that
only explain.c has to have sizeof(ExplainState) compiled into it.  This
costs one extra palloc() per EXPLAIN operation, which is surely pretty
negligible.
2015-01-15 13:39:33 -05:00
Tom Lane a5cd70dcbc Improve performance of EXPLAIN with large range tables.
As of 9.3, ruleutils.c goes to some lengths to ensure that table and column
aliases used in its output are unique.  Of course this takes more time than
was required before, which in itself isn't fatal.  However, EXPLAIN was set
up so that recalculation of the unique aliases was repeated for each
subexpression printed in a plan.  That results in O(N^2) time and memory
consumption for large plan trees, which did not happen in older branches.

Fortunately, the expensive work is the same across a whole plan tree,
so there is no need to repeat it; we can do most of the initialization
just once per query and re-use it for each subexpression.  This buys
back most (not all) of the performance loss since 9.2.

We need an extra ExplainState field to hold the precalculated deparse
context.  That's no problem in HEAD, but in the back branches, expanding
sizeof(ExplainState) seems risky because third-party extensions might
have local variables of that struct type.  So, in 9.4 and 9.3, introduce
an auxiliary struct to keep sizeof(ExplainState) the same.  We should
refactor the APIs to avoid such local variables in future, but that's
material for a separate HEAD-only commit.

Per gripe from Alexey Bashtanov.  Back-patch to 9.3 where the issue
was introduced.
2015-01-15 13:18:12 -05:00