Commit Graph

13835 Commits

Author SHA1 Message Date
Tom Lane 74242c23c1 Clear retry flags properly in replacement OpenSSL sock_write function.
Current OpenSSL code includes a BIO_clear_retry_flags() step in the
sock_write() function.  Either we failed to copy the code correctly, or
they added this since we copied it.  In any case, lack of the clear step
appears to be the cause of the server lockup after connection loss reported
in bug #8647 from Valentine Gogichashvili.  Assume that this is correct
coding for all OpenSSL versions, and hence back-patch to all supported
branches.

Diagnosis and patch by Alexander Kukushkin.
2013-12-05 12:48:28 -05:00
Alvaro Herrera 07aeb1fec5 Avoid resetting Xmax when it's a multi with an aborted update
HeapTupleSatisfiesUpdate can very easily "forget" tuple locks while
checking the contents of a multixact and finding it contains an aborted
update, by setting the HEAP_XMAX_INVALID bit.  This would lead to
concurrent transactions not noticing any previous locks held by
transactions that might still be running, and thus being able to acquire
subsequent locks they wouldn't be normally able to acquire.

This bug was introduced in commit 1ce150b7bb; backpatch this fix to 9.3,
like that commit.

This change reverts the change to the delete-abort-savept isolation test
in 1ce150b7bb, because that behavior change was caused by this bug.

Noticed by Andres Freund while investigating a different issue reported
by Noah Misch.
2013-12-05 12:21:55 -03:00
Heikki Linnakangas 9e857436ef Don't include unused space in LOG_NEWPAGE records.
This is the same trick we use when taking a full page image of a buffer
passed to XLogInsert.
2013-12-04 00:10:47 +02:00
Heikki Linnakangas 22122c83f1 Fix full-page writes of internal GIN pages.
Insertion to a non-leaf GIN page didn't make a full-page image of the page,
which is wrong. The code used to do it correctly, but was changed (commit
853d1c3103) because the redo-routine didn't
track incomplete splits correctly when the page was restored from a full
page image. Of course, that was not right way to fix it, the redo routine
should've been fixed instead. The redo-routine was surreptitiously fixed
in 2010 (commit 4016bdef8a), so all we need
to do now is revert the code that creates the record to its original form.

This doesn't change the format of the WAL record.

Backpatch to all supported versions.
2013-12-03 23:16:01 +02:00
Peter Eisentraut fef88b3fda Report exit code from external recovery commands properly
When an external recovery command such as restore_command or
archive_cleanup_command fails, report the exit code properly,
distinguishing signals and normal exists, using the existing
wait_result_to_str() facility, instead of just reporting the return
value from system().

Reviewed-by: Peter Geoghegan <pg@heroku.com>
2013-12-02 22:31:05 -05:00
Tom Lane 7ab321404c Fix crash in assign_collations_walker for EXISTS with empty SELECT list.
We (I think I, actually) forgot about this corner case while coding
collation resolution.  Per bug #8648 from Arjen Nienhuis.
2013-12-02 20:28:45 -05:00
Robert Haas c6d4b1dd3e Flag mmap implemenation of dynamic shared memory as resize-capable.
Error noted by Heikki Linnakangas
2013-12-02 11:18:54 -05:00
Robert Haas a8656a3ab0 Make NUM_TOCHAR_prepare and NUM_TOCHAR_finish macros declare "len".
Remove the variable from the enclosing scopes so that nothing can be
relying on it.  The net result of this refactoring is that we get rid
of a few unnecessary strlen() calls.

Original patch from Greg Jaskiewicz, substantially expanded by me.
2013-12-02 10:51:06 -05:00
Robert Haas 9d140f7be2 Avoid out-of-bounds read in errfinish if error_stack_depth < 0.
If errordata_stack_depth < 0, we won't find that out and correct the
problem until CHECK_STACK_DEPTH() is invoked.  In the meantime,
elevel will be set based on an invalid read.  This is probably
harmless in practice, but it seems cleaner this way.

Xi Wang
2013-12-02 10:42:01 -05:00
Peter Eisentraut 3e3520cf7a Translation updates 2013-12-02 00:17:07 -05:00
Alvaro Herrera 2393c7d102 Fix a couple of bugs in MultiXactId freezing
Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look
into a multixact to check the members against cutoff_xid.  This means
that a very old Xid could survive hidden within a multi, possibly
outliving its CLOG storage.  In the distant future, this would cause
clog lookup failures:
ERROR:  could not access status of transaction 3883960912
DETAIL:  Could not open file "pg_clog/0E78": No such file or directory.

This mostly was problematic when the updating transaction aborted, since
in that case the row wouldn't get pruned away earlier in vacuum and the
multixact could possibly survive for a long time.  In many cases, data
that is inaccessible for this reason way can be brought back
heuristically.

As a second bug, heap_freeze_tuple() didn't properly handle multixacts
that need to be frozen according to cutoff_multi, but whose updater xid
is still alive.  Instead of preserving the update Xid, it just set Xmax
invalid, which leads to both old and new tuple versions becoming
visible.  This is pretty rare in practice, but a real threat
nonetheless.  Existing corrupted rows, unfortunately, cannot be repaired
in an automated fashion.

Existing physical replicas might have already incorrectly frozen tuples
because of different behavior than in master, which might only become
apparent in the future once pg_multixact/ is truncated; it is
recommended that all clones be rebuilt after upgrading.

Following code analysis caused by bug report by J Smith in message
CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
and privately by F-Secure.

Backpatch to 9.3, where freezing of MultiXactIds was introduced.

Analysis and patch by Andres Freund, with some tweaks by Álvaro.
2013-11-29 21:47:25 -03:00
Alvaro Herrera 1ce150b7bb Don't TransactionIdDidAbort in HeapTupleGetUpdateXid
It is dangerous to do so, because some code expects to be able to see what's
the true Xmax even if it is aborted (particularly while traversing HOT
chains).  So don't do it, and instead rely on the callers to verify for
abortedness, if necessary.

Several race conditions and bugs fixed in the process.  One isolation test
changes the expected output due to these.

This also reverts commit c235a6a589, which is no longer necessary.

Backpatch to 9.3, where this function was introduced.

Andres Freund
2013-11-29 21:47:21 -03:00
Alvaro Herrera 1df0122daa Truncate pg_multixact/'s contents during crash recovery
Commit 9dc842f08 of 8.2 era prevented MultiXact truncation during crash
recovery, because there was no guarantee that enough state had been
setup, and because it wasn't deemed to be a good idea to remove data
during crash recovery anyway.  Since then, due to Hot-Standby, streaming
replication and PITR, the amount of time a cluster can spend doing crash
recovery has increased significantly, to the point that a cluster may
even never come out of it.  This has made not truncating the content of
pg_multixact/ not defensible anymore.

To fix, take care to setup enough state for multixact truncation before
crash recovery starts (easy since checkpoints contain the required
information), and move the current end-of-recovery actions to a new
TrimMultiXact() function, analogous to TrimCLOG().

At some later point, this should probably done similarly to the way
clog.c is doing it, which is to just WAL log truncations, but we can't
do that for the back branches.

Back-patch to 9.0.  8.4 also has the problem, but since there's no hot
standby there, it's much less pressing.  In 9.2 and earlier, this patch
is simpler than in newer branches, because multixact access during
recovery isn't required.  Add appropriate checks to make sure that's not
happening.

Andres Freund
2013-11-29 21:47:15 -03:00
Alvaro Herrera f54106f77e Fix full-table-vacuum request mechanism for MultiXactIds
While autovacuum dutifully launched anti-multixact-wraparound vacuums
when the multixact "age" was reached, the vacuum code was not aware that
it needed to make them be full table vacuums.  As the resulting
partial-table vacuums aren't capable of actually increasing relminmxid,
autovacuum continued to launch anti-wraparound vacuums that didn't have
the intended effect, until age of relfrozenxid caused the vacuum to
finally be a full table one via vacuum_freeze_table_age.

To fix, introduce logic for multixacts similar to that for plain
TransactionIds, using the same GUCs.

Backpatch to 9.3, where permanent MultiXactIds were introduced.

Andres Freund, some cleanup by Álvaro
2013-11-29 21:47:13 -03:00
Alvaro Herrera 76a31c689c Replace hardcoded 200000000 with autovacuum_freeze_max_age
Parts of the code used autovacuum_freeze_max_age to determine whether
anti-multixact-wraparound vacuums are necessary, while others used a
hardcoded 200000000 value.  This leads to problems when
autovacuum_freeze_max_age is set to a non-default value.  Use the latter
everywhere.

Backpatch to 9.3, where vacuuming of multixacts was introduced.

Andres Freund
2013-11-29 21:47:09 -03:00
Tom Lane 8b151558c8 Be sure to release proc->backendLock after SetupLockInTable() failure.
The various places that transferred fast-path locks to the main lock table
neglected to release the PGPROC's backendLock if SetupLockInTable failed
due to being out of shared memory.  In most cases this is no big deal since
ensuing error cleanup would release all held LWLocks anyway.  But there are
some hot-standby functions that don't consider failure of
FastPathTransferRelationLocks to be a hard error, and in those cases this
oversight could lead to system lockup.  For consistency, make all of these
places look the same as FastPathTransferRelationLocks.

Noted while looking for the cause of Dan Wood's bugs --- this wasn't it,
but it's a bug anyway.
2013-11-29 17:35:09 -05:00
Tom Lane 16e1b7a1b7 Fix assorted race conditions in the new timeout infrastructure.
Prevent handle_sig_alarm from losing control partway through due to a query
cancel (either an asynchronous SIGINT, or a cancel triggered by one of the
timeout handler functions).  That would at least result in failure to
schedule any required future interrupt, and might result in actual
corruption of timeout.c's data structures, if the interrupt happened while
we were updating those.

We could still lose control if an asynchronous SIGINT arrives just as the
function is entered.  This wouldn't break any data structures, but it would
have the same effect as if the SIGALRM interrupt had been silently lost:
we'd not fire any currently-due handlers, nor schedule any new interrupt.
To forestall that scenario, forcibly reschedule any pending timer interrupt
during AbortTransaction and AbortSubTransaction.  We can avoid any extra
kernel call in most cases by not doing that until we've allowed
LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events.

Another hazard is that some platforms (at least Linux and *BSD) block a
signal before calling its handler and then unblock it on return.  When we
longjmp out of the handler, the unblock doesn't happen, and the signal is
left blocked indefinitely.  Again, we can fix that by forcibly unblocking
signals during AbortTransaction and AbortSubTransaction.

These latter two problems do not manifest when the longjmp reaches
postgres.c, because the error recovery code there kills all pending timeout
events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal
mask is restored.  So errors thrown outside any transaction should be OK
already, and cleaning up in AbortTransaction and AbortSubTransaction should
be enough to fix these issues.  (We're assuming that any code that catches
a query cancel error and doesn't re-throw it will do at least a
subtransaction abort to clean up; but that was pretty much required already
by other subsystems.)

Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when
disabling that event: if a lock timeout interrupt happened after the lock
was granted, the ensuing query cancel is still going to happen at the next
CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user
cancel.

Per reports from Dan Wood.

Back-patch to 9.3 where the new timeout handling infrastructure was
introduced.  We may at some point decide to back-patch the signal
unblocking changes further, but I'll desist from that until we hear
actual field complaints about it.
2013-11-29 16:41:00 -05:00
Robert Haas 8e18d04d4d Refine our definition of what constitutes a system relation.
Although user-defined relations can't be directly created in
pg_catalog, it's possible for them to end up there, because you can
create them in some other schema and then use ALTER TABLE .. SET SCHEMA
to move them there.  Previously, such relations couldn't afterwards
be manipulated, because IsSystemRelation()/IsSystemClass() rejected
all attempts to modify objects in the pg_catalog schema, regardless
of their origin.  With this patch, they now reject only those
objects in pg_catalog which were created at initdb-time, allowing
most operations on user-created tables in pg_catalog to proceed
normally.

This patch also adds new functions IsCatalogRelation() and
IsCatalogClass(), which is similar to IsSystemRelation() and
IsSystemClass() but with a slightly narrower definition: only TOAST
tables of system catalogs are included, rather than *all* TOAST tables.
This is currently used only for making decisions about when
invalidation messages need to be sent, but upcoming logical decoding
patches will find other uses for this information.

Andres Freund, with some modifications by me.
2013-11-28 20:57:20 -05:00
Heikki Linnakangas 2fe69cacff Another gin_desc fix.
The number of items inserted was incorrectly printed as if it was a boolean.
2013-11-28 23:35:50 +02:00
Heikki Linnakangas 97c19e6c38 Fix gin_desc routine to match the WAL format.
In the GIN incomplete-splits patch, I used BlockIdDatas to store the block
number of left and right children, when inserting a downlink after a split
to an internal page posting list page. But gin_desc thought they were stored
as BlockNumbers.
2013-11-28 21:57:42 +02:00
Tom Lane da8a716089 Fix latent(?) race condition in LockReleaseAll.
We have for a long time checked the head pointer of each of the backend's
proclock lists and skipped acquiring the corresponding locktable partition
lock if the head pointer was NULL.  This was safe enough in the days when
proclock lists were changed only by the owning backend, but it is pretty
questionable now that the fast-path patch added cases where backends add
entries to other backends' proclock lists.  However, we don't really wish
to revert to locking each partition lock every time, because in simple
transactions that would add a lot of useless lock/unlock cycles on
already-heavily-contended LWLocks.  Fortunately, the only way that another
backend could be modifying our proclock list at this point would be if it
was promoting a formerly fast-path lock of ours; and any such lock must be
one that we'd decided not to delete in the previous loop over the locallock
table.  So it's okay if we miss seeing it in this loop; we'd just decide
not to delete it again.  However, once we've detected a non-empty list,
we'd better re-fetch the list head pointer after acquiring the partition
lock.  This guards against possibly fetching a corrupt-but-non-null pointer
if pointer fetch/store isn't atomic.  It's not clear if any practical
architectures are like that, but we've never assumed that before and don't
wish to start here.  In any case, the situation certainly deserves a code
comment.

While at it, refactor the partition traversal loop to use a for() construct
instead of a while() loop with goto's.

Back-patch, just in case the risk is real and not hypothetical.
2013-11-28 12:17:46 -05:00
Alvaro Herrera d51a8c52ba Unbreak buildfarm
I removed an intermediate commit before pushing and forgot to test the
resulting tree :-(
2013-11-28 12:59:45 -03:00
Alvaro Herrera 247c76a989 Use a more granular approach to follow update chains
Instead of simply checking the KEYS_UPDATED bit, we need to check
whether each lock held on the future version of the tuple conflicts with
the lock we're trying to acquire.

Per bug report #8434 by Tomonari Katsumata
2013-11-28 12:00:12 -03:00
Alvaro Herrera e4828e9ccb Compare Xmin to previous Xmax when locking an update chain
Not doing so causes us to traverse an update chain that has been broken
by concurrent page pruning.  All other code that traverses update chains
uses this check as one of the cases in which to stop iterating, so
replicate it here too.  Failure to do so leads to erroneous CLOG,
subtrans or multixact lookups.

Per discussion following the bug report by J Smith in
CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
as diagnosed by Andres Freund.
2013-11-28 12:00:12 -03:00
Alvaro Herrera c235a6a589 Don't try to set InvalidXid as page pruning hint
If a transaction updates/deletes a tuple just before aborting, and a
concurrent transaction tries to prune the page concurrently, the pruner
may see HeapTupleSatisfiesVacuum return HEAPTUPLE_DELETE_IN_PROGRESS,
but a later call to HeapTupleGetUpdateXid() return InvalidXid.  This
would cause an assertion failure in development builds, but would be
otherwise Mostly Harmless.

Fix by checking whether the updater Xid is valid before trying to apply
it as page prune point.

Reported by Andres in 20131124000203.GA4403@alap2.anarazel.de
2013-11-28 12:00:12 -03:00
Alvaro Herrera e518fa7adf Cope with heap_fetch failure while locking an update chain
The reason for the fetch failure is that the tuple was removed because
it was dead; so the failure is innocuous and can be ignored.  Moreover,
there's no need for further work and we can return success to the caller
immediately.  EvalPlanQualFetch is doing something very similar to this
already.

Report and test case from Andres Freund in
20131124000203.GA4403@alap2.anarazel.de
2013-11-28 12:00:12 -03:00
Tom Lane 7db285afc9 Fix stale-pointer problem in fast-path locking logic.
When acquiring a lock in fast-path mode, we must reset the locallock
object's lock and proclock fields to NULL.  They are not necessarily that
way to start with, because the locallock could be left over from a failed
lock acquisition attempt earlier in the transaction.  Failure to do this
led to all sorts of interesting misbehaviors when LockRelease tried to
clean up no-longer-related lock and proclock objects in shared memory.
Per report from Dan Wood.

In passing, modify LockRelease to elog not just Assert if it doesn't find
lock and proclock objects for a formerly fast-path lock, matching the code
in FastPathGetRelationLockEntry and LockRefindAndRelease.  This isn't a
bug but it will help in diagnosing any future bugs in this area.

Also, modify FastPathTransferRelationLocks and FastPathGetRelationLockEntry
to break out of their loops over the fastpath array once they've found the
sole matching entry.  This was inconsistently done in some search loops
and not others.

Improve assorted related comments, too.

Back-patch to 9.2 where the fast-path mechanism was introduced.
2013-11-27 18:10:00 -05:00
Tom Lane 8c84803e14 Minor corrections in lmgr/README.
Correct an obsolete statement that no backend touches another backend's
PROCLOCK lists.  This was probably wrong even when written (the deadlock
checker looks at everybody's lists), and it's certainly quite wrong now
that fast-path locking can require creation of lock and proclock objects
on behalf of another backend.  Also improve some statements in the hot
standby explanation, and do one or two other trivial bits of wordsmithing/
reformatting.
2013-11-27 15:07:13 -05:00
Heikki Linnakangas 631118fe1e Get rid of the post-recovery cleanup step of GIN page splits.
Replace it with an approach similar to what GiST uses: when a page is split,
the left sibling is marked with a flag indicating that the parent hasn't been
updated yet. When the parent is updated, the flag is cleared. If an insertion
steps on a page with the flag set, it will finish split before proceeding
with the insertion.

The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.

This also divides the responsibility of WAL-logging more clearly between
the generic ginbtree.c code, and the parts specific to entry and posting
trees. There is now a common WAL record format for insertions and deletions,
which is written by ginbtree.c, followed by tree-specific payload, which is
returned by the placetopage- and split- callbacks.
2013-11-27 19:21:23 +02:00
Heikki Linnakangas ce5326eed3 More GIN refactoring.
Separate the insertion payload from the more static portions of GinBtree.
GinBtree now only contains information related to searching the tree, and
the information of what to insert is passed separately.

Add root block number to GinBtree, instead of passing it around all the
functions as argument.

Split off ginFinishSplit() from ginInsertValue(). ginFinishSplit is
responsible for finding the parent and inserting the downlink to it.
2013-11-27 15:43:05 +02:00
Heikki Linnakangas 82b43f7df2 Don't update relfrozenxid if any pages were skipped.
Vacuum recognizes that it can update relfrozenxid by checking whether it has
processed all pages of a relation. Unfortunately it performed that check
after truncating the dead pages at the end of the relation, and used the new
number of pages to decide whether all pages have been scanned. If the new
number of pages happened to be smaller or equal to the number of pages
scanned, it incorrectly decided that all pages were scanned.

This can lead to relfrozenxid being updated, even though some pages were
skipped that still contain old XIDs. That can lead to data loss due to xid
wraparounds with some rows suddenly missing. This likely has escaped notice
so far because it takes a large number (~2^31) of xids being used to see the
effect, while a full-table vacuum before that would fix the issue.

The incorrect logic was introduced by commit
b4b6923e03. Backpatch this fix down to 8.4,
like that commit.

Andres Freund, with some modifications by me.
2013-11-27 13:43:27 +02:00
Peter Eisentraut 85ed91ee7d Implement information_schema.parameters.parameter_default column
Reviewed-by: Ali Dar <ali.munir.dar@gmail.com>
Reviewed-by: Amit Khandekar <amit.khandekar@enterprisedb.com>
Reviewed-by: Rodolfo Campero <rodolfo.campero@anachronics.com>
2013-11-26 23:21:35 -05:00
Jeff Davis 7cc0ba9f17 Add missing entry for session_preload_libraries in sample config.
The omission was apparently an oversight in the original patch.
2013-11-25 21:03:07 -08:00
Bruce Momjian a6542a4b68 Change SET LOCAL/CONSTRAINTS/TRANSACTION and ABORT behavior
Change SET LOCAL/CONSTRAINTS/TRANSACTION behavior outside of a
transaction block from error (post-9.3) to warning.  (Was nothing in <=
9.3.)  Also change ABORT outside of a transaction block from notice to
warning.
2013-11-25 19:19:40 -05:00
Jeff Davis 559d535819 Lessen library-loading log level.
Previously, messages were emitted at the LOG level every time a
backend preloaded a library. That was acceptable (though unnecessary)
for shared_preload_libraries; but it was excessive for
local_preload_libraries and session_preload_libraries. Reduce to
DEBUG1.

Also, there was logic in the EXEC_BACKEND case to avoid repeated
messages for shared_preload_libraries by demoting them to
DEBUG2. DEBUG1 seems more appropriate there, as well, so eliminate
that special case.

Peter Geoghegan.
2013-11-24 10:50:54 -08:00
Tom Lane 36a3be6540 Fix new and latent bugs with errno handling in secure_read/secure_write.
These functions must be careful that they return the intended value of
errno to their callers.  There were several scenarios where this might
not happen:

1. The recent SSL renegotiation patch added a hunk of code that would
execute after setting errno.  In the first place, it's doubtful that we
should consider renegotiation to be successfully completed after a failure,
and in the second, there's no real guarantee that the called OpenSSL
routines wouldn't clobber errno.  Fix by not executing that hunk except
during success exit.

2. errno was left in an unknown state in case of an unrecognized return
code from SSL_get_error().  While this is a "can't happen" case, it seems
like a good idea to be sure we know what would happen, so reset errno to
ECONNRESET in such cases.  (The corresponding code in libpq's fe-secure.c
already did this.)

3. There was an (undocumented) assumption that client_read_ended() wouldn't
change errno.  While true in the current state of the code, this seems less
than future-proof.  Add explicit saving/restoring of errno to make sure
that changes in the called functions won't break things.

I see no need to back-patch, since #1 is new code and the other two issues
are mostly hypothetical.

Per discussion with Amit Kapila.
2013-11-24 13:09:38 -05:00
Tom Lane 45e02e3232 Fix array slicing of int2vector and oidvector values.
The previous coding labeled expressions such as pg_index.indkey[1:3] as
being of int2vector type; which is not right because the subscript bounds
of such a result don't, in general, satisfy the restrictions of int2vector.
To fix, implicitly promote the result of slicing int2vector to int2[],
or oidvector to oid[].  This is similar to what we've done with domains
over arrays, which is a good analogy because these types are very much
like restricted domains of the corresponding regular-array types.

A side-effect is that we now also forbid array-element updates on such
columns, eg while "update pg_index set indkey[4] = 42" would have worked
before if you were superuser (and corrupted your catalogs irretrievably,
no doubt) it's now disallowed.  This seems like a good thing since, again,
some choices of subscripting would've led to results not satisfying the
restrictions of int2vector.  The case of an array-slice update was
rejected before, though with a different error message than you get now.
We could make these cases work in future if we added a cast from int2[]
to int2vector (with a cast function checking the subscript restrictions)
but it seems unlikely that there's any value in that.

Per report from Ronan Dunklau.  Back-patch to all supported branches
because of the crash risks involved.
2013-11-23 20:03:56 -05:00
Peter Eisentraut b7212c9726 Fix thinko in SPI_execute_plan() calls
Two call sites were apparently thinking that the last argument of
SPI_execute_plan() is the number of query parameters, but it is actually
the row limit.  Change the calls to 0, since we don't care about the
limit there.  The previous code didn't break anything, but it was still
wrong.
2013-11-23 09:34:57 -05:00
Peter Eisentraut 4053189d59 Avoid potential buffer overflow crash
A pointer to a C string was treated as a pointer to a "name" datum and
passed to SPI_execute_plan().  This pointer would then end up being
passed through datumCopy(), which would try to copy the entire 64 bytes
of name data, thus running past the end of the C string.  Fix by
converting the string to a proper name structure.

Found by LLVM AddressSanitizer.
2013-11-23 07:25:37 -05:00
Tom Lane f19e92ed04 Flatten join alias Vars before pulling up targetlist items from a subquery.
pullup_replace_vars()'s decisions about whether a pulled-up replacement
expression needs to be wrapped in a PlaceHolderVar depend on the assumption
that what looks like a Var behaves like a Var.  However, if the Var is a
join alias reference, later flattening of join aliases might replace the
Var with something that's not a Var at all, and should have been wrapped.

To fix, do a forcible pass of flatten_join_alias_vars() on the subquery
targetlist before we start to copy items out of it.  We'll re-run that
processing on the pulled-up expressions later, but that's harmless.

Per report from Ken Tanzer; the added regression test case is based on his
example.  This bug has been there since the PlaceHolderVar mechanism was
invented, but has escaped detection because the circumstances that trigger
it are fairly narrow.  You need a flattenable query underneath an outer
join, which contains another flattenable query inside a join of its own,
with a dangerous expression (a constant or something else non-strict)
in that one's targetlist.

Having seen this, I'm wondering if it wouldn't be prudent to do all
alias-variable flattening earlier, perhaps even in the rewriter.
But that would probably not be a back-patchable change.
2013-11-22 14:37:21 -05:00
Heikki Linnakangas 98f58a30c1 Fix Hot-Standby initialization of clog and subtrans.
These bugs can cause data loss on standbys started with hot_standby=on at
the moment they start to accept read only queries, by marking committed
transactions as uncommited. The likelihood of such corruptions is small
unless the primary has a high transaction rate.

5a031a5556 fixed bugs in HS's startup logic
by maintaining less state until at least STANDBY_SNAPSHOT_PENDING state
was reached, missing the fact that both clog and subtrans are written to
before that. This only failed to fail in common cases because the usage
of ExtendCLOG in procarray.c was superflous since clog extensions are
actually WAL logged.

f44eedc3f0f347a856eea8590730769125964597/I then tried to fix the missing
extensions of pg_subtrans due to the former commit's changes - which are
not WAL logged - by performing the extensions when switching to a state
> STANDBY_INITIALIZED and not performing xid assignments before that -
again missing the fact that ExtendCLOG is unneccessary - but screwed up
twice: Once because latestObservedXid wasn't updated anymore in that
state due to the earlier commit and once by having an off-by-one error in
the loop performing extensions. This means that whenever a
CLOG_XACTS_PER_PAGE (32768 with default settings) boundary was crossed
between the start of the checkpoint recovery started from and the first
xl_running_xact record old transactions commit bits in pg_clog could be
overwritten if they started and committed in that window.

Fix this mess by not performing ExtendCLOG() in HS at all anymore since
it's unneeded and evidently dangerous and by performing subtrans
extensions even before reaching STANDBY_SNAPSHOT_PENDING.

Analysis and patch by Andres Freund. Reported by Christophe Pettus.
Backpatch down to 9.0, like the previous commit that caused this.
2013-11-22 14:45:41 +02:00
Heikki Linnakangas 1a3d104475 Avoid acquiring spinlock when checking if recovery has finished, for speed.
RecoveryIsInProgress() can be called very frequently. During normal
operation, it just checks a backend-local variable and returns quickly,
but during hot standby, it checks a spinlock-protected shared variable.
Those spinlock acquisitions can become a point of contention on a busy
hot standby system.

Replace the spinlock acquisition with a memory barrier.

Per discussion with Andres Freund, Ants Aasma and Merlin Moncure.
2013-11-22 13:07:23 +02:00
Tom Lane 784e762e88 Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.
This patch adds the ability to write TABLE( function1(), function2(), ...)
as a single FROM-clause entry.  The result is the concatenation of the
first row from each function, followed by the second row from each
function, etc; with NULLs inserted if any function produces fewer rows than
others.  This is believed to be a much more useful behavior than what
Postgres currently does with multiple SRFs in a SELECT list.

This syntax also provides a reasonable way to combine use of column
definition lists with WITH ORDINALITY: put the column definition list
inside TABLE(), where it's clear that it doesn't control the ordinality
column as well.

Also implement SQL-compliant multiple-argument UNNEST(), by turning
UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)).

The SQL standard specifies TABLE() with only a single function, not
multiple functions, and it seems to require an implicit UNNEST() which is
not what this patch does.  There may be something wrong with that reading
of the spec, though, because if it's right then the spec's TABLE() is just
a pointless alternative spelling of UNNEST().  After further review of
that, we might choose to adopt a different syntax for what this patch does,
but in any case this functionality seems clearly worthwhile.

Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and
significantly revised by me
2013-11-21 19:37:20 -05:00
Heikki Linnakangas 04eee1fa9e More GIN refactoring.
Split off the portion of ginInsertValue that inserts the tuple to current
level into a separate function, ginPlaceToPage. ginInsertValue's charter
is now to recurse up the tree to insert the downlink, when a page split is
required.

This is in preparation for a patch to change the way incomplete splits are
handled, which will need to do these operations separately. And IMHO makes
the code more readable anyway.
2013-11-20 17:01:33 +02:00
Heikki Linnakangas 501012631e Refactor the internal GIN B-tree interface for forming a downlink.
This creates a new gin-btree callback function for creating a downlink for
a page. Previously, ginxlog.c duplicated the logic used during normal
operation.
2013-11-20 16:57:41 +02:00
Heikki Linnakangas 04965ad40e Further GIN refactoring.
Merge some functions that were always called together. Makes the code
little bit more readable.
2013-11-20 16:09:14 +02:00
Robert Haas f1df4731ee Use cstring_to_text_with_len when length is known.
This avoids a potentially-expensive extra call to strlen().

David Rowley
2013-11-18 10:19:00 -05:00
Heikki Linnakangas 4c697d8f48 Count locked pages that don't need vacuuming as scanned.
Previously, if VACUUM skipped vacuuming a page because it's pinned, it
didn't count that page as scanned. However, that meant that relfrozenxid
was not bumped up either, which prevented anti-wraparound vacuum from
doing its job.

Report by Миша Тюрин, analysis and patch by Sergey Burladyn and Jeff Janes.
Backpatch to 9.2, where the skip-locked-pages behavior was introduced.
2013-11-18 09:51:09 +02:00
Tom Lane f901bb50e3 Add make_date() and make_time() functions.
Pavel Stehule, reviewed by Jeevan Chalke and Atri Sharma
2013-11-17 15:06:50 -05:00
Tom Lane 69c8fbac20 Improve performance of numeric sum(), avg(), stddev(), variance(), etc.
This patch improves performance of most built-in aggregates that formerly
used a NUMERIC or NUMERIC array as their transition type; this includes
not only aggregates on numeric inputs, but some aggregates on integer
inputs where overflow of an int8 value is a possibility.  The code now
uses a special-purpose data structure to avoid array construction and
deconstruction overhead, as well as packing and unpacking overhead for
numeric values.

These aggregates' transition type is now declared as INTERNAL, since
it doesn't correspond to any SQL data type.  To keep the planner from
thinking that that means a lot of storage will be used, we make use
of the just-added pg_aggregate.aggtransspace feature.  The space estimate
is set to 128 bytes, which is at least in the right ballpark.

Hadi Moshayedi, reviewed by Pavel Stehule and Tomas Vondra
2013-11-16 18:46:34 -05:00
Tom Lane 6cb86143e8 Allow aggregates to provide estimates of their transition state data size.
Formerly the planner had a hard-wired rule of thumb for guessing the amount
of space consumed by an aggregate function's transition state data.  This
estimate is critical to deciding whether it's OK to use hash aggregation,
and in many situations the built-in estimate isn't very good.  This patch
adds a column to pg_aggregate wherein a per-aggregate estimate can be
provided, overriding the planner's default, and infrastructure for setting
the column via CREATE AGGREGATE.

It may be that additional smarts will be required in future, perhaps even
a per-aggregate estimation function.  But this is already a step forward.

This is extracted from a larger patch to improve the performance of numeric
and int8 aggregates.  I (tgl) thought it was worth reviewing and committing
this infrastructure separately.  In this commit, all built-in aggregates
are given aggtransspace = 0, so no behavior should change.

Hadi Moshayedi, reviewed by Pavel Stehule and Tomas Vondra
2013-11-16 16:03:40 -05:00
Tom Lane f1f21b2d6f Fix incorrect loop counts in tidbitmap.c.
A couple of places that should have been iterating over WORDS_PER_CHUNK
words were iterating over WORDS_PER_PAGE words instead.  This thinko
accidentally failed to fail, because (at least on common architectures
with default BLCKSZ) WORDS_PER_CHUNK is a bit less than WORDS_PER_PAGE,
and the extra words being looked at were always zero so nothing happened.
Still, it's a bug waiting to happen if anybody ever fools with the
parameters affecting TIDBitmap sizes, and it's a small waste of cycles
too.  So back-patch to all active branches.

Etsuro Fujita
2013-11-15 18:34:14 -05:00
Tom Lane f3b3b8d5be Compute correct em_nullable_relids in get_eclass_for_sort_expr().
Bug #8591 from Claudio Freire demonstrates that get_eclass_for_sort_expr
must be able to compute valid em_nullable_relids for any new equivalence
class members it creates.  I'd worried about this in the commit message
for db9f0e1d9a, but claimed that it wasn't a
problem because multi-member ECs should already exist when it runs.  That
is transparently wrong, though, because this function is also called by
initialize_mergeclause_eclasses, which runs during deconstruct_jointree.
The example given in the bug report (which the new regression test item
is based upon) fails because the COALESCE() expression is first seen by
initialize_mergeclause_eclasses rather than process_equivalence.

Fixing this requires passing the appropriate nullable_relids set to
get_eclass_for_sort_expr, and it requires new code to compute that set
for top-level expressions such as ORDER BY, GROUP BY, etc.  We store
the top-level nullable_relids in a new field in PlannerInfo to avoid
computing it many times.  In the back branches, I've added the new
field at the end of the struct to minimize ABI breakage for planner
plugins.  There doesn't seem to be a good alternative to changing
get_eclass_for_sort_expr's API signature, though.  There probably aren't
any third-party extensions calling that function directly; moreover,
if there are, they probably need to think about what to pass for
nullable_relids anyway.

Back-patch to 9.2, like the previous patch in this area.
2013-11-15 16:46:18 -05:00
Tom Lane 80e3a470ba Minor comment corrections for sequence hashtable patch.
There were enough typos in the comments to annoy me ...
2013-11-15 12:17:12 -05:00
Heikki Linnakangas 5cb719beee Fix bogus hash table creation.
Andres Freund
2013-11-15 14:23:40 +02:00
Heikki Linnakangas 21025d4a53 Use a hash table to store current sequence values.
This speeds up nextval() and currval(), when you touch a lot of different
sequences in the same backend.

David Rowley
2013-11-15 12:29:38 +02:00
Robert Haas c46c803f8a Fix relfilenodemap.c's handling of cache invalidations.
The old code entered a new hash table entry first, then scanned
pg_class to determine what value to fill in, and then populated the
entry.  This fails to work properly if a cache invalidation happens
as a result of opening pg_class.  Repair.

Along the way, get rid of the idea of blowing away the entire hash
table as a method of processing invalidations.  Instead, just delete
all the entries one by one.  This is probably not quite as cheap but
it's simpler, and shouldn't happen often.

Andres Freund
2013-11-13 10:52:59 -05:00
Heikki Linnakangas 07fca603b5 Fix bug in GIN posting tree root creation.
The root page is filled with as many items as fit, and the rest are inserted
using normal insertions. However, I fumbled the variable names, and the code
actually memcpy'd all the items on the page, overflowing the buffer. While
at it, rename the variable to make the distinction more clear.

Reported by Teodor Sigaev. This bug was introduced by my recent
refactorings, so no backpatching required.
2013-11-13 13:47:59 +02:00
Peter Eisentraut aa04b323c3 Move variable closer to where it is used
This avoids an unused variable warning on Windows when building without
asserts

From: David Rowley <dgrowleyml@gmail.com>
2013-11-13 06:26:27 -05:00
Tom Lane ebefbb5fde Fix failure with whole-row reference to a subquery.
Simple oversight in commit 1cb108efb0 ---
recursively examining a subquery output column is only sane if the
original Var refers to a single output column.  Found by Kevin Grittner.
2013-11-11 16:36:27 -05:00
Tom Lane 0b7e660d6c Fix ruleutils pretty-printing to not generate trailing whitespace.
The pretty-printing logic in ruleutils.c operates by inserting a newline
and some indentation whitespace into strings that are already valid SQL.
This naturally results in leaving some trailing whitespace before the
newline in many cases; which can be annoying when processing the output
with other tools, as complained of by Joe Abbate.  We can fix that in
a pretty localized fashion by deleting any trailing whitespace before
we append a pretty-printing newline.  In addition, we have to modify the
code inserted by commit 2f582f76b1 so that
we also delete trailing whitespace when transposing items from temporary
buffers into the main result string, when a temporary item starts with a
newline.

This results in rather voluminous changes to the regression test results,
but it's easily verified that they are only removal of trailing whitespace.

Back-patch to 9.3, because the aforementioned commit resulted in many
more cases of trailing whitespace than had occurred in earlier branches.
2013-11-11 13:36:38 -05:00
Tom Lane 648bd05b13 Re-allow duplicate aliases within aliased JOINs.
Although the SQL spec forbids duplicate table aliases, historically
we've allowed queries like
    SELECT ... FROM tab1 x CROSS JOIN (tab2 x CROSS JOIN tab3 y) z
on the grounds that the aliased join (z) hides the aliases within it,
therefore there is no conflict between the two RTEs named "x".  The
LATERAL patch broke this, on the misguided basis that "x" could be
ambiguous if tab3 were a LATERAL subquery.  To avoid breaking existing
queries, it's better to allow this situation and complain only if
tab3 actually does contain an ambiguous reference.  We need only remove
the check that was throwing an error, because the column lookup code
is already prepared to handle ambiguous references.  Per bug #8444.
2013-11-11 10:42:57 -05:00
Peter Eisentraut 001e114b8d Fix whitespace issues found by git diff --check, add gitattributes
Set per file type attributes in .gitattributes to fine-tune whitespace
checks.  With the associated cleanups, the tree is now clean for git
2013-11-10 14:48:29 -05:00
Heikki Linnakangas ac4ab97ec0 Fix race condition in GIN posting tree page deletion.
If a page is deleted, and reused for something else, just as a search is
following a rightlink to it from its left sibling, the search would continue
scanning whatever the new contents of the page are. That could lead to
incorrect query results, or even something more curious if the page is
reused for a different kind of a page.

To fix, modify the search algorithm to lock the next page before releasing
the previous one, and refrain from deleting pages from the leftmost branch
of the tree.

Add a new Concurrency section to the README, explaining why this works.
There is a lot more one could say about concurrency in GIN, but that's for
another patch.

Backpatch to all supported versions.
2013-11-08 22:21:42 +02:00
Robert Haas 07cacba983 Add the notion of REPLICA IDENTITY for a table.
Pending patches for logical replication will use this to determine
which columns of a tuple ought to be considered as its candidate key.

Andres Freund, with minor, mostly cosmetic adjustments by me
2013-11-08 12:30:43 -05:00
Tom Lane b97ee66cc1 Make contain_volatile_functions/contain_mutable_functions look into SubLinks.
This change prevents us from doing inappropriate subquery flattening in
cases such as dangerous functions hidden inside a sub-SELECT in the
targetlist of another sub-SELECT.  That could result in unexpected behavior
due to multiple evaluations of a volatile function, as in a recent
complaint from Etienne Dube.  It's been questionable from the very
beginning whether these functions should look into subqueries (as noted in
their comments), and this case seems to provide proof that they should.

Because the new code only descends into SubLinks, not SubPlans or
InitPlans, the change only affects the planner's behavior during
prepjointree processing and not later on --- for example, you can still get
it to use a volatile function in an indexqual if you wrap the function in
(SELECT ...).  That's a historical behavior, for sure, but it's reasonable
given that the executor's evaluation rules for subplans don't depend on
whether there are volatile functions inside them.  In any case, we need to
constrain the behavioral change as narrowly as we can to make this
reasonable to back-patch.
2013-11-08 11:36:57 -05:00
Tom Lane 060b22a99a Fix subtly-wrong volatility checking in BeginCopyFrom().
contain_volatile_functions() is best applied to the output of
expression_planner(), not its input, so that insertion of function
default arguments and constant-folding have been done.  (See comments
at CheckMutability, for instance.)  It's perhaps unlikely that anyone
will notice a difference in practice, but still we should do it properly.

In passing, change variable type from Node* to Expr* to reduce the net
number of casts needed.

Noted while perusing uses of contain_volatile_functions().
2013-11-08 08:59:39 -05:00
Tom Lane 20803d7881 Make LOCK_PRINT & PROCLOCK_PRINT expand to ((void) 0) when not in use.
This avoids warnings from more-anal-than-average compilers, and might
prevent hidden syntax problems in the future.

Andres Freund
2013-11-07 19:07:48 -05:00
Kevin Grittner b64b5ccb6a Silence benign warnings from clang version 3.0-6ubuntu3. 2013-11-07 16:35:43 -06:00
Tom Lane c28b289bf3 Prevent display of dropped columns in row constraint violation messages.
ExecBuildSlotValueDescription() printed "null" for each dropped column in
a row being complained of by ExecConstraints().  This has some sanity in
terms of the underlying implementation, but is of course pretty surprising
to users.  To fix, we must pass the target relation's descriptor to
ExecBuildSlotValueDescription(), because the slot descriptor it had been
using doesn't get labeled with attisdropped markers.

Per bug #8408 from Maxim Boguk.  Back-patch to 9.2 where the feature of
printing row values in NOT NULL and CHECK constraint violation messages
was introduced.

Michael Paquier and Tom Lane
2013-11-07 14:41:36 -05:00
Tom Lane 5e900bc00f Fix generation of MergeAppend plans for optimized min/max on expressions.
Before jamming a desired targetlist into a plan node, one really ought to
make sure the plan node can handle projections, and insert a buffering
Result plan node if not.  planagg.c forgot to do this, which is a hangover
from the days when it only dealt with IndexScan plan types.  MergeAppend
doesn't project though, not to mention that it gets unhappy if you remove
its possibly-resjunk sort columns.  The code accidentally failed to fail
for cases in which the min/max argument was a simple Var, because the new
targetlist would be equivalent to the original "flat" tlist anyway.
For any more complex case, it's been broken since 9.1 where we introduced
the ability to optimize min/max using MergeAppend, as reported by Raphael
Bauduin.  Fix by duplicating the logic from grouping_planner that decides
whether we need a Result node.

In 9.2 and 9.1, this requires back-porting the tlist_same_exprs() function
introduced in commit 4387cf956b, else we'd
uselessly add a Result node in cases that worked before.  It's rather
tempting to back-patch that whole commit so that we can avoid extra Result
nodes in mainline cases too; but I'll refrain, since that code hasn't
really seen all that much field testing yet.
2013-11-07 13:14:14 -05:00
Heikki Linnakangas fde7172d93 Fix setting of right bound at GIN page split.
Broken by my refactoring.
2013-11-07 19:45:07 +02:00
Tom Lane 8dace66e07 Add #ifdef guards for some POSIX error symbols that Windows doesn't like.
Per buildfarm results.  It looks like the older the Windows version, the
more errno codes it hasn't got ...
2013-11-06 20:22:42 -05:00
Tom Lane 8e68816cc2 Be more robust when strerror() doesn't give a useful result.
glibc, at least, is capable of returning "???" instead of anything useful
if it doesn't like the setting of LC_CTYPE.  If this happens, or in the
previously-known case of strerror() returning an empty string, try to
print the C macro name for the error code ("EACCES" etc).  Only if we
don't have the error code in our compiled-in list of popular error codes
(which covers most though not quite all of what's called out in the POSIX
spec) will we fall back to printing a numeric error code.  This should
simplify debugging.

Note that this functionality is currently only provided for %m in backend
ereport/elog messages.  That may be sufficient, since we don't fool with the
locale environment in frontend clients, but it's foreseeable that we might
want similar code in libpq for instance.

There was some talk of back-patching this, but let's see how the buildfarm
likes it first.  It seems likely that at least some of the POSIX-defined
error code symbols don't exist on all platforms.  I don't want to clutter
the entire list with #ifdefs, but we may need more than are here now.

MauMau, edited by me
2013-11-06 15:50:17 -05:00
Tom Lane bb45c64041 Support default arguments and named-argument notation for window functions.
These things didn't work because the planner omitted to do the necessary
preprocessing of a WindowFunc's argument list.  Add the few dozen lines
of code needed to handle that.

Although this sounds like a feature addition, it's really a bug fix because
the default-argument case was likely to crash previously, due to lack of
checking of the number of supplied arguments in the built-in window
functions.  It's not a security issue because there's no way for a
non-superuser to create a window function definition with defaults that
refers to a built-in C function, but nonetheless people might be annoyed
that it crashes rather than producing a useful error message.  So
back-patch as far as the patch applies easily, which turns out to be 9.2.
I'll put a band-aid in earlier versions as a separate patch.

(Note that these features still don't work for aggregates, and fixing that
case will be harder since we represent aggregate arg lists as target lists
not bare expression lists.  There's no crash risk though because CREATE
AGGREGATE doesn't accept defaults, and we reject named-argument notation
when parsing an aggregate call.)
2013-11-06 13:33:09 -05:00
Kevin Grittner 5829082a57 Keep heap open until new heap generated in RMV.
Early close became apparent when invalidation messages were
processed in a new location under CLOBBER_CACHE_ALWAYS builds, due
to additional locking.

Back-patch to 9.3
2013-11-06 12:27:52 -06:00
Heikki Linnakangas 0ea53256a8 Fix missing argument and function prototypes.
Not sure how I missed these in previous commit.
2013-11-06 11:22:58 +02:00
Heikki Linnakangas ecaa4708e5 Misc GIN refactoring.
Merge the isEnoughSpace and placeToPage functions in the b-tree interface
into one function that tries to put a tuple on page, and returns false if
it doesn't fit.

Move createPostingTree function to gindatapage.c, and change its contract
so that it can be passed more items than fit on the root page. It's in a
better position than the callers to know how many items fit.

Move ginMergeItemPointers out of gindatapage.c, into a separate file.

These changes make no difference now, but reduce the footprint of Alexander
Korotkov's upcoming patch to pack item pointers more tightly.
2013-11-06 10:32:09 +02:00
Tom Lane 920c8261d5 Improve the error message given for modifying a window with frame clause.
For rather inscrutable reasons, SQL:2008 disallows copying-and-modifying a
window definition that has any explicit framing clause.  The error message
we gave for this only made sense if the referencing window definition
itself contains an explicit framing clause, which it might well not.
Moreover, in the context of an OVER clause it's not exactly obvious that
"OVER (windowname)" implies copy-and-modify while "OVER windowname" does
not.  This has led to multiple complaints, eg bug #5199 from Iliya
Krapchatov.  Change to a hopefully more intelligible error message, and
in the case where we have just "OVER (windowname)", add a HINT suggesting
that omitting the parentheses will fix it.  Also improve the related
documentation.  Back-patch to all supported branches.
2013-11-05 21:58:08 -05:00
Kevin Grittner 2636ecf78b Lock relation used to generate fresh data for RMV.
The relation should not be accessible to any other process, but it
should be locked for consistency.  Since this is not known to
cause any bug, it will not be back-patch, at least for now.

Per report from Andres Freund
2013-11-05 15:36:33 -06:00
Tom Lane 6331de1d44 Fix some obsolete information in src/backend/optimizer/README.
Constant quals aren't handled the same way they used to be.  Also,
add mention of a couple more major steps in grouping_planner.
Per complaint a couple months back from Etsuro Fujita.
2013-11-05 11:31:35 -05:00
Kevin Grittner 732758db4c Fix breakage of MV column name list usage.
Per bug report from Tomonari Katsumata.

Back-patch to 9.3.
2013-11-04 14:31:07 -06:00
Robert Haas dddc34408a Fix format code used to print dsm request sizes.
Per report from Peter Eisentraut.
2013-11-04 11:22:03 -05:00
Tom Lane e36ce0c7f7 Get rid of more cases of the "must detoast before output function" meme.
I missed that json.c was doing this too, because for some bizarre reason
it wasn't doing it adjacent to the output function call.
2013-11-03 11:55:37 -05:00
Tom Lane b006f4ddb9 Prevent memory leaks from accumulating across printtup() calls.
Historically, printtup() has assumed that it could prevent memory leakage
by pfree'ing the string result of each output function and manually
managing detoasting of toasted values.  This amounts to assuming that
datatype output functions never leak any memory internally; an assumption
we've already decided to be bogus elsewhere, for example in COPY OUT.
range_out in particular is known to leak multiple kilobytes per call, as
noted in bug #8573 from Godfried Vanluffelen.  While we could go in and fix
that leak, it wouldn't be very notationally convenient, and in any case
there have been and undoubtedly will again be other leaks in other output
functions.  So what seems like the best solution is to run the output
functions in a temporary memory context that can be reset after each row,
as we're doing in COPY OUT.  Some quick experimentation suggests this is
actually a tad faster than the retail pfree's anyway.

This patch fixes all the variants of printtup, except for debugtup()
which is used in standalone mode.  It doesn't seem worth worrying
about query-lifespan leaks in standalone mode, and fixing that case
would be a bit tedious since debugtup() doesn't currently have any
startup or shutdown functions.

While at it, remove manual detoast management from several other
output-function call sites that had copied it from printtup().  This
doesn't make a lot of difference right now, but in view of recent
discussions about supporting "non-flattened" Datums, we're going to
want that code gone eventually anyway.

Back-patch to 9.2 where range_out was introduced.  We might eventually
decide to back-patch this further, but in the absence of known major
leaks in older output functions, I'll refrain for now.
2013-11-03 11:33:05 -05:00
Kevin Grittner 2a781d57dc Acquire appropriate locks when rewriting during RMV.
Since the query has not been freshly parsed when executing REFRESH
MATERIALIZED VIEW, locks must be explicitly taken before rewrite.

Backpatch to 9.3.

Andres Freund
2013-11-02 19:18:08 -05:00
Kevin Grittner be420fa02e Fix subquery reference to non-populated MV in CMV.
A subquery reference to a matview should be allowed by CREATE
MATERIALIZED VIEW WITH NO DATA, just like a direct reference is.

Per bug report from Laurent Sartran.

Backpatch to 9.3.
2013-11-02 18:38:17 -05:00
Tom Lane 24ace4053d Retry after buffer locking failure during SPGiST index creation.
The original coding thought this case was impossible, but it can happen
if the bgwriter or checkpointer processes decide to write out an index
page while creation is still proceeding, leading to a bogus "unexpected
spgdoinsert() failure" error.  Problem reported by Jonathan S. Katz.

Teodor Sigaev
2013-11-02 16:45:42 -04:00
Tom Lane bffd1ce92c Ensure all files created for a single BufFile have the same resource owner.
Callers expect that they only have to set the right resource owner when
creating a BufFile, not during subsequent operations on it.  While we could
insist this be fixed at the caller level, it seems more sensible for the
BufFile to take care of it.  Without this, some temp files belonging to
a BufFile can go away too soon, eg at the end of a subtransaction,
leading to errors or crashes.

Reported and fixed by Andres Freund.  Back-patch to all active branches.
2013-11-01 16:09:48 -04:00
Tom Lane 45f64f1bbf Remove CTimeZone/HasCTZSet, root and branch.
These variables no longer have any useful purpose, since there's no reason
to special-case brute force timezones now that we have a valid
session_timezone setting for them.  Remove the variables, and remove the
SET/SHOW TIME ZONE code that deals with them.

The user-visible impact of this is that SHOW TIME ZONE will now show a
POSIX-style zone specification, in the form "<+-offset>-+offset", rather
than an interval value when a brute-force zone has been set.  While perhaps
less intuitive, this is a better definition than before because it's
actually possible to give that string back to SET TIME ZONE and get the
same behavior, unlike what used to happen.

We did not previously mention the angle-bracket syntax when describing
POSIX timezone specifications; add some documentation so that people
can figure out what these strings do.  (There's still quite a lot of
undocumented functionality there, but anybody who really cares can
go read the POSIX spec to find out about it.  In practice most people
seem to prefer Olsen-style city names anyway.)
2013-11-01 13:57:31 -04:00
Tom Lane 1c8a7f617f Remove internal uses of CTimeZone/HasCTZSet.
The only remaining places where we actually look at CTimeZone/HasCTZSet
are abstime2tm() and timestamp2tm().  Now that session_timezone is always
valid, we can remove these special cases.  The caller-visible impact of
this is that these functions now always return a valid zone abbreviation
if requested, whereas before they'd return a NULL pointer if a brute-force
timezone was in use.  In the existing code, the only place I can find that
changes behavior is to_char(), whose TZ format code will now print
something useful rather than nothing for such zones.  (In the places where
the returned zone abbreviation is passed to EncodeDateTime, the lack of
visible change is because we've chosen the abbreviation used for these
zones to match what EncodeTimezone would have printed.)

It's likely that there is now a fair amount of removable dead code around
the call sites, namely anything that's meant to cope with getting a NULL
timezone abbreviation, but I've not made an effort to root that out.

This could be back-patched if we decide we'd like to fix to_char()'s
behavior in the back branches, but there doesn't seem to be much
enthusiasm for that at present.
2013-11-01 12:51:27 -04:00
Tom Lane 631dc390f4 Fix some odd behaviors when using a SQL-style simple GMT offset timezone.
Formerly, when using a SQL-spec timezone setting with a fixed GMT offset
(called a "brute force" timezone in the code), the session_timezone
variable was not updated to match the nominal timezone; rather, all code
was expected to ignore session_timezone if HasCTZSet was true.  This is
of course obviously fragile, though a search of the code finds only
timeofday() failing to honor the rule.  A bigger problem was that
DetermineTimeZoneOffset() supposed that if its pg_tz parameter was
pointer-equal to session_timezone, then HasCTZSet should override the
parameter.  This would cause datetime input containing an explicit zone
name to be treated as referencing the brute-force zone instead, if the
zone name happened to match the session timezone that had prevailed
before installing the brute-force zone setting (as reported in bug #8572).
The same malady could affect AT TIME ZONE operators.

To fix, set up session_timezone so that it matches the brute-force zone
specification, which we can do using the POSIX timezone definition syntax
"<abbrev>offset", and get rid of the bogus lookaside check in
DetermineTimeZoneOffset().  Aside from fixing the erroneous behavior in
datetime parsing and AT TIME ZONE, this will cause the timeofday() function
to print its result in the user-requested time zone rather than some
previously-set zone.  It might also affect results in third-party
extensions, if there are any that make use of session_timezone without
considering HasCTZSet, but in all cases the new behavior should be saner
than before.

Back-patch to all supported branches.
2013-11-01 12:13:18 -04:00
Robert Haas cacbdd7810 Use appendStringInfoString instead of appendStringInfo where possible.
This shaves a few cycles, and generally seems like good programming
practice.

David Rowley
2013-10-31 10:55:59 -04:00
Robert Haas 343bb134ea Avoid too-large shift on 32-bit Windows.
Apparently, shifts greater than or equal to the width of the type
are undefined, and can surprisingly produce a non-zero value.

Amit Kapila, with a comment by me.
2013-10-30 09:14:56 -04:00
Tom Lane 9a9473f3cc Prevent using strncpy with src == dest in TupleDescInitEntry.
The C and POSIX standards state that strncpy's behavior is undefined when
source and destination areas overlap.  While it remains dubious whether any
implementations really misbehave when the pointers are exactly equal, some
platforms are now starting to force the issue by complaining when an
undefined call occurs.  (In particular OS X 10.9 has been seen to dump core
here, though the exact set of circumstances needed to trigger that remain
elusive.  Similar behavior can be expected to be optional on Linux and
other platforms in the near future.)  So tweak the code to explicitly do
nothing when nothing need be done.

Back-patch to all active branches.  In HEAD, this also lets us get rid of
an exception in valgrind.supp.

Per discussion of a report from Matthias Schmitt.
2013-10-28 20:49:24 -04:00
Robert Haas d2aecaea15 Modify dynamic shared memory code to use Size rather than uint64.
This is more consistent with what we do elsewhere.
2013-10-28 12:12:06 -04:00
Tom Lane c2b51cf190 Improve documentation about usage of FDW validator functions.
SGML documentation, as well as code comments, failed to note that an FDW's
validator will be applied to foreign-table options for foreign tables using
the FDW.

Etsuro Fujita
2013-10-28 10:28:35 -04:00
Noah Misch c50b7c09d8 Add large object functions catering to SQL callers.
With these, one need no longer manipulate large object descriptors and
extract numeric constants from header files in order to read and write
large object contents from SQL.

Pavel Stehule, reviewed by Rushabh Lathia.
2013-10-27 22:56:54 -04:00
Tom Lane 43fe90f66a Suppress -0 in the C field of lines computed by line_construct_pts().
It's not entirely clear why some PPC machines are generating -0 here, since
the underlying computation should be exactly 0 - 0.  Perhaps there's some
wider-than-nominal-precision calculations happening?  Anyway, the best way
to avoid platform-dependent results seems to be to explicitly reset -0 to
regular zero.
2013-10-25 15:55:15 -04:00
Tom Lane 3147acd63e Use improved vsnprintf calling logic in more places.
When we are using a C99-compliant vsnprintf implementation (which should be
most places, these days) it is worth the trouble to make use of its report
of how large the buffer needs to be to succeed.  This patch adjusts
stringinfo.c and some miscellaneous usages in pg_dump to do that, relying
on the logic recently added in libpgcommon's psprintf.c.  Since these
places want to know the number of bytes written once we succeed, modify the
API of pvsnprintf() to report that.

There remains near-duplicate logic in pqexpbuffer.c, but since that code
is in libpq, psprintf.c's approach of exit()-on-error isn't appropriate
for use there.  Also note that I didn't bother touching the multitude
of places that call (v)snprintf without any attempt to provide a resizable
buffer.

Release-note-worthy incompatibility: the API of appendStringInfoVA()
changed.  If there's any third-party code that's calling that directly,
it will need tweaking along the same lines as in this patch.

David Rowley and Tom Lane
2013-10-24 21:43:57 -04:00
Heikki Linnakangas 98c50656ca Increase the number of different values used when seeding random().
When a backend process is forked, we initialize the system's random number
generator with srandom(). The seed used is derived from the backend's pid
and the timestamp. However, we only used the microseconds part of the
timestamp, and it was XORed with the pid, so the total range of different
seed values chosen was 0-999999. That's quite limited.

Change the code to also use the seconds part of the timestamp in the seed,
and shift the microseconds so that all 32 bits of the seed are used.

Honza Horak
2013-10-24 17:00:18 +03:00
Heikki Linnakangas 138184adc5 Plug memory leak when reloading config file.
The absolute path to config file was not pfreed. There are probably more
small leaks here and there in the config file reload code and assign hooks,
and in practice no-one reloads the config files frequently enough for it to
be a problem, but this one is trivial enough that might as well fix it.

Backpatch to 9.3 where the leak was introduced.
2013-10-24 15:27:40 +03:00
Heikki Linnakangas bb598456dc Fix memory leak when an empty ident file is reloaded.
Hari Babu
2013-10-24 14:03:26 +03:00
Heikki Linnakangas 4d6d425ab8 Fix typos in comments. 2013-10-24 11:50:02 +03:00
Heikki Linnakangas 83eb54001c Fix two bugs in setting the vm bit of empty pages.
Use a critical section when setting the all-visible flag on an empty page,
and WAL-logging it. log_newpage_buffer() contains an assertion that it
must be called inside a critical section, and it's the right thing to do
when modifying a buffer anyway.

Also, the page should be marked dirty before calling log_newpage_buffer(),
per the comment in log_newpage_buffer() and src/backend/access/transam/README.

Patch by Andres Freund, in response to my report. Backpatch to 9.2, like
the patch that introduced these bugs (a6370fd9).
2013-10-23 14:24:37 +03:00
Tom Lane 5f1ab46101 Suppress a couple of compiler warnings seen with older gcc versions.
To wit,
bgworker.c: In function `RegisterDynamicBackgroundWorker':
bgworker.c:761: warning: `generation' might be used uninitialized in this function
dsm_impl.c: In function `dsm_impl_op':
dsm_impl.c:197: warning: control reaches end of non-void function

Neither of these represent actual bugs, but we may as well tweak the code
so that more compilers can tell that.  This won't change the generated code
on compilers that do recognize that the cases are unreachable.
2013-10-22 21:31:57 -04:00
Tom Lane 2c66f9924c Replace pg_asprintf() with psprintf().
This eliminates an awkward coding pattern that's also unnecessarily
inconsistent with backend coding.  psprintf() is now the thing to
use everywhere.
2013-10-22 19:40:26 -04:00
Tom Lane 09a89cb5fc Get rid of use of asprintf() in favor of a more portable implementation.
asprintf(), aside from not being particularly portable, has a fundamentally
badly-designed API; the psprintf() function that was added in passing in
the previous patch has a much better API choice.  Moreover, the NetBSD
implementation that was borrowed for the previous patch doesn't work with
non-C99-compliant vsnprintf, which is something we still have to cope with
on some platforms; and it depends on va_copy which isn't all that portable
either.  Get rid of that code in favor of an implementation similar to what
we've used for many years in stringinfo.c.  Also, move it into libpgcommon
since it's not really libpgport material.

I think this patch will be enough to turn the buildfarm green again, but
there's still cosmetic work left to do, namely get rid of pg_asprintf()
in favor of using psprintf().  That will come in a followon patch.
2013-10-22 18:42:13 -04:00
Peter Eisentraut 586a8fc75b Make use of psprintf() in recent changes 2013-10-22 07:04:41 -04:00
Tom Lane 2885881147 Fix blatantly broken record_image_cmp() logic for pass-by-value fields.
Doesn't anybody here pay attention to compiler warnings?
2013-10-22 00:38:53 -04:00
Noah Misch 709170b790 Consistently use unsigned arithmetic for alignment calculations.
This avoids an assumption about the signed number representation.  It is
anticipated to have no functional changes on supported configurations;
many two's complement assumptions remain elsewhere.

Per a suggestion from Andres Freund.
2013-10-20 21:04:52 -04:00
Peter Eisentraut 713a9f210d Add libpgcommon to backend gettext source files
This ought to have been done when libpgcommon was split off from
libpgport.
2013-10-19 13:49:05 -04:00
Robert Haas cab5dc5daf Allow only some columns of a view to be auto-updateable.
Previously, unless all columns were auto-updateable, we wouldn't
inserts, updates, or deletes, or at least not without a rule or trigger;
now, we'll allow inserts and updates that target only the auto-updateable
columns, and deletes even if there are no auto-updateable columns at
all provided the view definition is otherwise suitable.

Dean Rasheed, reviewed by Marko Tiikkaja
2013-10-18 10:35:36 -04:00
Robert Haas 523beaa11b Provide a reliable mechanism for terminating a background worker.
Although previously-introduced APIs allow the process that registers a
background worker to obtain the worker's PID, there's no way to prevent
a worker that is not currently running from being restarted.  This
patch introduces a new API TerminateBackgroundWorker() that prevents
the background worker from being restarted, terminates it if it is
currently running, and causes it to be unregistered if or when it is
not running.

Patch by me.  Review by Michael Paquier and KaiGai Kohei.
2013-10-18 10:23:11 -04:00
Robert Haas ea91a6be89 Remove IRIX port.
Development of IRIX has been discontinued, and support is scheduled
to end in December of 2013.  Therefore, there will be no supported
versions of this operating system by the time PostgreSQL 9.4 is
released.  Furthermore, we have no maintainer for this platform.
2013-10-18 08:14:21 -04:00
Robert Haas 81051a86bc Remove spinlock support for SINIX, Sun3, and NS32K.
All of these platforms are very much obsolete.

As far as I can determine, the last version of SINIX, later renamed
Reliant, occurred some time between 2002 and 2005.

The last release of SunOS that would run on a sun3 was released in
November of 1991; the last release of OpenBSD which supported that
platform was in 2001.  The highest clock speed of any processor in
the family was 25MHz.

The NS32K (national semiconductor 320xx) architecture was retired
in 1990.

Support can be re-added if a maintainer emerges for any of these
platforms, but it seems unlikely.

Reviewed by Andres Freund.
2013-10-17 12:02:05 -04:00
Alvaro Herrera 86029b31e5 Silence compiler warning when SSL not in use
Per Jaime Casanova and Vik Fearing
2013-10-17 11:28:50 -03:00
Bruce Momjian 7778ddc7a2 Allow 5+ digit years for non-ISO timestamp/date strings, where appropriate
Report from Haribabu Kommi
2013-10-16 13:22:55 -04:00
Robert Haas e515861367 In dsm_impl_windows, don't error out when the segment already exists.
This is the behavior of the other implementations, and the behavior
expected by the callers of this function.

Amit Kapila
2013-10-14 11:48:49 -04:00
Robert Haas 05a0283e7a Fix details missed by dynamic shared memory patch.
Additional documentation update, and a comment fix.

Both issues reported by Amit Kapila.
2013-10-14 08:00:26 -04:00
Peter Eisentraut 5b6d08cd29 Add use of asprintf()
Add asprintf(), pg_asprintf(), and psprintf() to simplify string
allocation and composition.  Replacement implementations taken from
NetBSD.

Reviewed-by: Álvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Asif Naeem <anaeem.it@gmail.com>
2013-10-13 00:09:18 -04:00
Kevin Grittner 4cbb646334 Fix several possibly non-portable gaffs in record_image_ops.
Sparc machines in the buildfarm were made happy by the previous
fix, but PowerPC machines still are still failing.  Hopefully this
will cure that.
2013-10-11 13:02:52 -05:00
Alvaro Herrera ada01014d4 Use $(PERL) to invoke duplicate_oids
Per buildfarm failure reported by smilodon
2013-10-10 23:45:38 -03:00
Alvaro Herrera 31cf1a1a43 Rework SSL renegotiation code
The existing renegotiation code was home for several bugs: it might
erroneously report that renegotiation had failed; it might try to
execute another renegotiation while the previous one was pending; it
failed to terminate the connection if the renegotiation never actually
took place; if a renegotiation was started, the byte count was reset,
even if the renegotiation wasn't completed (this isn't good from a
security perspective because it means continuing to use a session that
should be considered compromised due to volume of data transferred.)

The new code is structured to avoid these pitfalls: renegotiation is
started a little earlier than the limit has expired; the handshake
sequence is retried until it has actually returned successfully, and no
more than that, but if it fails too many times, the connection is
closed.  The byte count is reset only when the renegotiation has
succeeded, and if the renegotiation byte count limit expires, the
connection is terminated.

This commit only touches the master branch, because some of the changes
are controversial.  If everything goes well, a back-patch might be
considered.

Per discussion started by message
20130710212017.GB4941@eldon.alvh.no-ip.org
2013-10-10 23:45:20 -03:00
Peter Eisentraut 5dd41f3574 Remove maintainer-check target, fold into normal build
make maintainer-check was obscure and rarely called in practice, and
many breakages were missed.  Fold everything that make maintainer-check
used to do into the normal build.  Specifically:

- Call duplicate_oids when genbki.pl is called.

- Check for tabs in SGML files when the documentation is built.

- Run msgfmt with the -c option during the regular build.  Add an
  additional configure check to see whether we are using the GNU
  version.  (make maintainer-check probably used to fail with non-GNU
  msgfmt.)

Keep maintainer-check as around as phony target for the time being in
case anyone is calling it.  But it won't do anything anymore.
2013-10-10 20:11:56 -04:00
Kevin Grittner 15e46fd1dd Fix bug in record_image_ops on big endian machines.
The buildfarm pointed out the problem.

Fix based on suggestion by Robert Haas.
2013-10-10 11:25:30 -05:00
Andrew Dunstan 4d212bac17 json_typeof function.
Andrew Tipton.
2013-10-10 12:21:59 -04:00
Robert Haas 4b7b9a7904 Fix incorrect use of shm_unlink where unlink should be used.
Per buildfarm.
2013-10-10 10:57:10 -04:00
Peter Eisentraut 261c7d4b65 Revive line type
Change the input/output format to {A,B,C}, to match the internal
representation.

Complete the implementations of line_in, line_out, line_recv, line_send.
Remove comments and error messages about the line type not being
implemented.  Add regression tests for existing line operators and
functions.

Reviewed-by: rui hua <365507506hua@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com>
2013-10-09 22:34:38 -04:00
Robert Haas 0ac5e5a7e1 Allow dynamic allocation of shared memory segments.
Patch by myself and Amit Kapila.  Design help from Noah Misch.  Review
by Andres Freund.
2013-10-09 21:05:02 -04:00
Kevin Grittner f566515192 Add record_image_ops opclass for matview concurrent refresh.
REFRESH MATERIALIZED VIEW CONCURRENTLY was broken for any matview
containing a column of a type without a default btree operator
class.  It also did not produce results consistent with a non-
concurrent REFRESH or a normal view if any column was of a type
which allowed user-visible differences between values which
compared as equal according to the type's default btree opclass.
Concurrent matview refresh was modified to use the new operators
to solve these problems.

Documentation was added for record comparison, both for the
default btree operator class for record, and the newly added
operators.  Regression tests now check for proper behavior both
for a matview with a box column and a matview containing a citext
column.

Reviewed by Steve Singer, who suggested some of the doc language.
2013-10-09 14:26:09 -05:00
Bruce Momjian 0c6b675076 Centralize effective_cache_size default setting 2013-10-09 08:33:12 -04:00
Bruce Momjian 96dfa6ec0d Adjust the effective_cache_size default for standalone backends 2013-10-08 23:53:39 -04:00
Bruce Momjian 6b82f78ff9 Again move function where we set effective_cache_size's default 2013-10-08 23:12:45 -04:00
Bruce Momjian cbafd6618a Move new effective_cache_size function
Previously set_default_effective_cache_size() could not handle fork,
non-fork, and bootstrap cases.
2013-10-08 22:41:23 -04:00
Bruce Momjian bf46524b31 Fix C comment in check_effective_cache_size() 2013-10-08 19:25:26 -04:00
Bruce Momjian 6648775028 Update postgres.conf.sample for effective_cache_size's new default 2013-10-08 12:50:05 -04:00
Bruce Momjian ee1e5662d8 Auto-tune effective_cache size to be 4x shared buffers 2013-10-08 12:12:24 -04:00
Heikki Linnakangas 5962519b36 TYPEALIGN doesn't work on int64 on 32-bit platforms.
The TYPEALIGN macro, and the related ones like MAXALIGN, don't work with
values larger than intptr_t, because TYPEALIGN casts the argument to
intptr_t to do the arithmetic. That's not a problem when dealing with
pointers or lengths or offsets related to pointers, but the XLogInsert
scaling patch added a call to MAXALIGN with an XLogRecPtr argument.

To fix, add wider variants of the macros, called TYPEALIGN64 and MAXALIGN64,
which are just like the existing variants but work with uint64 instead of
intptr_t.

Report and patch by David Rowley, analysis by Andres Freund.
2013-10-08 01:59:57 +03:00
Heikki Linnakangas 81fbbfe335 Fix bugs in SSI tuple locking.
1. In heap_hot_search_buffer(), the PredicateLockTuple() call is passed
wrong offset number. heapTuple->t_self is set to the tid of the first
tuple in the chain that's visited, not the one actually being read.

2. CheckForSerializableConflictIn() uses the tuple's t_ctid field
instead of t_self to check for exiting predicate locks on the tuple. If
the tuple was updated, but the updater rolled back, t_ctid points to the
aborted dead tuple.

Reported by Hannu Krosing. Backpatch to 9.1.
2013-10-08 00:18:43 +03:00
Peter Eisentraut 0b109c822b Translation updates 2013-10-07 16:51:52 -04:00
Robert Haas 16a906f535 Make DISCARD SEQUENCES also discard the last used sequence.
Otherwise, we access already-freed memory.  Oops.

Report by Michael Paquier.  Fix by me.
2013-10-07 15:55:56 -04:00
Kevin Grittner c01262a824 Eliminate xmin from hash tag for predicate locks on heap tuples.
If a tuple was frozen while its predicate locks mattered,
read-write dependencies could be missed, resulting in failure to
detect conflicts which could lead to anomalies in committed
serializable transactions.

This field was added to the tag when we still thought that it was
necessary to carry locks forward to a new version of an updated
row.  That was later proven to be unnecessary, which allowed
simplification of the code, but elimination of xmin from the tag
was missed at the time.

Per report and analysis by Heikki Linnakangas.
Backpatch to 9.1.
2013-10-07 14:16:54 -05:00
Alvaro Herrera bf2617981c Fix various bugs in postmaster SIGKILL processing
Clamp the minimum sleep time during immediate shutdown or crash to a
minimum of zero, not a maximum of one second.  The previous code could
result in a negative sleep time, leading to failure in select() calls.

Also, on crash recovery, reset AbortStartTime as soon as SIGKILL is sent
or abort processing has commenced instead of waiting until the startup
process completes.  Reset AbortStartTime as soon as SIGKILL is sent,
too, to avoid doing that repeatedly.

Per trouble report from Jeff Janes on
CAMkU=1xd3=wFqZwwuXPWe4BQs3h1seYo8LV9JtSjW5RodoPxMg@mail.gmail.com

Author: MauMau
2013-10-05 23:52:04 -03:00
Bruce Momjian a54141aebc Issue error on SET outside transaction block in some cases
Issue error for SET LOCAL/CONSTRAINTS/TRANSACTION outside a transaction
block, as they have no effect.

Per suggestion from Morten Hustveit
2013-10-04 13:50:28 -04:00
Robert Haas 0f1ef79095 Fix silly thinko in ResetSequenceCaches.
Report from Kevin Hale Boyes.
2013-10-03 20:17:51 -04:00
Robert Haas d90ced8bb2 Add DISCARD SEQUENCES command.
DISCARD ALL will now discard cached sequence information, as well.

Fabrízio de Royes Mello, reviewed by Zoltán Böszörményi, with some
further tweaks by me.
2013-10-03 16:23:31 -04:00
Heikki Linnakangas c2b175b948 Minor GIN code refactoring.
It makes for cleaner code to have separate Get/Add functions for PostingItems
and ItemPointers.  A few callsites that have to deal with both types need to
be duplicated because of this, but all the callers have to know which one
they're dealing with anyway. Overall, this reduces the amount of casting
required.

Extracted from Alexander Korotkov's larger patch to change the data page
format.
2013-10-03 11:51:31 +03:00
Bruce Momjian d50f281210 Adjust C comments that would be wrap-able. 2013-10-01 19:45:01 -04:00
Alvaro Herrera 15732b34e8 Add WaitForLockers in lmgr, refactoring index.c code
This is in support of a future REINDEX CONCURRENTLY feature.

Michael Paquier
2013-10-01 17:57:01 -03:00
Heikki Linnakangas ee01d848f3 In bms_add_member(), use repalloc() if the bms needs to be enlarged.
Previously bms_add_member() would palloc a whole-new copy of the existing
set, copy the words, and pfree the old one. repalloc() is potentially much
faster, and more importantly, this is less surprising if CurrentMemoryContext
is not the same as the context the old set is in. bms_add_member() still
allocates a new bitmapset in CurrentMemoryContext if NULL is passed as
argument, but that is a lot less likely to induce bugs.

Nicholas White.
2013-09-30 16:54:03 +03:00
Heikki Linnakangas 357f752138 Fix snapshot leak if lo_open called on non-existent object.
lo_open registers the currently active snapshot, and checks if the
large object exists after that. Normally, snapshots registered by lo_open
are unregistered at end of transaction when the lo descriptor is closed, but
if we error out before the lo descriptor is added to the list of open
descriptors, it is leaked. Fix by moving the snapshot registration to after
checking if the large object exists.

Reported by Pavel Stehule. Backpatch to 8.4. The snapshot registration
system was introduced in 8.4, so prior versions are not affected (and not
supported, anyway).
2013-09-30 12:53:14 +03:00
Robert Haas 4334639f4b Allow printf-style padding specifications in log_line_prefix.
David Rowley, after a suggestion from Heikki Linnakangas.  Reviewed by
Albe Laurenz, and further edited by me.
2013-09-26 17:56:31 -04:00
Heikki Linnakangas adaba2751f Fix spurious warning after vacuuming a page on a table with no indexes.
There is a rare race condition, when a transaction that inserted a tuple
aborts while vacuum is processing the page containing the inserted tuple.
Vacuum prunes the page first, which normally removes any dead tuples, but
if the inserting transaction aborts right after that, the loop after
pruning will see a dead tuple and remove it instead. That's OK, but if the
page is on a table with no indexes, and the page becomes completely empty
after removing the dead tuple (or tuples) on it, it will be immediately
marked as all-visible. That's OK, but the sanity check in vacuum would
throw a warning because it thinks that the page contains dead tuples and
was nevertheless marked as all-visible, even though it just vacuumed away
the dead tuples and so it doesn't actually contain any.

Spotted this while reading the code. It's difficult to hit the race
condition otherwise, but can be done by putting a breakpoint after the
heap_page_prune() call.

Backpatch all the way to 8.4, where this code first appeared.
2013-09-26 11:31:53 +03:00
Heikki Linnakangas 77ae7f7c35 Plug memory leak in range_cmp function.
B-tree operators are not allowed to leak memory into the current memory
context. Range_cmp leaked detoasted copies of the arguments. That caused
a quick out-of-memory error when creating an index on a range column.

Reported by Marian Krucina, bug #8468.
2013-09-25 16:02:00 +03:00
Alvaro Herrera b2fc4d6142 Fix pgindent comment breakage 2013-09-24 18:20:37 -03:00
Robert Haas ba3d39c969 Don't allow system columns in CHECK constraints, except tableoid.
Previously, arbitray system columns could be mentioned in table
constraints, but they were not correctly checked at runtime, because
the values weren't actually set correctly in the tuple.  Since it
seems easy enough to initialize the table OID properly, do that,
and continue allowing that column, but disallow the rest unless and
until someone figures out a way to make them work properly.

No back-patch, because this doesn't seem important enough to take the
risk of destabilizing the back branches.  In fact, this will pose a
dump-and-reload hazard for those upgrading from previous versions:
constraints that were accepted before but were not correctly enforced
will now either be enforced correctly or not accepted at all.  Either
could result in restore failures, but in practice I think very few
users will notice the difference, since the use case is pretty
marginal anyway and few users will be relying on features that have
not historically worked.

Amit Kapila, reviewed by Rushabh Lathia, with doc changes by me.
2013-09-23 13:31:22 -04:00
Robert Haas 496439d943 Fix compiler warning in WaitForBackgroundWorkerStartup().
Per complaint from Andrew Gierth.
2013-09-19 13:00:17 -04:00
Robert Haas 86a174bff0 Typo fix.
Etsuro Fujita
2013-09-18 08:57:44 -04:00
Alvaro Herrera 1247ea28cb Remove `proc` argument from LockCheckConflicts
This has been unused since commit 8563ccae2c.

Noted by Antonin Houska
2013-09-16 22:14:14 -03:00
Alvaro Herrera dd778e9d88 Rename various "freeze multixact" variables
It seems to make more sense to use "cutoff multixact" terminology
throughout the backend code; "freeze" is associated with replacing of an
Xid with FrozenTransactionId, which is not what we do for MultiXactIds.

Andres Freund
Some adjustments by Álvaro Herrera
2013-09-16 15:47:31 -03:00
Heikki Linnakangas 0892ecbc01 Add a GUC to report whether data page checksums are enabled.
Bernd Helmle
2013-09-16 14:36:01 +03:00
Noah Misch d41cb869aa Ignore interrupts during quickdie().
Once the administrator has called for an immediate shutdown or a backend
crash has triggered a reinitialization, no mere SIGINT or SIGTERM should
change that course.  Such derailment remains possible when the signal
arrives before quickdie() blocks signals.  That being a narrow race
affecting most PostgreSQL signal handlers in some way, leave it for
another patch.  Back-patch this to all supported versions.
2013-09-11 20:10:15 -04:00
Peter Eisentraut b34f8f409b Show schemas in information_schema.schemata that the current has access to
Before, it would only show schemas that the current user owns.  Per
discussion, the new behavior is more useful and consistent for PostgreSQL.
2013-09-09 22:25:37 -04:00
Robert Haas 71901ab6da Introduce InvalidCommandId.
This allows a 32-bit field to represent an *optional* command ID
without a separate flag bit.

Andres Freund
2013-09-09 16:25:29 -04:00
Noah Misch b8104730c8 Don't VALGRIND_PRINTF() each query string.
Doing so was helpful for some Valgrind usage and distracting for other
usage.  One can achieve the same effect by changing log_statement and
pointing both PostgreSQL and Valgrind logging to stderr.

Per gripe from Andres Freund.
2013-09-06 19:42:00 -04:00
Kevin Grittner 277607d600 Eliminate pg_rewrite.ev_attr column and related dead code.
Commit 95ef6a3448 removed the
ability to create rules on an individual column as of 7.3, but
left some residual code which has since been useless.  This cleans
up that dead code without any change in behavior other than
dropping the useless column from the catalog.
2013-09-05 14:03:43 -05:00
Heikki Linnakangas 20cb18db46 Make catalog cache hash tables resizeable.
If the hash table backing a catalog cache becomes too full (fillfactor > 2),
enlarge it. A new buckets array, double the size of the old, is allocated,
and all entries in the old hash are moved to the right bucket in the new
hash.

This has two benefits. First, cache lookups don't get so expensive when
there are lots of entries in a cache, like if you access hundreds of
thousands of tables. Second, we can make the (initial) sizes of the caches
much smaller, which saves memory.

This patch dials down the initial sizes of the catcaches. The new sizes are
chosen so that a backend that only runs a few basic queries still won't need
to enlarge any of them.
2013-09-05 20:20:03 +03:00
Jeff Davis b1892aaeaa Revert WAL posix_fallocate() patches.
This reverts commit 269e780822
and commit 5b571bb8c8.

Unfortunately, the initial patch had insufficient performance testing,
and resulted in a regression.

Per report by Thom Brown.
2013-09-04 23:43:41 -07:00
Bruce Momjian f5c2f5a8f6 Add GUC descriptions for compile-time postgresql.conf settings
Previous text was "No description available".

Tianyin Xu
2013-09-04 17:44:04 -04:00
Heikki Linnakangas 375d8526f2 Keep heavily-contended fields in XLogCtlInsert on different cache lines.
Performance testing shows that if the insertpos_lck spinlock and the fields
that it protects are on the same cache line with other variables that are
frequently accessed, the false sharing can hurt performance a lot. Keep
them apart by adding some padding.
2013-09-04 23:14:33 +03:00
Robert Haas cc52d5b33f Expose fsync_fname as a public API.
Andres Freund
2013-09-04 11:15:00 -04:00
Tom Lane 0c66a22377 Update comments concerning PGC_S_TEST.
This GUC context value was once only used by ALTER DATABASE SET and
ALTER USER SET.  That's not true anymore, though, so rewrite the
comments to be a bit more general.

Patch in HEAD only, since this is just an internal documentation issue.
2013-09-03 18:56:22 -04:00
Tom Lane 546f7c2e38 Don't fail for bad GUCs in CREATE FUNCTION with check_function_bodies off.
The previous coding attempted to activate all the GUC settings specified
in SET clauses, so that the function validator could operate in the GUC
environment expected by the function body.  However, this is problematic
when restoring a dump, since the SET clauses might refer to database
objects that don't exist yet.  We already have the parameter
check_function_bodies that's meant to prevent forward references in
function definitions from breaking dumps, so let's change CREATE FUNCTION
to not install the SET values if check_function_bodies is off.

Authors of function validators were already advised not to make any
"context sensitive" checks when check_function_bodies is off, if indeed
they're checking anything at all in that mode.  But extend the
documentation to point out the GUC issue in particular.

(Note that we still check the SET clauses to some extent; the behavior
with !check_function_bodies is now approximately equivalent to what ALTER
DATABASE/ROLE have been doing for awhile with context-dependent GUCs.)

This problem can be demonstrated in all active branches, so back-patch
all the way.
2013-09-03 18:32:20 -04:00
Tom Lane 0d3f4406df Allow aggregate functions to be VARIADIC.
There's no inherent reason why an aggregate function can't be variadic
(even VARIADIC ANY) if its transition function can handle the case.
Indeed, this patch to add the feature touches none of the planner or
executor, and little of the parser; the main missing stuff was DDL and
pg_dump support.

It is true that variadic aggregates can create the same sort of ambiguity
about parameters versus ORDER BY keys that was complained of when we
(briefly) had both one- and two-argument forms of string_agg().  However,
the policy formed in response to that discussion only said that we'd not
create any built-in aggregates with varying numbers of arguments, not that
we shouldn't allow users to do it.  So the logical extension of that is
we can allow users to make variadic aggregates as long as we're wary about
shipping any such in core.

In passing, this patch allows aggregate function arguments to be named, to
the extent of remembering the names in pg_proc and dumping them in pg_dump.
You can't yet call an aggregate using named-parameter notation.  That seems
like a likely future extension, but it'll take some work, and it's not what
this patch is really about.  Likewise, there's still some work needed to
make window functions handle VARIADIC fully, but I left that for another
day.

initdb forced because of new aggvariadic field in Aggref parse nodes.
2013-09-03 17:08:46 -04:00
Heikki Linnakangas a93bdfc711 Fix typo in comment.
Also line-wrap an over-wide line in a comment that's ignored by pgindent.
2013-09-03 13:17:09 +03:00
Peter Eisentraut 6a007fa1eb Translation updates 2013-09-02 02:43:18 -04:00
Tom Lane 8e2b71d2d0 Reset the binary heap in MergeAppend rescans.
Failing to do so can cause queries to return wrong data, error out or crash.
This requires adding a new binaryheap_reset() method to binaryheap.c,
but that probably should have been there anyway.

Per bug #8410 from Terje Elde.  Diagnosis and patch by Andres Freund.
2013-08-30 19:15:21 -04:00
Alvaro Herrera 9381cb5229 Make error wording more consistent 2013-08-29 12:42:28 -04:00
Robert Haas 090d0f2050 Allow discovery of whether a dynamic background worker is running.
Using the infrastructure provided by this patch, it's possible either
to wait for the startup of a dynamically-registered background worker,
or to poll the status of such a worker without waiting.  In either
case, the current PID of the worker process can also be obtained.
As usual, worker_spi is updated to demonstrate the new functionality.

Patch by me.  Review by Andres Freund.
2013-08-28 14:08:13 -04:00
Robert Haas c9e2e2db5c Partially restore comments discussing enum renumbering hazards.
As noted by Tom Lane, commit 813fb03155
was overly optimistic about how safe it is to concurrently change
enumsortorder values under MVCC catalog scan semantics.  Restore
some of the previous text, with hopefully-correct adjustments for
the new state of play.
2013-08-28 13:21:08 -04:00
Alvaro Herrera e246cfc95f Initialize cached OID to Invalid in new hash entries
Andres Freund; bug detected by valgrind
2013-08-27 14:53:17 -04:00
Tom Lane 2aac3399ae Account better for planning cost when choosing whether to use custom plans.
The previous coding in plancache.c essentially used 10% of the estimated
runtime as its cost estimate for planning.  This can be pretty bogus,
especially when the estimated runtime is very small, such as in a simple
expression plan created by plpgsql, or a simple INSERT ... VALUES.

While we don't have a really good handle on how planning time compares
to runtime, it seems reasonable to use an estimate based on the number of
relations referenced in the query, with a rather large multiplier.  This
patch uses 1000 * cpu_operator_cost * (nrelations + 1), so that even a
trivial query will be charged 1000 * cpu_operator_cost for planning.
This should address the problem reported by Marc Cousin and others that
9.2 and up prefer custom plans in cases where the planning time greatly
exceeds what can be saved.
2013-08-24 15:14:17 -04:00
Magnus Hagander db4ef73760 Don't crash when pg_xlog is empty and pg_basebackup -x is used
The backup will not work (without a logarchive, and that's the whole
point of -x) in this case, this patch just changes it to throw an
error instead of crashing when this happens.

Noticed and diagnosed by TAKATSUKA Haruka
2013-08-24 17:13:49 +02:00
Tom Lane fcf9ecad57 In locate_grouping_columns(), don't expect an exact match of Var typmods.
It's possible that inlining of SQL functions (or perhaps other changes?)
has exposed typmod information not known at parse time.  In such cases,
Vars generated by query_planner might have valid typmod values while the
original grouping columns only have typmod -1.  This isn't a semantic
problem since the behavior of grouping only depends on type not typmod,
but it breaks locate_grouping_columns' use of tlist_member to locate the
matching entry in query_planner's result tlist.

We can fix this without an excessive amount of new code or complexity by
relying on the fact that locate_grouping_columns only gets called when
make_subplanTargetList has set need_tlist_eval == false, and that can only
happen if all the grouping columns are simple Vars.  Therefore we only need
to search the sub_tlist for a matching Var, and we can reasonably define a
"match" as being a match of the Var identity fields
varno/varattno/varlevelsup.  The code still Asserts that vartype matches,
but ignores vartypmod.

Per bug #8393 from Evan Martin.  The added regression test case is
basically the same as his example.  This has been broken for a very long
time, so back-patch to all supported branches.
2013-08-23 17:30:53 -04:00
Tom Lane 3454876314 Fix hash table size estimation error in choose_hashed_distinct().
We should account for the per-group hashtable entry overhead when
considering whether to use a hash aggregate to implement DISTINCT.  The
comparable logic in choose_hashed_grouping() gets this right, but I think
I omitted it here in the mistaken belief that there would be no overhead
if there were no aggregate functions to be evaluated.  This can result in
more than 2X underestimate of the hash table size, if the tuples being
aggregated aren't very wide.  Per report from Tomas Vondra.

This bug is of long standing, but per discussion we'll only back-patch into
9.3.  Changing the estimation behavior in stable branches seems to carry too
much risk of destabilizing plan choices for already-tuned applications.
2013-08-21 13:38:34 -04:00
Tom Lane 20fe870753 Be more wary of unwanted whitespace in pgstat_reset_remove_files().
sscanf isn't the easiest thing to use for exact pattern checks ...
also, don't use strncmp where strcmp will do.
2013-08-19 19:36:04 -04:00
Alvaro Herrera f9b50b7c18 Fix removal of files in pgstats directories
Instead of deleting all files in stats_temp_directory and the permanent
directory on a crash, only remove those files that match the pattern of
files we actually write in them, to avoid possibly clobbering existing
unrelated contents of the temporary directory.  Per complaint from Jeff
Janes, and subsequent discussion, starting at message
CAMkU=1z9+7RsDODnT4=cDFBRBp8wYQbd_qsLcMtKEf-oFwuOdQ@mail.gmail.com

Also, fix a bug in the same routine to avoid removing files from the
permanent directory twice (instead of once from that directory and then
from the temporary directory), also per report from Jeff Janes, in
message
CAMkU=1wbk947=-pAosDMX5VC+sQw9W4ttq6RM9rXu=MjNeEQKA@mail.gmail.com
2013-08-19 17:48:17 -04:00
Heikki Linnakangas 3619a20d33 Rename the "fast_promote" file to just "promote".
This keeps the usual trigger file name unchanged from 9.2, avoiding nasty
issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa.
The fallback behavior of creating a full checkpoint before starting up is now
triggered by a file called "fallback_promote". That can be useful for
debugging purposes, but we don't expect any users to have to resort to that
and we might want to remove that in the future, which is why the fallback
mechanism is undocumented.
2013-08-19 20:59:51 +03:00
Tom Lane c64de21e96 Fix qual-clause-misplacement issues with pulled-up LATERAL subqueries.
In an example such as
SELECT * FROM
  i LEFT JOIN LATERAL (SELECT * FROM j WHERE i.n = j.n) j ON true;
it is safe to pull up the LATERAL subquery into its parent, but we must
then treat the "i.n = j.n" clause as a qual clause of the LEFT JOIN.  The
previous coding in deconstruct_recurse mistakenly labeled the clause as
"is_pushed_down", resulting in wrong semantics if the clause were applied
at the join node, as per an example submitted awhile ago by Jeremy Evans.
To fix, postpone processing of such clauses until we return back up to
the appropriate recursion depth in deconstruct_recurse.

In addition, tighten the is-safe-to-pull-up checks in is_simple_subquery;
we previously missed the possibility that the LATERAL subquery might itself
contain an outer join that makes lateral references in lower quals unsafe.

A regression test case equivalent to Jeremy's example was already in my
commit of yesterday, but was giving the wrong results because of this
bug.  This patch fixes the expected output for that, and also adds a
test case for the second problem.
2013-08-19 13:19:41 -04:00
Alvaro Herrera 78e1220104 Fix pg_upgrade failure from servers older than 9.3
When upgrading from servers of versions 9.2 and older, and MultiXactIds
have been used in the old server beyond the first page (that is, 2048
multis or more in the default 8kB-page build), pg_upgrade would set the
next multixact offset to use beyond what has been allocated in the new
cluster.  This would cause a failure the first time the new cluster
needs to use this value, because the pg_multixact/offsets/ file wouldn't
exist or wouldn't be large enough.  To fix, ensure that the transient
server instances launched by pg_upgrade extend the file as necessary.

Per report from Jesse Denardo in
CANiVXAj4c88YqipsyFQPboqMudnjcNTdB3pqe8ReXqAFQ=HXyA@mail.gmail.com
2013-08-19 12:56:18 -04:00
Peter Eisentraut a2f2e902b8 Translation updates 2013-08-18 23:41:03 -04:00
Kevin Grittner 28154bb23b Remove relcache entry invalidation in REFRESH MATERIALIZED VIEW.
This was added as part of the attempt to support unlogged matviews
along with a populated status.  It got missed when unlogged
support was removed pre-commit.

Noticed by Noah Misch.  Back-patched to 9.3 branch.
2013-08-18 16:19:22 -05:00
Tom Lane f1d5fce7cf Fix thinko in comment. 2013-08-17 20:36:29 -04:00
Tom Lane 9e7e29c75a Fix planner problems with LATERAL references in PlaceHolderVars.
The planner largely failed to consider the possibility that a
PlaceHolderVar's expression might contain a lateral reference to a Var
coming from somewhere outside the PHV's syntactic scope.  We had a previous
report of a problem in this area, which I tried to fix in a quick-hack way
in commit 4da6439bd8, but Antonin Houska
pointed out that there were still some problems, and investigation turned
up other issues.  This patch largely reverts that commit in favor of a more
thoroughly thought-through solution.  The new theory is that a PHV's
ph_eval_at level cannot be higher than its original syntactic level.  If it
contains lateral references, those don't change the ph_eval_at level, but
rather they create a lateral-reference requirement for the ph_eval_at join
relation.  The code in joinpath.c needs to handle that.

Another issue is that createplan.c wasn't handling nested PlaceHolderVars
properly.

In passing, push knowledge of lateral-reference checks for join clauses
into join_clause_is_movable_to.  This is mainly so that FDWs don't need
to deal with it.

This patch doesn't fix the original join-qual-placement problem reported by
Jeremy Evans (and indeed, one of the new regression test cases shows the
wrong answer because of that).  But the PlaceHolderVar problems need to be
fixed before that issue can be addressed, so committing this separately
seems reasonable.
2013-08-17 20:22:37 -04:00
Robert Haas 2dee7998f9 Move more bgworker code to bgworker.c; also, some renaming.
Per discussion on pgsql-hackers.

Michael Paquier, slightly modified by me.  Original suggestion
from Amit Kapila.
2013-08-16 15:31:28 -04:00
Heikki Linnakangas 05cbce6f30 Fix typo in comment. 2013-08-16 16:26:22 +03:00
Kevin Grittner 3f78b1715c Don't allow ALTER MATERIALIZED VIEW ADD UNIQUE.
Was accidentally allowed, but not documented and lacked support
for rename or drop once created.

Per report from Noah Misch.
2013-08-15 13:14:48 -05:00
Peter Eisentraut 229fb58d4f Treat timeline IDs as unsigned in replication parser
Timeline IDs are unsigned ints everywhere, except the replication parser
treated them as signed ints.
2013-08-14 23:18:49 -04:00
Peter Eisentraut 32f7c0ae17 Improve error message when view is not updatable
Avoid using the term "updatable" in confusing ways.  Suggest a trigger
first, before a rule.
2013-08-14 23:02:59 -04:00