Commit Graph

12804 Commits

Author SHA1 Message Date
Simon Riggs 2c8a4e9be2 Wake WALSender to reduce data loss at failover for async commit.
WALSender now woken up after each background flush by WALwriter, avoiding
multi-second replication delay for an all-async commit workload.
Replication delay reduced from 7s with default settings to 200ms and often
much less, allowing significantly reduced data loss at failover.

Andres Freund and Simon Riggs
2012-06-07 19:22:47 +01:00
Robert Haas b50991eedb Fix more crash-safe visibility map bugs, and improve comments.
In lazy_scan_heap, we could issue bogus warnings about incorrect
information in the visibility map, because we checked the visibility
map bit before locking the heap page, creating a race condition.  Fix
by rechecking the visibility map bit before we complain.  Rejigger
some related logic so that we rely on the possibly-outdated
all_visible_according_to_vm value as little as possible.

In heap_multi_insert, it's not safe to clear the visibility map bit
before beginning the critical section.  The visibility map is not
crash-safe unless we treat clearing the bit as a critical operation.
Specifically, if the transaction were to error out after we set the
bit and before entering the critical section, we could end up writing
the heap page to disk (with the bit cleared) and crashing before the
visibility map page made it to disk.  That would be bad.  heap_insert
has this correct, but somehow the order of operations got rearranged
when heap_multi_insert was added.

Also, add some more comments to visibilitymap_test, lazy_scan_heap,
and IndexOnlyNext, expounding on concurrency issues.

Per extensive code review by Andres Freund, and further review by Tom
Lane, who also made the original report about the bogus warnings.
2012-06-07 12:48:13 -04:00
Tom Lane 3dd8e59681 Fix bogus handling of control characters in json_lex_string().
The original coding misbehaved if "char" is signed, and also made the
extremely poor decision to print control characters literally when trying
to complain about them.  Report and patch by Shigeru Hanada.

In passing, also fix core dump risk in report_parse_error() should the
parse state be something other than what it expects.
2012-06-04 20:43:57 -04:00
Simon Riggs d3abbbebe5 Avoid early reuse of btree pages, causing incorrect query results.
When we allowed read-only transactions to skip assigning XIDs
we introduced the possibility that a fully deleted btree page
could be reused. This broke the index link sequence which could
then lead to indexscans silently returning fewer rows than would
have been correct. The actual incidence of silent errors from
this is thought to be very low because of the exact workload
required and locking pre-conditions. Fix is to remove pages only
if index page opaque->btpo.xact precedes RecentGlobalXmin.

Noah Misch, reviewed by Simon Riggs
2012-06-01 12:21:45 +01:00
Simon Riggs 055c352abb After any checkpoint, close all smgr files handles in bgwriter 2012-06-01 09:24:53 +01:00
Simon Riggs a297d64d92 Checkpointer starts before bgwriter to avoid missing fsync requests.
Noted while testing Hot Standby startup.
2012-06-01 08:25:17 +01:00
Simon Riggs 1ec6a2bbc9 Provide interim statistics while in mid-checkpoint.
Re-implements similar functionality in 9.1 and previously which
was removed during split of checkpointer and bgwriter.

Requested/spotted by Magnus Hagander
2012-06-01 08:19:06 +01:00
Tom Lane a04dc87db1 Improve comment for GetStableLatestTransactionId(). 2012-05-31 11:20:02 -04:00
Simon Riggs a2b516dab9 Only throw recovery conflicts when InHotStandby. Bug fix to recent
patch to allow Index Only Scans on Hot Standby.

Bug report from Jaime Casanova
2012-05-31 13:11:47 +01:00
Tom Lane ad0009e7be Force PL and range-type support functions to be owned by a superuser.
We allow non-superusers to create procedural languages (with restrictions)
and range datatypes.  Previously, the automatically-created support
functions for these objects ended up owned by the creating user.  This
represents a rather considerable security hazard, because the owning user
might be able to alter a support function's definition in such a way as to
crash the server, inject trojan-horse SQL code, or even execute arbitrary
C code directly.  It appears that right now the only actually exploitable
problem is the infinite-recursion bug fixed in the previous patch for
CVE-2012-2655.  However, it's not hard to imagine that future additions of
more ALTER FUNCTION capability might unintentionally open up new hazards.
To forestall future problems, cause these support functions to be owned by
the bootstrap superuser, not the user creating the parent object.
2012-05-30 23:47:57 -04:00
Tom Lane 33c6eaf78e Ignore SECURITY DEFINER and SET attributes for a PL's call handler.
It's not very sensible to set such attributes on a handler function;
but if one were to do so, fmgr.c went into infinite recursion because
it would call fmgr_security_definer instead of the handler function proper.
There is no way for fmgr_security_definer to know that it ought to call the
handler and not the original function referenced by the FmgrInfo's fn_oid,
so it tries to do the latter, causing the whole process to start over
again.

Ordinarily such misconfiguration of a procedural language's handler could
be written off as superuser error.  However, because we allow non-superuser
database owners to create procedural languages and the handler for such a
language becomes owned by the database owner, it is possible for a database
owner to crash the backend, which ideally shouldn't be possible without
superuser privileges.  In 9.2 and up we will adjust things so that the
handler functions are always owned by superusers, but in existing branches
this is a minor security fix.

Problem noted by Noah Misch (after several of us had failed to detect
it :-().  This is CVE-2012-2655.
2012-05-30 23:27:57 -04:00
Tom Lane cd0ff9c0f4 Expand the allowed range of timezone offsets to +/-15:59:59 from Greenwich.
We used to only allow offsets less than +/-13 hours, then it was +/14,
then it was +/-15.  That's still not good enough though, as per today's bug
report from Patric Bechtel.  This time I actually looked through the Olson
timezone database to find the largest offsets used anywhere.  The winners
are Asia/Manila, at -15:56:00 until 1844, and America/Metlakatla, at
+15:13:42 until 1867.  So we'd better allow offsets less than +/-16 hours.

Given the history, we are way overdue to have some greppable #define
symbols controlling this, so make some ... and also remove an obsolete
comment that didn't get fixed the last time.

Back-patch to all supported branches.
2012-05-30 19:58:35 -04:00
Robert Haas 07ab1383e3 Fix two more bugs in fast-path relation locking.
First, the previous code failed to account for the fact that, during Hot
Standby operation, the startup process takes AccessExclusiveLocks on
relations without setting MyDatabaseId.  This resulted in fast path
strong lock counts failing to be incremented with the startup process
took locks, which in turn allowed conflicting lock requests to succeed
when they should not have.  Report by Erik Rijkers, diagnosis by Heikki
Linnakangas.

Second, LockReleaseAll() failed to honor the allLocks and lockmethodid
restrictions with respect to fast-path locks.  It's not clear to me
whether this produces any user-visible breakage at the moment, but it's
certainly wrong.  Rearrange order of operations in LockReleaseAll to fix.
Noted by Tom Lane.
2012-05-30 16:17:46 -04:00
Heikki Linnakangas d1996ed5e8 Change the way parent pages are tracked during buffered GiST build.
We used to mimic the way a stack is constructed when descending the tree
during normal GiST inserts, but that was quite complicated during a buffered
build. It was also wrong: in GiST, the left-to-right relationships on
different levels might not match each other, so that when you know the
parent of a child page, you won't necessarily find the parent of the page to
the right of the child page by following the rightlinks at the parent level.
This sometimes led to "could not re-find parent" errors while building a
GiST index.

We now use a simple hash table to track the parent of every internal page.
Whenever a page is split, and downlinks are moved from one page to another,
we update the hash table accordingly. This is also better for performance
than the old method, as we never need to move right to re-find the parent
page, which could take a significant amount of time for buffers that were
created much earlier in the index build.
2012-05-30 12:05:57 +03:00
Heikki Linnakangas be02b16826 Delete the temporary file used in buffered GiST build, after the build.
There were two bugs here: We forgot to call gistFreeBuildBuffers() function
at the end of build, and we passed interXact == true to BufFileCreateTemp,
so the file wasn't automatically cleaned up at end-of-transaction either.
2012-05-30 12:05:57 +03:00
Heikki Linnakangas 4bc6fb57f7 Fix integer overflow bug in GiST buffering build calculations.
The result of (maintenance_work_mem * 1024) / BLCKSZ doesn't fit in a signed
32-bit integer, if maintenance_work_mem >= 2GB. Use double instead. And
while we're at it, write the calculations in an easier to understand form,
with the intermediary steps written out and commented.
2012-05-29 22:27:42 +03:00
Tom Lane 2755abf386 Teach AbortOutOfAnyTransaction to clean up partially-started transactions.
AbortOutOfAnyTransaction failed to do anything if the state it saw on
entry corresponded to failing partway through StartTransaction.  I fixed
AbortCurrentTransaction to cope with that case way back in commit
60b2444cc3, but evidently overlooked that
AbortOutOfAnyTransaction should do likewise.

Back-patch to all supported branches.  It's not clear that this omission
has any more-than-cosmetic consequences, but it's also not clear that it
doesn't, so back-patching seems the least risky choice.
2012-05-28 23:57:06 -04:00
Peter Eisentraut 388d251679 Update SQL features list
Set E081 Basic Privileges to supported, since by the letter of it, we
support it, even though not all possible forms of USAGE privileges are
implemented.
2012-05-27 23:34:16 +03:00
Peter Eisentraut 27314d32a8 Suppress -Wunused-result warning about write()
This is related to aa90e148ca, but this
code is only used under -DLINUX_OOM_ADJ, so it was apparently
overlooked then.
2012-05-27 22:35:01 +03:00
Tom Lane 532fe28dad Prevent synchronized scanning when systable_beginscan chooses a heapscan.
The only interesting-for-performance case wherein we force heapscan here
is when we're rebuilding the relcache init file, and the only such case
that is likely to be examining a catalog big enough to be syncscanned is
RelationBuildTupleDesc.  But the early-exit optimization in that code gets
broken if we start the scan at a random place within the catalog, so that
allowing syncscan is actually a big deoptimization if pg_attribute is large
(at least for the normal case where the rows for core system catalogs have
never been changed since initdb).  Hence, prevent syncscan here.  Per my
testing pursuant to complaints from Jeff Frost and Greg Sabino Mullane,
though neither of them seem to have actually hit this specific problem.

Back-patch to 8.3, where syncscan was introduced.
2012-05-26 19:09:52 -04:00
Tom Lane d3b97d1488 Fix string truncation to be multibyte-aware in text_name and bpchar_name.
Previously, casts to name could generate invalidly-encoded results.

Also, make these functions match namein() more exactly, by consistently
using palloc0() instead of ad-hoc zeroing code.

Back-patch to all supported branches.

Karl Schnaitter and Tom Lane
2012-05-25 17:34:51 -04:00
Tom Lane 2a4c46e0ba Fix array overrun in regex code.
zaptreesubs() was coded to unconditionally reset a capture subre's
corresponding pmatch[] entry.  However, in regexes without backrefs, that
array is caller-supplied and might not have as many entries as the regex
has capturing parens.  So check the array length and do nothing if there
is no corresponding entry, much as subset() does.  Failure to check this
resulted in a stack clobber in the case reported by Marko Kreen.

This bug appears to have been latent in the regex library from the
beginning.  It was not exposed because find() called dissect() not
cdissect(), and the dissect() code path didn't ever call zaptreesubs()
(formerly zapmem()).  When I unified dissect() and cdissect() in commit
4dd78bf37a, the problem was exposed.

Now that I've seen this, I'm rather suspicious that we might need to
back-patch it; but will refrain for now, for lack of evidence that
the case can be hit in the previous coding.
2012-05-24 13:56:16 -04:00
Tom Lane ed962fd712 Ensure that seqscans check for interrupts at least once per page.
If a seqscan encounters many consecutive pages containing only dead tuples,
it can remain in the loop in heapgettup for a long time, and there was no
CHECK_FOR_INTERRUPTS anywhere in that loop.  This meant there were
real-world situations where a query would be effectively uncancelable for
long stretches.  Add a check placed to occur once per page, which should be
enough to provide reasonable response time without adding any measurable
overhead.

Report and patch by Merlin Moncure (though I tweaked it a bit).
Back-patch to all supported branches.
2012-05-22 19:42:05 -04:00
Robert Haas 8fbe5a317d Fix error message for COMMENT/SECURITY LABEL ON COLUMN xxx IS 'yyy'
When the column name is an unqualified name, rather than table.column,
the error message complains about too many dotted names, which is
wrong.  Report by Peter Eisentraut based on examination of the
sepgsql regression test output, but the problem also affects COMMENT.
New wording as suggested by Tom Lane.
2012-05-22 11:23:36 -04:00
Robert Haas 219c024c64 Repair out-of-date information in src/backend/storage/buffer/README.
In commit d526575f89, we changed things so
that buffer usage counts are incremented when the buffer is pinned, rather
than when it is unpinned, but the README file didn't get the memo.

Report by Amit Kapila.
2012-05-22 09:32:09 -04:00
Tom Lane b94ce6e807 Move postmaster's RemovePgTempFiles call to a less randomly chosen place.
There is no reason to do this as early as possible in postmaster startup,
and good reason not to do it until we have completely created the
postmaster's lock file, namely that it might contribute to pg_ctl thinking
that postmaster startup has timed out.  (This would require a rather
unusual amount of time to be spent scanning temp file directories, but we
have at least one field report of it happening reproducibly.)

Back-patch to 9.1.  Before that, pg_ctl didn't wait for additional info to
be added to the lock file, so it wasn't a problem.

Note that this is not a complete fix to the slow-start issue in 9.1,
because we still had identify_system_timezone being run during postmaster
start in 9.1.  But that's at least a reasonably well-defined delay, with
an easy workaround if needed, whereas the temp-files scan is not so
predictable and cannot be avoided.
2012-05-21 22:50:30 -04:00
Tom Lane efae4653c9 Update woefully-obsolete comment.
The accurate info about what's in a lock file has been in miscadmin.h
for some time, so let's just make this comment point there instead of
maintaining a duplicative copy.
2012-05-21 22:11:00 -04:00
Peter Eisentraut f1f6737e15 Fix incorrect logic in JSON number lexer
Detectable by gcc -Wlogical-op.

Add two regression test cases that would previously allow incorrect
values to pass.
2012-05-20 02:24:46 +03:00
Peter Eisentraut 2273a50364 Realign some --help output to have better spacing between columns 2012-05-18 20:34:14 +03:00
Heikki Linnakangas 1d27dcf578 Fix bug in gistRelocateBuildBuffersOnSplit().
When we create a temporary copy of the old node buffer, in stack, we mustn't
leak that into any of the long-lived data structures. Before this patch,
when we called gistPopItupFromNodeBuffer(), it got added to the array of
"loaded buffers". After gistRelocateBuildBuffersOnSplit() exits, the
pointer added to the loaded buffers array points to garbage. Often that goes
unnotied, because when we go through the array of loaded buffers to unload
them, buffers with a NULL pageBuffer are ignored, which can often happen by
accident even if the pointer points to garbage.

This patch fixes that by marking the temporary copy in stack explicitly as
temporary, and refrain from adding buffers marked as temporary to the array
of loaded buffers.

While we're at it, initialize nodeBuffer->pageBlocknum to InvalidBlockNumber
and improve comments a bit. This isn't strictly necessary, but makes
debugging easier.
2012-05-18 19:38:32 +03:00
Peter Eisentraut 939ec9b8a4 Update SQL features/conformance information to SQL:2011 2012-05-17 09:50:04 +03:00
Peter Eisentraut be6d1c88a4 Change COLLATION keyword category
It was changed from unreserved to reserved as part of the COLLATION
FOR syntax, but it turns out that type_func_name_keyword is
sufficient.
2012-05-16 20:19:44 +03:00
Tom Lane 488c6dd170 Improve error message for ALTER COLUMN TYPE coercion failure.
Per recent discussion, the error message for this was actually a trifle
inaccurate, since it said "cannot be cast" which might be incorrect.
Adjust that wording, and add a HINT suggesting that a USING clause might
be needed.
2012-05-16 07:28:25 -04:00
Heikki Linnakangas 6593c5b5dc Fix bug in freespace calculation in heap_multi_insert().
If the amount of freespace on page was less than the amount reserved by
fillfactor, the calculation would underflow.

This fixes bug #6643 reported by Tomonari Katsumata.
2012-05-16 14:13:06 +03:00
Peter Eisentraut c8e086795a Remove whitespace from end of lines
pgindent and perltidy should clean up the rest.
2012-05-15 22:19:41 +03:00
Peter Eisentraut 8afb026e57 Remove stray nbsp character 2012-05-15 21:38:59 +03:00
Heikki Linnakangas d2495f272c Fix bug in to_tsquery().
We were using memcpy() to copy to a possibly overlapping memory region,
which is a no-no. Use memmove() instead.
2012-05-15 19:27:34 +03:00
Tom Lane 9b63e9869f In pgstat.c, use a timeout in WaitLatchOrSocket only on Windows.
We have no need for a timeout here really, but some broken products from
Redmond seem to lose FD_READ events occasionally, and waking up and
retrying the recv() is the only known way to work around that.  Perhaps
somebody will be motivated to figure out a better answer here; but not I.
2012-05-14 23:51:34 -04:00
Tom Lane 5a2bb06012 Revert "Add some temporary instrumentation to pgstat.c."
This reverts commit 7d88bb73f7.
That instrumentation has served its purpose.
2012-05-14 23:08:10 -04:00
Tom Lane e42a21b9e6 Assert that WaitLatchOrSocket callers cannot wait only for writability.
Since we have chosen to report socket EOF and error conditions via the
WL_SOCKET_READABLE flag bit, it's unsafe to wait only for
WL_SOCKET_WRITEABLE; the caller would never be notified of the socket
condition, and in some of these implementations WaitLatchOrSocket would
busy-wait until something else happens.  Add this restriction to the API
specification, and add Asserts to check that callers don't try to do that.

At some point we might want to consider adjusting the API to relax this
restriction, but until we have an actual use case for waiting on a
write-only socket, it seems premature to design a solution.
2012-05-14 16:12:28 -04:00
Tom Lane d461d0502b For testing purposes, reinsert a timeout in pgstat.c's wait call.
Test results from buildfarm members mastodon/narwhal (Windows Server 2003)
make it look like that platform just plain loses FD_READ events
occasionally, and the only reason our previous coding seemed to work was
that it timed out every couple of seconds and retried the whole operation.
Try to verify this by reinserting a finite timeout into the pgstat loop.
This isn't meant to be a permanent patch either, just to confirm or
disprove a theory.
2012-05-14 15:03:14 -04:00
Tom Lane f1ca51549e Force pgwin32_recv into nonblock mode when called from pgstat.c.
This should get rid of the usage of pgwin32_waitforsinglesocket entirely,
and perhaps thereby remove the race condition that's evidently still
present on some versions of Windows.  The previous arrangement was a bit
unsafe anyway, since waiting at the recv() would not allow pgstat to notice
postmaster death.
2012-05-14 10:57:07 -04:00
Heikki Linnakangas f15c2eae9c Remove unnecessary pg_verifymbstr() calls from tsvector/query in functions.
The input should've been validated well before it hits the input function.
Doing so again is a waste of cycles.
2012-05-14 14:30:32 +03:00
Heikki Linnakangas 9e4637bf89 Update comments that became out-of-date with the PGXACT struct.
When the "hot" members of PGPROC were split off to separate PGXACT structs,
many PGPROC fields referred to in comments were moved to PGXACT, but the
comments were neglected in the commit. Mostly this is just a search/replace
of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
prepared transactions changed more, making some of the comments totally
bogus.

Noah Misch
2012-05-14 10:28:55 +03:00
Peter Eisentraut 64f09ca386 Remove leftovers of BeOS port
These should have been removed when the BeOS port was removed in
44f9021223.
2012-05-14 04:50:39 +03:00
Peter Eisentraut 6bf1e7668d Small punctuation editing of postgresql.conf.sample 2012-05-14 04:50:39 +03:00
Tom Lane 7d88bb73f7 Add some temporary instrumentation to pgstat.c.
Log main-loop blocking events and the results of inquiry messages.
This is to get some clarity as to what's happening on those Windows
buildfarm members that still don't like the latch-ified stats collector.
This bulks up the postmaster log a tad, so I won't leave it in place for
long.
2012-05-13 21:11:31 -04:00
Tom Lane b8347138e9 Fix DROP TABLESPACE to unlink symlink when directory is not there.
If the tablespace directory is missing entirely, we allow DROP TABLESPACE
to go through, on the grounds that it should be possible to clean up the
catalog entry in such a situation.  However, we forgot that the pg_tblspc
symlink might still be there.  We should try to remove the symlink too
(but not fail if it's no longer there), since not doing so can lead to
weird behavior subsequently, as per report from Michael Nolan.

There was some discussion of adding dependency links to prevent DROP
TABLESPACE when the catalogs still contain references to the tablespace.
That might be worth doing too, but it's an orthogonal question, and in
any case wouldn't be back-patchable.

Back-patch to 9.0, which is as far back as the logic looks like this.
We could possibly do something similar in 8.x, but given the lack of
reports I'm not sure it's worth the trouble, and anyway the case could
not arise in the form the logic is meant to cover (namely, a post-DROP
transaction rollback having resurrected the pg_tablespace entry after
some or all of the filesystem infrastructure is gone).
2012-05-13 18:06:52 -04:00
Tom Lane 966970ed63 Re-revert stats collector latch changes.
This reverts commit cb2f2873d6, restoring
the latch-ified stats collector logic.  We'll soon see if this works any
better on the Windows buildfarm machines.
2012-05-13 14:44:39 -04:00
Tom Lane b85427f227 Attempt to fix some issues in our Windows socket code.
Make sure WaitLatchOrSocket regards FD_CLOSE as a read-ready condition.
We might want to tweak this further, but it was surely wrong as-is.

Make pgwin32_waitforsinglesocket detach its private event object from the
passed socket before returning.  I suspect that failure to do so leads
to race conditions when other code (such as WaitLatchOrSocket) attaches
a different event object to the same socket.  Moreover, the existing
coding meant that repeated calls to pgwin32_waitforsinglesocket would
perform ResetEvent on an event actively connected to a socket, which
is rumored to be an unsafe practice; the WSAEventSelect documentation
appears to recommend against this, though it does not say not to do it
in so many words.

Also, uniformly use the coding pattern "WSAEventSelect(s, NULL, 0)" to
detach events from sockets, rather than passing the event in the second
parameter.  The WSAEventSelect documentation says that the second parameter
is ignored if the third is 0, so theoretically this should make no
difference.  However, elsewhere on the same reference page the use of NULL
in this context is recommended, and I have found suggestions on the net
that some versions of Windows have bugs with a non-NULL second parameter
in this usage.

Some other mostly-cosmetic cleanup, such as using the right one of
WSAGetLastError and GetLastError for reporting errors from these functions.
2012-05-13 14:35:40 -04:00
Tom Lane fd350ef395 Fix bogus declaration of local variable.
rc should be an int here, not a pgsocket.  Fairly harmless as long as
pgsocket is an integer type, but nonetheless wrong.  Error introduced
in commit 87091cb1f1.
2012-05-13 00:30:32 -04:00
Tom Lane 398b240151 Avoid unnecessary process wakeups in the log collector.
syslogger was coded to wake up once per second whether there was anything
useful to do or not.  As part of our campaign to reduce the server's idle
power consumption, change it to use a latch for waiting.  Now, in the
absence of any data to log or any signals to service, it will only wake up
at the programmed logfile rotation times (if any).
2012-05-12 19:21:54 -04:00
Tom Lane 31ad655364 Fix WaitLatchOrSocket to handle EOF on socket correctly.
When using poll(), EOF on a socket is reported with the POLLHUP not
POLLIN flag (at least on Linux).  WaitLatchOrSocket failed to check
this bit, causing it to go into a busy-wait loop if EOF occurs.
We earlier fixed the same mistake in the test for the state of the
postmaster_alive socket, but missed it for the caller-supplied socket.
Fortunately, this error is new in 9.2, since 9.1 only had a select()
based code path not a poll() based one.
2012-05-12 16:36:47 -04:00
Simon Riggs 867540b49c Ensure backwards compatibility for GetStableLatestTransactionId() 2012-05-12 13:26:10 +01:00
Peter Eisentraut afe86a9e73 Fix obsolescent C declaration syntax
gcc -Wextra/-Wold-style-declaration thinks that "inline" should go
before the function return type.
2012-05-12 12:52:02 +03:00
Tom Lane d0c231d132 Cosmetic adjustments for postmaster's handling of checkpointer.
Correct some comments, order some operations a bit more consistently.
No functional changes.
2012-05-11 17:46:37 -04:00
Robert Haas 1331cc6c1a Prevent loss of init fork when truncating an unlogged table.
Fixes bug #6635, reported by Akira Kurosawa.
2012-05-11 09:48:56 -04:00
Simon Riggs b762e8f50b Remove extraneous #include "storage/proc.h" 2012-05-11 14:46:46 +01:00
Simon Riggs b06679e012 Ensure age() returns a stable value rather than the latest value 2012-05-11 14:36:24 +01:00
Heikki Linnakangas 3652d72dd4 On GiST page split, release the locks on child pages before recursing up.
When inserting the downlinks for a split gist page, we used hold the locks
on the child pages until the insertion into the parent - and recursively its
parent if it had to be split too - were all completed. Change that so that
the locks on child pages are released after the insertion in the immediate
parent is done, before recursing further up the tree.

This reduces the number of lwlocks that are held simultaneously. Holding
many locks is bad for concurrency, and in extreme cases you can even hit
the limit of 100 simultaneously held lwlocks in a backend. If you're really
unlucky, you can hit the limit while in a critical section, which brings
down the whole system.

This fixes bug #6629 reported by Tom Forbes. Backpatch to 9.1. The page
splitting code was rewritten in 9.1, and the old code did not have this
problem.
2012-05-11 12:35:28 +03:00
Tom Lane cb2f2873d6 Temporarily revert stats collector latch changes so we can ship beta1.
This patch reverts commit 49340037ee and some
follow-on tweaking in pgstat.c.  While the basic scheme of latch-ifying the
stats collector seems sound enough, it's failing on most Windows buildfarm
members for unknown reasons, and there's no time left to debug that before
9.2beta1.  Better to ship a beta version without this improvement.  I hope
to re-revert this once beta1 is out, though.
2012-05-10 17:26:33 -04:00
Tom Lane f40022f1ad Make WaitLatch's WL_POSTMASTER_DEATH result trustworthy; simplify callers.
Per a suggestion from Peter Geoghegan, make WaitLatch responsible for
verifying that the WL_POSTMASTER_DEATH bit it returns is truthful (by
testing PostmasterIsAlive).  Then simplify its callers, who no longer
need to do that for themselves.  Remove weasel wording about falsely-set
result bits from WaitLatch's API contract.
2012-05-10 14:34:53 -04:00
Tom Lane ada8fa08fc Fix Windows implementation of PGSemaphoreLock.
The original coding failed to reset ImmediateInterruptOK before returning,
which would potentially allow a subsequent query-cancel interrupt to be
accepted at an unsafe point.  This is a really nasty bug since it's so hard
to predict the consequences, but they could be unpleasant.

Also, ensure that signal handlers are serviced before this function
returns, even if the semaphore is already set.  This should make the
behavior more like Unix.

Back-patch to all supported versions.
2012-05-10 13:36:14 -04:00
Tom Lane 8ebc908c57 Improve Windows implementation of WaitLatch/WaitLatchOrSocket.
Ensure that signal handlers are serviced before this function returns.
This should make the behavior more like Unix.  Also, add some more
error checking, and make some other cosmetic improvements.

No back-patch since it's not clear whether this is fixing any live bug
that would affect 9.1.  I'm more concerned about 9.2 anyway given our
considerable recent expansions in the usage of WaitLatch.
2012-05-10 13:26:47 -04:00
Tom Lane fd71421b01 Improve tests for postmaster death in auxiliary processes.
In checkpointer and walwriter, avoid calling PostmasterIsAlive unless
WaitLatch has reported WL_POSTMASTER_DEATH.  This saves a kernel call per
iteration of the process's outer loop, which is not all that much, but a
cycle shaved is a cycle earned.  I had already removed the unconditional
PostmasterIsAlive calls in bgwriter and pgstat in previous patches, but
forgot that WL_POSTMASTER_DEATH is supposed to be treated as untrustworthy
(per comment in unix_latch.c); so adjust those two cases to match.

There are a few other places where the same idea might be applied, but only
after substantial code rearrangement, so I didn't bother.
2012-05-10 00:54:56 -04:00
Tom Lane d3ae406f54 Further tweaking of nomenclature in checkpointer.c.
Get rid of some more naming choices that only make sense if you know that
this code used to be in the bgwriter, as well as some stray comments
referencing the bgwriter.
2012-05-10 00:01:10 -04:00
Tom Lane 6308ba05a7 Improve control logic for bgwriter hibernation mode.
Commit 6d90eaaa89 added a hibernation mode
to the bgwriter to reduce the server's idle-power consumption.  However,
its interaction with the detailed behavior of BgBufferSync's feedback
control loop wasn't very well thought out.  That control loop depends
primarily on the rate of buffer allocation, not the rate of buffer
dirtying, so the hibernation mode has to be designed to operate only when
no new buffer allocations are happening.  Also, the check for whether the
system is effectively idle was not quite right and would fail to detect
a constant low level of activity, thus allowing the bgwriter to go into
hibernation mode in a way that would let the cycle time vary quite a bit,
possibly further confusing the feedback loop.  To fix, move the wakeup
support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into
StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode
unless no buffer allocations have happened recently.

In addition, fix the delaying logic to remove the problem of possibly not
responding to signals promptly, which was basically caused by trying to use
the process latch's is_set flag for multiple purposes.  I can't prove it
but I'm suspicious that that hack was responsible for the intermittent
"postmaster does not shut down" failures we've been seeing in the buildfarm
lately.  In any case it did nothing to improve the readability or
robustness of the code.

In passing, express the hibernation sleep time as a multiplier on
BgWriterDelay, not a constant.  I'm not sure whether there's any value in
exposing the longer sleep time as an independently configurable setting,
but we can at least make it act like this for little extra code.
2012-05-09 23:37:10 -04:00
Peter Eisentraut 5d39807a00 Add make dependency so that postgres.bki is rebuilt in major version change
Every time since the current rule for postgres.bki was put in place
when we change the major version, people complain that their tests
fail in strange ways.  This is because the version number in
postgres.bki is not updated, because it has no dependency for that.
And you can't even force the rebuild manually if you don't happen to
know which file has the problem.  Fix that now before it will happen
again.

The only remaining problem with switching major versions, as far as
the regression tests are concerned, is that contrib needs to be
rebuilt.  But that's easily invoked, and in any case the failure modes
are more friendly if you forget that.
2012-05-09 20:45:56 +03:00
Simon Riggs 8f28789bff Rename BgWriterShmem/Request to CheckpointerShmem/Request 2012-05-09 14:23:45 +01:00
Simon Riggs bbd3ec9dce Rename BgWriterCommLock to CheckpointerCommLock 2012-05-09 14:11:48 +01:00
Simon Riggs 5829387381 Avoid xid error from age() function when run on Hot Standby 2012-05-09 13:56:24 +01:00
Tom Lane acd4c7d58b Fix an issue in recent walwriter hibernation patch.
Users of asynchronous-commit mode expect there to be a guaranteed maximum
delay before an async commit's WAL records get flushed to disk.  The
original version of the walwriter hibernation patch broke that.  Add an
extra shared-memory flag to allow async commits to kick the walwriter out
of hibernation mode, without adding any noticeable overhead in cases where
no action is needed.
2012-05-08 23:06:40 -04:00
Tom Lane 49340037ee Reduce idle power consumption of stats collector process.
Latch-ify the stats collector, so that it does not need an arbitrary wakeup
cycle to check for postmaster death.  The incremental savings in idle power
is pretty marginal, since we only had it waking every two seconds; but I
believe that this patch may also improve the collector's performance under
load, by reducing the number of kernel calls made per message when messages
are arriving constantly (we now avoid a select/poll call except when we
need to sleep).  The change also reduces the time needed for a normal
database shutdown on platforms where signals don't interrupt select().
2012-05-08 21:26:46 -04:00
Tom Lane 5461564a9d Reduce idle power consumption of walwriter and checkpointer processes.
This patch modifies the walwriter process so that, when it has not found
anything useful to do for many consecutive wakeup cycles, it extends its
sleep time to reduce the server's idle power consumption.  It reverts to
normal as soon as it's done any successful flushes.  It's still true that
during any async commit, backends check for completed, unflushed pages of
WAL and signal the walwriter if there are any; so that in practice the
walwriter can get awakened and returned to normal operation sooner than the
sleep time might suggest.

Also, improve the checkpointer so that it uses a latch and a computed delay
time to not wake up at all except when it has something to do, replacing a
previous hardcoded 0.5 sec wakeup cycle.  This also is primarily useful for
reducing the server's power consumption when idle.

In passing, get rid of the dedicated latch for signaling the walwriter in
favor of using its procLatch, since that comports better with possible
generic signal handlers using that latch.  Also, fix a pre-existing bug
with failure to save/restore errno in walwriter's signal handlers.

Peter Geoghegan, somewhat simplified by Tom
2012-05-08 20:03:26 -04:00
Magnus Hagander 916d589a10 Make "unexpected EOF" messages DEBUG1 unless in an open transaction
"Unexpected EOF on client connection" without an open transaction
is mostly noise, so turn it into DEBUG1. With an open transaction it's
still indicating a problem, so keep those as ERROR, and change the message
to indicate that it happened in a transaction.
2012-05-07 18:50:44 +02:00
Tom Lane 71b9549d05 Overdue code review for transaction-level advisory locks patch.
Commit 62c7bd31c8 had assorted problems, most
visibly that it broke PREPARE TRANSACTION in the presence of session-level
advisory locks (which should be ignored by PREPARE), as per a recent
complaint from Stephen Rees.  More abstractly, the patch made the
LockMethodData.transactional flag not merely useless but outright
dangerous, because in point of fact that flag no longer tells you anything
at all about whether a lock is held transactionally.  This fix therefore
removes that flag altogether.  We now rely entirely on the convention
already in use in lock.c that transactional lock holds must be owned by
some ResourceOwner, while session holds are never so owned.  Setting the
locallock struct's owner link to NULL thus denotes a session hold, and
there is no redundant marker for that.

PREPARE TRANSACTION now works again when there are session-level advisory
locks, and it is also able to transfer transactional advisory locks to the
prepared transaction, but for implementation reasons it throws an error if
we hold both types of lock on a single lockable object.  Perhaps it will be
worth improving that someday.

Assorted other minor cleanup and documentation editing, as well.

Back-patch to 9.1, except that in the 9.1 branch I did not remove the
LockMethodData.transactional flag for fear of causing an ABI break for
any external code that might be examining those structs.
2012-05-04 17:44:31 -04:00
Bruce Momjian ebcaa5fcde Remove BSD/OS (BSDi) port. There are no known users upgrading to
Postgres 9.2, and perhaps no existing users either.
2012-05-03 10:58:44 -04:00
Peter Eisentraut e9605a039b Even more duplicate word removal, in the spirit of the season 2012-05-02 20:56:03 +03:00
Robert Haas 0038110421 Avoid repeated CLOG access from heap_hot_search_buffer.
At the time we check whether the tuple is dead to all running
transactions, we've already verified that it isn't visible to our
scan, setting hint bits if appropriate.  So there's no need to
recheck CLOG for the all-dead test we do just a moment later.
So, add HeapTupleIsSurelyDead() to test the appropriate condition
under the assumption that all relevant hit bits are already set.

Review by Tom Lane.
2012-05-02 12:40:07 -04:00
Robert Haas 1b4998fd44 Further corrections from the department of redundancy department.
Thom Brown
2012-05-02 11:11:25 -04:00
Robert Haas e01e66f808 More duplicate word removal. 2012-05-02 09:28:16 -04:00
Heikki Linnakangas f291ccd43e Remove duplicate words in comments.
Found these with grep -r "for for ".
2012-05-02 10:20:27 +03:00
Tom Lane 50c2d6a1a6 Kill some remaining references to SVR4 and univel.
Both terms still appear in a few places, but I thought it best to leave
those alone in context.
2012-05-02 00:29:17 -04:00
Peter Eisentraut f2f9439fbf Remove dead ports
Remove the following ports:

- dgux
- nextstep
- sunos4
- svr4
- ultrix4
- univel

These are obsolete and not worth rescuing.  In most cases, there is
circumstantial evidence that they wouldn't work anymore anyway.
2012-05-01 22:11:12 +03:00
Tom Lane 809e7e21af Converge all SQL-level statistics timing values to float8 milliseconds.
This patch adjusts the core statistics views to match the decision already
taken for pg_stat_statements, that values representing elapsed time should
be represented as float8 and measured in milliseconds.  By using float8,
we are no longer tied to a specific maximum precision of timing data.
(Internally, it's still microseconds, but we could now change that without
needing changes at the SQL level.)

The columns affected are
pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time
pg_stat_xact_user_functions.total_time
pg_stat_xact_user_functions.self_time

The first four of these are new in 9.2, so there is no compatibility issue
from changing them.  The others require a release note comment that they
are now double precision (and can show a fractional part) rather than
bigint as before; also their underlying statistics functions now match
the column definitions, instead of returning bigint microseconds.
2012-04-30 14:03:33 -04:00
Robert Haas 0d2235a25b Remove duplicate word in comment.
Noted by Peter Geoghegan.
2012-04-30 13:14:46 -04:00
Tom Lane 1dd89eadcd Rename I/O timing statistics columns to blk_read_time and blk_write_time.
This seems more consistent with the pre-existing choices for names of
other statistics columns.  Rename assorted internal identifiers to match.
2012-04-29 18:13:33 -04:00
Tom Lane 309c64745e Rename track_iotiming GUC to track_io_timing.
This spelling seems significantly more readable to me.
2012-04-29 16:23:54 -04:00
Peter Eisentraut 81107282a5 Change return type of ExceptionalCondition to void and mark it noreturn
In ancient times, it was thought that this wouldn't work because of
TrapMacro/AssertMacro, but changing those to use a comma operator
appears to work without compiler warnings.
2012-04-29 21:20:14 +03:00
Tom Lane cdbad241f4 Clear I/O timing counters after sending them to the stats collector.
This oversight caused the reported times to accumulate in an O(N^2)
fashion the longer a backend runs.
2012-04-28 15:11:13 -04:00
Tom Lane d6f7d4fdc5 Fix printing of whole-row Vars at top level of a SELECT targetlist.
Normally whole-row Vars are printed as "tabname.*".  However, that does not
work at top level of a targetlist, because per SQL standard the parser will
think that the "*" should result in column-by-column expansion; which is
not at all what a whole-row Var implies.  We used to just print the table
name in such cases, which works most of the time; but it fails if the table
name matches a column name available anywhere in the FROM clause.  This
could lead for instance to a view being interpreted differently after dump
and reload.  Adding parentheses doesn't fix it, but there is a reasonably
simple kluge we can use instead: attach a no-op cast, so that the "*" isn't
syntactically at top level anymore.  This makes the printing of such
whole-row Vars a lot more consistent with other Vars, and may indeed fix
more cases than just the reported one; I'm suspicious that cases involving
schema qualification probably didn't work properly before, either.

Per bug report and fix proposal from Abbas Butt, though this patch is quite
different in detail from his.

Back-patch to all supported versions.
2012-04-27 19:49:18 -04:00
Tom Lane 537b266953 Fix syslogger's rotation disable/re-enable logic.
If it fails to open a new log file, the syslogger assumes there's something
wrong with its parameters (such as log_directory), and stops attempting
automatic time-based or size-based log file rotations.  Sending it SIGHUP
is supposed to start that up again.  However, the original coding for that
was really bogus, involving clobbering a couple of GUC variables and hoping
that SIGHUP processing would restore them.  Get rid of that technique in
favor of maintaining a separate flag showing we've turned rotation off.
Per report from Mark Kirkwood.

Also, the syslogger will automatically attempt to create the log_directory
directory if it doesn't exist, but that was only happening at startup.
For consistency and ease of use, it should do the same whenever the value
of log_directory is changed by SIGHUP.

Back-patch to all supported branches.
2012-04-27 00:12:42 -04:00
Robert Haas 3424bff90f Prevent index-only scans from returning wrong answers under Hot Standby.
The alternative of disallowing index-only scans in HS operation was
discussed, but the consensus was that it was better to treat marking
a page all-visible as a recovery conflict for snapshots that could still
fail to see XIDs on that page.  We may in the future try to soften this,
so that we simply force index scans to do heap fetches in cases where
this may be an issue, rather than throwing a hard conflict.
2012-04-26 20:00:21 -04:00
Tom Lane 7c85aa39fc Fix oversight in recent parameterized-path patch.
bitmap_scan_cost_est() has to be able to cope with a BitmapOrPath, but
I'd taken a shortcut that didn't work for that case.  Noted by Heikki.
Add some regression tests since this area is evidently under-covered.
2012-04-26 14:17:44 -04:00
Tom Lane 9fa82c9809 Fix planner's handling of RETURNING lists in writable CTEs.
setrefs.c failed to do "rtoffset" adjustment of Vars in RETURNING lists,
which meant they were left with the wrong varnos when the RETURNING list
was in a subquery.  That was never possible before writable CTEs, of
course, but now it's broken.  The executor fails to notice any problem
because ExecEvalVar just references the ecxt_scantuple for any normal
varno; but EXPLAIN breaks when the varno is wrong, as illustrated in a
recent complaint from Bartosz Dmytrak.

Since the eventual rtoffset of the subquery is not known at the time
we are preparing its plan node, the previous scheme of executing
set_returning_clause_references() at that time cannot handle this
adjustment.  Fortunately, it turns out that we don't really need to do it
that way, because all the needed information is available during normal
setrefs.c execution; we just have to dig it out of the ModifyTable node.
So, do that, and get rid of the kluge of early setrefs processing of
RETURNING lists.  (This is a little bit of a cheat in the case of inherited
UPDATE/DELETE, because we are not passing a "root" struct that corresponds
exactly to what the subplan was built with.  But that doesn't matter, and
anyway this is less ugly than early setrefs processing was.)

Back-patch to 9.1, where the problem became possible to hit.
2012-04-25 20:20:33 -04:00
Tom Lane 9873001e6d Another trivial comment-typo fix. 2012-04-25 14:28:58 -04:00
Robert Haas 3ce7f18e92 Casts to or from a domain type are ignored; warn and document.
Prohibiting this outright would break dumps taken from older versions
that contain such casts, which would create far more pain than is
justified here.

Per report by Jaime Casanova and subsequent discussion.
2012-04-24 09:20:53 -04:00
Robert Haas 5d4b60f2f2 Lots of doc corrections.
Josh Kupershmidt
2012-04-23 22:43:09 -04:00
Robert Haas 7ab9b2f3b7 Rearrange lazy_scan_heap to avoid visibility map race conditions.
We must set the visibility map bit before releasing our exclusive lock
on the heap page; otherwise, someone might clear the heap page bit
before we set the visibility map bit, leading to a situation where the
visibility map thinks the page is all-visible but it's really not.

This problem has existed since 8.4, but it wasn't critical before we
had index-only scans, since the worst case scenario was that the page
wouldn't get vacuumed until the next scan_all vacuum.

Along the way, a couple of minor, related improvements: (1) if we
pause the heap scan to do an index vac cycle, release any visibility
map page we're holding, since really long-running pins are not good
for a variety of reasons; and (2) warn if we see a page that's marked
all-visible in the visibility map but not on the page level, since
that should never happen any more (it was allowed in previous
releases, but not in 9.2).
2012-04-23 22:08:06 -04:00
Robert Haas 85efd5f065 Reduce hash size for compute_array_stats, compute_tsvector_stats.
The size is only a hint, but a big hint chews up a lot of memory without
apparently improving performance much.

Analysis and patch by Noah Misch.
2012-04-23 22:05:41 -04:00
Peter Eisentraut 48658a1b81 Fix some typos
Josh Kupershmidt
2012-04-22 19:23:47 +03:00
Tom Lane 33e99153e9 Use fuzzy not exact cost comparison for the final tie-breaker in add_path.
Instead of an exact cost comparison, use a fuzzy comparison with 1e-10
delta after all other path metrics have proved equal.  This is to avoid
having platform-specific roundoff behaviors determine the choice when
two paths are really the same to our cost estimators.  Adjust the
recently-added test case that made it obvious we had a problem here.
2012-04-21 00:51:14 -04:00
Alvaro Herrera 09ff76fcdb Recast "ONLY" column CHECK constraints as NO INHERIT
The original syntax wasn't universally loved, and it didn't allow its
usage in CREATE TABLE, only ALTER TABLE.  It now works everywhere, and
it also allows using ALTER TABLE ONLY to add an uninherited CHECK
constraint, per discussion.

The pg_constraint column has accordingly been renamed connoinherit.

This commit partly reverts some of the changes in
61d81bd28d, particularly some pg_dump and
psql bits, because now pg_get_constraintdef includes the necessary NO
INHERIT within the constraint definition.

Author: Nikhil Sontakke
Some tweaks by me
2012-04-20 23:56:57 -03:00
Tom Lane 1f03630011 Adjust join_search_one_level's handling of clauseless joins.
For an initial relation that lacks any join clauses (that is, it has to be
cartesian-product-joined to the rest of the query), we considered only
cartesian joins with initial rels appearing later in the initial-relations
list.  This creates an undesirable dependency on FROM-list order.  We would
never fail to find a plan, but perhaps we might not find the best available
plan.  Noted while discussing the logic with Amit Kapila.

Improve the comments a bit in this area, too.

Arguably this is a bug fix, but given the lack of complaints from the
field I'll refrain from back-patching.
2012-04-20 20:10:46 -04:00
Tom Lane 5b7b5518d0 Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate.  We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.

In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage.  This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.

To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing.  This is required at both base scans and joins.  It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree.  Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 15:53:47 -04:00
Robert Haas 293ec33c32 Remove bogus comment from HeapTupleSatisfiesNow.
This has been wrong for a really long time.  We don't use two-phase
locking to protect against serialization anomalies.

Per discussion on pgsql-hackers about 2011-03-07; original report
by Dan Ports.
2012-04-18 11:50:45 -04:00
Robert Haas 4a6fab03f2 Finish rename of FastPathStrongLocks to FastPathStrongRelationLocks.
Commit 8e5ac74c12 tried to do this renaming,
but I relied on gcc to tell me where I needed to make changes, instead of
grep.

Noted by Jeff Davis.
2012-04-18 11:29:34 -04:00
Robert Haas 53c5b869b4 Tighten up error recovery for fast-path locking.
The previous code could cause a backend crash after BEGIN; SAVEPOINT a;
LOCK TABLE foo (interrupted by ^C or statement timeout); ROLLBACK TO
SAVEPOINT a; LOCK TABLE foo, and might have leaked strong-lock counts
in other situations.

Report by Zoltán Böszörményi; patch review by Jeff Davis.
2012-04-18 11:17:30 -04:00
Robert Haas ab77b2da8b Fix incorrect comment in SetBufferCommitInfoNeedsSave().
Noah Misch spotted the fact that the old comment is in fact incorrect, due
to memory ordering hazards.
2012-04-18 10:55:40 -04:00
Robert Haas e93c0b820f After PageSetAllVisible, use MarkBufferDirty.
Previously, we used SetBufferCommitInfoNeedsSave, but that's really
intended for dirty-marks we can theoretically afford to lose, such as
hint bits.  As for 9.2, the PD_ALL_VISIBLE mustn't be lost in this
way, since we could then end up with a heap page that isn't
all-visible and a visibility map page that is all visible, causing
index-only scans to return wrong answers.
2012-04-18 10:49:37 -04:00
Robert Haas b5eccaef2c Fix copyfuncs/equalfuncs support for ReassignOwnedStmt.
Noah Misch
2012-04-18 10:45:18 -04:00
Robert Haas 53bbc681ca Fix various infelicities in node functions.
Mostly, this consists of adding support for fields which exist in the
structure but aren't handled by copy/equal/outfuncs; but the create
foreign table case can actually produce garbage output.

Noah Misch
2012-04-18 10:43:16 -04:00
Heikki Linnakangas fe546f3da6 Don't wait for the commit record to be replicated if we wrote no WAL.
When using synchronous replication, we waited for the commit record to be
replicated, but if we our transaction didn't write any other WAL records,
that's not required because we don't even flush the WAL locally to disk in
that case. This lead to long waits when committing a transaction that only
modified a temporary table. Bug spotted by Thom Brown.
2012-04-17 16:28:31 +03:00
Peter Eisentraut a33fcd7e79 Fix typo
Kyotaro HORIGUCHI
2012-04-16 15:36:40 +03:00
Robert Haas ea6a2d8d47 Rename synchronous_commit='write' to 'remote_write'.
Fujii Masao, per discussion on pgsql-hackers
2012-04-14 10:53:22 -04:00
Robert Haas 4a2d7ad76f pg_size_pretty(numeric)
The output of the new pg_xlog_location_diff function is of type numeric,
since it could theoretically overflow an int8 due to signedness; this
provides a convenient way to format such values.

Fujii Masao, with some beautification by me.
2012-04-14 08:07:25 -04:00
Tom Lane e54b10a62d Remove the "last ditch" code path in join_search_one_level().
So far as I can tell, it is no longer possible for this heuristic to do
anything useful, because the new weaker definition of
have_relevant_joinclause means that any relation with a joinclause must be
considered joinable to at least one other relation.  It would still be
possible for the code block to be entered, for example if there are join
order restrictions that prevent any join of the current level from being
formed; but in that case it's just a waste of cycles to attempt to form
cartesian joins, since the restrictions will still apply.

Furthermore, IMO the existence of this code path can mask bugs elsewhere;
we would have noticed the problem with cartesian joins a lot sooner if
this code hadn't compensated for it in the simplest case.

Accordingly, let's remove it and see what happens.  I'm committing this
separately from the prerequisite changes in have_relevant_joinclause,
just to make the question easier to revisit if there is some fault in
my logic.
2012-04-13 16:07:18 -04:00
Tom Lane e3ffd05b02 Weaken the planner's tests for relevant joinclauses.
We should be willing to cross-join two small relations if that allows us
to use an inner indexscan on a large relation (that is, the potential
indexqual for the large table requires both smaller relations).  This
worked in simple cases but fell apart as soon as there was a join clause
to a fourth relation, because the existence of any two-relation join clause
caused the planner to not consider clauseless joins between other base
relations.  The added regression test shows an example case adapted from
a recent complaint from Benoit Delbosc.

Adjust have_relevant_joinclause, have_relevant_eclass_joinclause, and
has_relevant_eclass_joinclause to consider that a join clause mentioning
three or more relations is sufficient grounds for joining any subset of
those relations, even if we have to do so via a cartesian join.  Since such
clauses are relatively uncommon, this shouldn't affect planning speed on
typical queries; in fact it should help a bit, because the latter two
functions in particular get significantly simpler.

Although this is arguably a bug fix, I'm not going to risk back-patching
it, since it might have currently-unforeseen consequences.
2012-04-13 16:07:17 -04:00
Peter Eisentraut c0cc526e8b Rename bytea_agg to string_agg and add delimiter argument
Per mailing list discussion, we would like to keep the bytea functions
parallel to the text functions, so rename bytea_agg to string_agg,
which already exists for text.

Also, to satisfy the rule that we don't want aggregate functions of
the same name with a different number of arguments, add a delimiter
argument, just like string_agg for text already has.
2012-04-13 21:36:59 +03:00
Peter Eisentraut 64e1309c76 Consistently quote encoding and locale names in messages 2012-04-13 20:37:07 +03:00
Robert Haas 61167bfaf2 Fix typo in comment. 2012-04-13 08:54:13 -04:00
Robert Haas 5630eddf1e Update lazy_scan_heap header comment.
The previous comment described how things worked in PostgreSQL 8.2
and prior.
2012-04-13 08:51:19 -04:00
Tom Lane 732bfa2448 Fix cost estimation for indexscan filter conditions.
cost_index's method for estimating per-tuple costs of evaluating filter
conditions (a/k/a qpquals) was completely wrong in the presence of derived
indexable conditions, such as range conditions derived from a LIKE clause.
This was largely masked in common cases as a result of all simple operator
clauses having about the same costs, but it could show up in a big way when
dealing with functional indexes containing expensive functions, as seen for
example in bug #6579 from Istvan Endredy.  Rejigger the calculation to give
sane answers when the indexquals aren't a subset of the baserestrictinfo
list.  As a side benefit, we now do the calculation properly for cases
involving join clauses (ie, parameterized indexscans), which we always
overestimated before.

There are still cases where this is an oversimplification, such as clauses
that can be dropped because they are implied by a partial index's
predicate.  But we've never accounted for that in cost estimates before,
and I'm not convinced it's worth the cycles to try to do so.
2012-04-11 20:24:17 -04:00
Tom Lane 880bfc3287 Silently ignore any nonexistent schemas that are listed in search_path.
Previously we attempted to throw an error or at least warning for missing
schemas, but this was done inconsistently because of implementation
restrictions (in many cases, GUC settings are applied outside transactions
so that we can't do system catalog lookups).  Furthermore, there were
exceptions to the rule even in the beginning, and we'd been poking more
and more holes in it as time went on, because it turns out that there are
lots of use-cases for having some irrelevant items in a common search_path
value.  It seems better to just adopt a philosophy similar to what's always
been done with Unix PATH settings, wherein nonexistent or unreadable
directories are silently ignored.

This commit also fixes the documentation to point out that schemas for
which the user lacks USAGE privilege are silently ignored.  That's always
been true but was previously not documented.

This is mostly in response to Robert Haas' complaint that 9.1 started to
throw errors or warnings for missing schemas in cases where prior releases
had not.  We won't adopt such a significant behavioral change in a back
branch, so something different will be needed in 9.1.
2012-04-11 12:02:50 -04:00
Tom Lane 3769fa5fc6 Make pg_tablespace_location(0) return the database's default tablespace.
This definition is convenient when applying the function to the
reltablespace column of pg_class, since that's what zero means there;
and it doesn't interfere with any other plausible use of the function.
Per gripe from Bruce Momjian.
2012-04-10 21:43:14 -04:00
Tom Lane 0d9819f7e3 Measure epoch of timestamp-without-time-zone from local not UTC midnight.
This patch reverts commit 191ef2b407
and thereby restores the pre-7.3 behavior of EXTRACT(EPOCH FROM
timestamp-without-tz).  Per discussion, the more recent behavior was
misguided on a couple of grounds: it makes it hard to get a
non-timezone-aware epoch value for a timestamp, and it makes this one
case dependent on the value of the timezone GUC, which is incompatible
with having timestamp_part() labeled as immutable.

The other behavior is still available (in all releases) by explicitly
casting the timestamp to timestamp with time zone before applying EXTRACT.

This will need to be called out as an incompatible change in the 9.2
release notes.  Although having mutable behavior in a function marked
immutable is clearly a bug, we're not going to back-patch such a change.
2012-04-10 12:04:42 -04:00
Tom Lane 65fd91333e Fix an Assert that turns out to be reachable after all.
estimate_num_groups() gets unhappy with
	create table empty();
	select * from empty except select * from empty e2;
I can't see any actual use-case for such a query (and the table is illegal
per SQL spec), but it seems like a good idea that it not cause an assert
failure.
2012-04-09 11:58:24 -04:00
Tom Lane d515365a61 Don't bother copying empty support arrays in a zero-column MergeJoin.
The case could not arise when this code was originally written, but it can
now (since we made zero-column MergeJoins work for the benefit of FULL JOIN
ON TRUE).  I don't think there is any actual bug here, but we might as well
treat it consistently with other uses of COPY_POINTER_FIELD().  Per comment
from Ashutosh Bapat.
2012-04-09 11:41:54 -04:00
Robert Haas 3ae5133b1c Teach SLRU code to avoid replacing I/O-busy pages.
Patch by me; review by Tom Lane and others.
2012-04-08 23:05:55 -04:00
Heikki Linnakangas 03529a3ff9 set_stack_base() no longer needs to be called in PostgresMain.
This was a thinko in previous commit. Now that stack base pointer is now set
in PostmasterMain and SubPostmasterMain, it doesn't need to be set in
PostgresMain anymore.
2012-04-08 19:39:12 +03:00
Heikki Linnakangas ef3883d130 Do stack-depth checking in all postmaster children.
We used to only initialize the stack base pointer when starting up a regular
backend, not in other processes. In particular, autovacuum workers can run
arbitrary user code, and without stack-depth checking, infinite recursion
in e.g an index expression will bring down the whole cluster.

The comment about PL/Java using set_stack_base() is not yet true. As the
code stands, PL/java still modifies the stack_base_ptr variable directly.
However, it's been discussed in the PL/Java mailing list that it should be
changed to use the function, because PL/Java is currently oblivious to the
register stack used on Itanium. There's another issues with PL/Java, namely
that the stack base pointer it sets is not really the base of the stack, it
could be something close to the bottom of the stack. That's a separate issue
that might need some further changes to this code, but that's a different
story.

Backpatch to all supported releases.
2012-04-08 19:07:55 +03:00
Tom Lane 7feecedcce Fix incorrect make maintainer-clean rule. 2012-04-07 18:16:50 -04:00
Tom Lane 95b9c333b2 Further adjustment of comment about qsort_tuple. 2012-04-07 17:48:40 -04:00
Tom Lane a25ef7a5f6 Remove useless variable to suppress compiler warning. 2012-04-07 16:44:43 -04:00
Tom Lane 0ab4db52c0 Fix misleading output from gin_desc().
XLOG_GIN_UPDATE_META_PAGE and XLOG_GIN_DELETE_LISTPAGE records were printed
with a list link field labeled as "blkno", which was confusing, especially
when the link was empty (InvalidBlockNumber).  Print the metapage block
number instead, since that's what's actually being updated.  We could
include the link values too as a separate field, but not clear it's worth
the trouble.

Back-patch to 8.4 where the dubious code was added.
2012-04-06 18:10:21 -04:00
Tom Lane 17b985b1a0 Fix broken comparetup_datum code.
Commit 337b6f5ecf contained the entirely
fanciful assumption that it had made comparetup_datum unreachable.
Reported and patched by Takashi Yamamoto.

Fix up some not terribly accurate/useful comments from that commit, too.
2012-04-06 16:58:50 -04:00
Tom Lane cea49fe82f Dept of second thoughts: improve the API for AnalyzeForeignTable.
If we make the initially-called function return the table physical-size
estimate, acquire_inherited_sample_rows will be able to use that to
allocate numbers of samples among child tables, when the day comes that
we want to support foreign tables in inheritance trees.
2012-04-06 16:04:10 -04:00
Tom Lane 263d9de66b Allow statistics to be collected for foreign tables.
ANALYZE now accepts foreign tables and allows the table's FDW to control
how the sample rows are collected.  (But only manual ANALYZEs will touch
foreign tables, for the moment, since among other things it's not very
clear how to handle remote permissions checks in an auto-analyze.)

contrib/file_fdw is extended to support this.

Etsuro Fujita, reviewed by Shigeru Hanada, some further tweaking by me.
2012-04-06 15:02:35 -04:00
Simon Riggs 8cb53654db Add DROP INDEX CONCURRENTLY [IF EXISTS], uses ShareUpdateExclusiveLock 2012-04-06 10:21:40 +01:00
Robert Haas 21cc529698 checkopint -> checkpoint
Report by Guillaume Lelarge.
2012-04-05 21:37:33 -04:00
Robert Haas b736aef2ec Publish checkpoint timing information to pg_stat_bgwriter.
Greg Smith, Peter Geoghegan, and Robert Haas
2012-04-05 14:04:37 -04:00
Robert Haas 644828908f Expose track_iotiming data via the statistics collector.
Ants Aasma's original patch to add timing information for buffer I/O
requests exposed this data at the relation level, which was judged too
costly.  I've here exposed it at the database level instead.
2012-04-05 11:40:24 -04:00
Tom Lane c17e863bc7 Fix syslogger to not lose log coherency under high load.
The original coding of the syslogger had an arbitrary limit of 20 large
messages concurrently in progress, after which it would just punt and dump
message fragments to the output file separately.  Our ambitions are a bit
higher than that now, so allow the data structure to expand as necessary.

Reported and patched by Andrew Dunstan; some editing by Tom
2012-04-04 15:05:10 -04:00
Peter Eisentraut 38b9693fd9 Add support for renaming domain constraints 2012-04-03 08:11:51 +03:00
Simon Riggs 68219aaf6b Correct epoch of txid_current() when executed on a Hot Standby server.
Initialise ckptXidEpoch from starting checkpoint and maintain the correct
value as we roll forwards. This allows GetNextXidAndEpoch() to return the
correct epoch when executed during recovery. Backpatch to 9.0 when the
problem is first observable by a user.

Bug report from Daniel Farina
2012-03-29 14:55:30 +01:00
Andrew Dunstan aeca650226 Unbreak Windows builds broken by pgpipe removal. 2012-03-29 04:11:57 -04:00
Heikki Linnakangas 5762a4d909 Inherit max_safe_fds to child processes in EXEC_BACKEND mode.
Postmaster sets max_safe_fds by testing how many open file descriptors it
can open, and that is normally inherited by all child processes at fork().
Not so on EXEC_BACKEND, ie. Windows, however. Because of that, we
effectively ignored max_files_per_process on Windows, and always assumed
a conservative default of 32 simultaneous open files. That could have an
impact on performance, if you need to access a lot of different files
in a query. After this patch, the value is passed to child processes by
save/restore_backend_variables() among many other global variables.

It has been like this forever, but given the lack of complaints about it,
I'm not backpatching this.
2012-03-29 08:19:11 +03:00
Andrew Dunstan d2c1740dc2 Remove now redundant pgpipe code. 2012-03-28 23:24:07 -04:00
Tom Lane 5d3fcc4c2e Bend parse location rules for the convenience of pg_stat_statements.
Generally, the parse location assigned to a multiple-token construct is
the location of its leftmost token.  This commit breaks that rule for
the syntaxes TYPENAME 'LITERAL' and CAST(CONSTANT AS TYPENAME) --- the
resulting Const will have the location of the literal string, not the
typename or CAST keyword.  The cases where this matters are pretty thin on
the ground (no error messages in the regression tests change, for example),
and it's unlikely that any user would be confused anyway by an error cursor
pointing at the literal.  But still it's less than consistent.  The reason
for changing it is that contrib/pg_stat_statements wants to know the parse
location of the original literal, and it was agreed that this is the least
unpleasant way to preserve that information through parse analysis.

Peter Geoghegan
2012-03-27 15:17:41 -04:00
Tom Lane a40fa613b5 Add some infrastructure for contrib/pg_stat_statements.
Add a queryId field to Query and PlannedStmt.  This is not used by the
core backend, except for being copied around at appropriate times.
It's meant to allow plug-ins to track a particular query forward from
parse analysis to execution.

The queryId is intentionally not dumped into stored rules (and hence this
commit doesn't bump catversion).  You could argue that choice either way,
but it seems better that stored rule strings not have any dependency
on plug-ins that might or might not be present.

Also, add a post_parse_analyze_hook that gets invoked at the end of
parse analysis (but only for top-level analysis of complete queries,
not cases such as analyzing a domain's default-value expression).
This is mainly meant to be used to compute and assign a queryId,
but it could have other applications.

Peter Geoghegan
2012-03-27 15:17:40 -04:00