Commit Graph

1113 Commits

Author SHA1 Message Date
Andres Freund bbfd7edae5 Add macros wrapping all usage of gcc's __attribute__.
Until now __attribute__() was defined to be empty for all compilers but
gcc. That's problematic because it prevents using it in other compilers;
which is necessary e.g. for atomics portability.  It's also just
generally dubious to do so in a header as widely included as c.h.

Instead add pg_attribute_format_arg, pg_attribute_printf,
pg_attribute_noreturn macros which are implemented in the compilers that
understand them. Also add pg_attribute_noreturn and pg_attribute_packed,
but don't provide fallbacks, since they can affect functionality.

This means that external code that, possibly unwittingly, relied on
__attribute__ defined to be empty on !gcc compilers may now run into
warnings or errors on those compilers. But there shouldn't be many
occurances of that and it's hard to work around...

Discussion: 54B58BA3.8040302@ohmu.fi
Author: Oskari Saarenmaa, with some minor changes by me.
2015-03-11 14:30:01 +01:00
Tom Lane 09d8d110a6 Use FLEXIBLE_ARRAY_MEMBER in a bunch more places.
Replace some bogus "x[1]" declarations with "x[FLEXIBLE_ARRAY_MEMBER]".
Aside from being more self-documenting, this should help prevent bogus
warnings from static code analyzers and perhaps compiler misoptimizations.

This patch is just a down payment on eliminating the whole problem, but
it gets rid of a lot of easy-to-fix cases.

Note that the main problem with doing this is that one must no longer rely
on computing sizeof(the containing struct), since the result would be
compiler-dependent.  Instead use offsetof(struct, lastfield).  Autoconf
also warns against spelling that offsetof(struct, lastfield[0]).

Michael Paquier, review and additional fixes by me.
2015-02-20 00:11:42 -05:00
Heikki Linnakangas 33e879c4e9 Fix broken #ifdef for __sparcv8
Rob Rowan. Backpatch to all supported versions, like the patch that added
the broken #ifdef.
2015-02-13 23:56:25 +02:00
Andres Freund d06995710b Remove the option to service interrupts during PGSemaphoreLock().
The remaining caller (lwlocks) doesn't need that facility, and we plan
to remove ImmedidateInterruptOK entirely. That means that interrupts
can't be serviced race-free and portably anyway, so there's little
reason for keeping the feature.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:25:00 +01:00
Andres Freund 6753333f55 Move deadlock and other interrupt handling in proc.c out of signal handlers.
Deadlock checking was performed inside signal handlers up to
now. While it's a remarkable feat to have made this work reliably,
it's quite complex to understand why that is the case. Partially it
worked due to the assumption that semaphores are signal safe - which
is not actually documented to be the case for sysv semaphores.

The reason we had to rely on performing this work inside signal
handlers is that semaphores aren't guaranteed to be interruptable by
signals on all platforms. But now that latches provide a somewhat
similar API, which actually has the guarantee of being interruptible,
we can avoid doing so.

Signalling between ProcSleep, ProcWakeup, ProcWaitForSignal and
ProcSendSignal is now done using latches. This increases the
likelihood of spurious wakeups. As spurious wakeup already were
possible and aren't likely to be frequent enough to be an actual
problem, this seems acceptable.

This change would allow for further simplification of the deadlock
checking, now that it doesn't have to run in a signal handler. But
even if I were motivated to do so right now, it would still be better
to do that separately. Such a cleanup shouldn't have to be reviewed a
the same time as the more fundamental changes in this commit.

There is one possible usability regression due to this commit. Namely
it is more likely than before that log_lock_waits messages are output
more than once.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:24:38 +01:00
Andres Freund 4f85fde8eb Introduce and use infrastructure for interrupt processing during client reads.
Up to now large swathes of backend code ran inside signal handlers
while reading commands from the client, to allow for speedy reaction to
asynchronous events. Most prominently shared invalidation and NOTIFY
handling. That means that complex code like the starting/stopping of
transactions is run in signal handlers...  The required code was
fragile and verbose, and is likely to contain bugs.

That approach also severely limited what could be done while
communicating with the client. As the read might be from within
openssl it wasn't safely possible to trigger an error, e.g. to cancel
a backend in idle-in-transaction state. We did that in some cases,
namely fatal errors, nonetheless.

Now that FE/BE communication in the backend employs non-blocking
sockets and latches to block, we can quite simply interrupt reads from
signal handlers by setting the latch. That allows us to signal an
interrupted read, which is supposed to be retried after returning from
within the ssl library.

As signal handlers now only need to set the latch to guarantee timely
interrupt processing, remove a fair amount of complicated & fragile
code from async.c and sinval.c.

We could now actually start to process some kinds of interrupts, like
sinval ones, more often that before, but that seems better done
separately.

This work will hopefully allow to handle cases like being blocked by
sending data, interrupting idle transactions and similar to be
implemented without too much effort.  In addition to allowing getting
rid of ImmediateInterruptOK, that is.

Author: Andres Freund
Reviewed-By: Heikki Linnakangas
2015-02-03 22:25:20 +01:00
Andres Freund ed127002d8 Align buffer descriptors to cache line boundaries.
Benchmarks has shown that aligning the buffer descriptor array to
cache lines is important for scalability; especially on bigger,
multi-socket, machines.

Currently the array sometimes already happens to be aligned by
happenstance, depending how large previous shared memory allocations
were. That can lead to wildly varying performance results after minor
configuration changes.

In addition to aligning the start of descriptor array, also force the
size of individual descriptors to be of a common cache line size (64
bytes). That happens to already be the case on 64bit platforms, but
this way we can change the struct BufferDesc more easily.

As the alignment primarily matters in highly concurrent workloads
which probably all are 64bit these days, and the space wastage of
element alignment would be a bit more noticeable on 32bit systems, we
don't force the stride to be cacheline sized on 32bit platforms for
now. If somebody does actual performance testing, we can reevaluate
that decision by changing the definition of BUFFERDESC_PADDED_SIZE.

Discussion: 20140202151319.GD32123@awork2.anarazel.de

Per discussion with Bruce Momjan, Tom Lane, Robert Haas, and Peter
Geoghegan.
2015-01-29 22:48:45 +01:00
Alvaro Herrera 813ffc0ef9 reinit.h: Fix typo in identification comment
Author: Sawada Masahiko
2015-01-22 12:26:51 -03:00
Andres Freund 6cfd5086e1 Blindly try to fix a warning in s_lock.h when compiling with gcc on HPPA.
The possibly, depending on compiler settings, generated warning was
"warning: `S_UNLOCK' redefined".

The hppa spinlock implementation doesn't follow the rules of s_lock.h
and provides a gcc specific implementation outside of the the part of
the file that's supposed to do that.  It does so to avoid duplication
between the HP compiler and gcc. That unfortunately means that
S_UNLOCK is already defined when the HPPA specific section is reached.

Undefine the generic fallback S_UNLOCK definition inside the HPPA
section. That's far from pretty, but has the big advantage of being
simple. If somebody is interested to fix this in a prettier way...

This presumably got broken in the course of 0709b7ee72.

Discussion: 20150114225919.GY5245@awork2.anarazel.de

Per complaint from Tom Lane.
2015-01-15 13:26:25 +01:00
Andres Freund 59f71a0d0b Add a default local latch for use in signal handlers.
To do so, move InitializeLatchSupport() into the new common process
initialization functions, and add a new global variable MyLatch.

MyLatch is usable as soon InitPostmasterChild() has been called
(i.e. very early during startup). Initially it points to a process
local latch that exists in all processes. InitProcess/InitAuxiliaryProcess
then replaces that local latch with PGPROC->procLatch. During shutdown
the reverse happens.

This is primarily advantageous for two reasons: For one it simplifies
dealing with the shared process latch, especially in signal handlers,
because instead of having to check for MyProc, MyLatch can be used
unconditionally. For another, a later patch that makes FEs/BE
communication use latches, now can rely on the existence of a latch,
even before having gone through InitProcess.

Discussion: 20140927191243.GD5423@alap3.anarazel.de
2015-01-14 18:45:22 +01:00
Andres Freund 14e8803f10 Add barriers to the latch code.
Since their introduction latches have required barriers in SetLatch
and ResetLatch - but when they were introduced there wasn't any
barrier abstraction. Instead latches were documented to rely on the
callsites to provide barrier semantics.

Now that the barrier support looks halfway complete, add the necessary
barriers to both latch implementations.

Also remove a now superflous lock acquisition from syncrep.c and a
superflous (and insufficient) barrier from freelist.c. There might be
other cases that can now be simplified, but those are the only ones
I've seen on a quick scan.

We might want to backpatch this at some later point, but right now the
barrier infrastructure in the backbranches isn't totally on par with
master.

Discussion: 20150112154026.GB2092@awork2.anarazel.de
2015-01-13 12:58:43 +01:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Andres Freund d72731a704 Lockless StrategyGetBuffer clock sweep hot path.
StrategyGetBuffer() has proven to be a bottleneck in a number of
buffer acquisition heavy workloads. To some degree this has already
been alleviated by 5d7962c6, but it still can be quite a heavy
bottleneck.  The problem is that in unfortunate usage patterns a
single StrategyGetBuffer() call will have to look at a large number of
buffers - in turn making it likely that the process will be put to
sleep while still holding the spinlock.

Replace most of the usage of the buffer_strategy_lock spinlock for the
clock sweep by a atomic nextVictimBuffer variable. That variable,
modulo NBuffers, is the current hand of the clock sweep. The buffer
clock-sweep then only needs to acquire the spinlock after a
wraparound. And even then only in the process that did the wrapping
around. That alleviates nearly all the contention on the relevant
spinlock, although significant contention on the cacheline can still
exist.

Reviewed-By: Robert Haas and Amit Kapila

Discussion: 20141010160020.GG6670@alap3.anarazel.de,
    20141027133218.GA2639@awork2.anarazel.de
2014-12-25 18:26:25 +01:00
Andres Freund ab5194e6f6 Improve LWLock scalability.
The old LWLock implementation had the problem that concurrent lock
acquisitions required exclusively acquiring a spinlock. Often that
could lead to acquirers waiting behind the spinlock, even if the
actual LWLock was free.

The new implementation doesn't acquire the spinlock when acquiring the
lock itself. Instead the new atomic operations are used to atomically
manipulate the state. Only the waitqueue, used solely in the slow
path, is still protected by the spinlock. Check lwlock.c's header for
an explanation about the used algorithm.

For some common workloads on larger machines this can yield
significant performance improvements. Particularly in read mostly
workloads.

Reviewed-By: Amit Kapila and Robert Haas
Author: Andres Freund

Discussion: 20130926225545.GB26663@awork2.anarazel.de
2014-12-25 17:24:30 +01:00
Andres Freund 7882c3b0b9 Convert the PGPROC->lwWaitLink list into a dlist instead of open coding it.
Besides being shorter and much easier to read it changes the logic in
LWLockRelease() to release all shared lockers when waking up any. This
can yield some significant performance improvements - and the fairness
isn't really much worse than before, as we always allowed new shared
lockers to jump the queue.
2014-12-25 17:24:30 +01:00
Alvaro Herrera 73c986adde Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData().  This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.

This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.

A new test in src/test/modules is included.

Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.

Authors: Álvaro Herrera and Petr Jelínek

Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 11:53:02 -03:00
Andres Freund 0fd38e1370 Don't skip SQL backends in logical decoding for visibility computation.
The logical decoding patchset introduced PROC_IN_LOGICAL_DECODING flag
PGXACT flag, that allows such backends to be skipped when computing
the xmin horizon/snapshots. That's fine and sensible for walsenders
streaming out logical changes, but not at all fine for SQL backends
doing logical decoding. If the latter set that flag any change they
have performed outside of logical decoding will not be regarded as
visible - which e.g. can lead to that change being vacuumed away.

Note that not setting the flag for SQL backends isn't particularly
bothersome - the SQL backend doesn't do streaming, so it only runs for
a limited amount of time.

Per buildfarm member 'tick' and Alvaro.

Backpatch to 9.4, where logical decoding was introduced.
2014-12-02 23:47:08 +01:00
Heikki Linnakangas 1812ee5767 Remove dead function prototype
It was added in commit efc16ea5, but never defined.
2014-11-26 11:05:54 +02:00
Heikki Linnakangas 2c03216d83 Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.

There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.

This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.

For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.

The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.

Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 18:46:41 +02:00
Heikki Linnakangas 81c4508196 Fix race condition between hot standby and restoring a full-page image.
There was a window in RestoreBackupBlock where a page would be zeroed out,
but not yet locked. If a backend pinned and locked the page in that window,
it saw the zeroed page instead of the old page or new page contents, which
could lead to missing rows in a result set, or errors.

To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins,
zeroes, and locks the page, if it's not in the buffer cache already.

In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE,
to avoid breaking any 3rd party extensions that might use RBM_ZERO. More
importantly, this avoids renumbering the other enum values, which would
cause even bigger confusion in extensions that use ReadBufferExtended, but
haven't been recompiled.

Backpatch to all supported versions; this has been racy since hot standby
was introduced.
2014-11-13 20:02:37 +02:00
Heikki Linnakangas ae667f778d Really fix compilation failure on MIPS.
I missed an additional colon in previous patch. Oops. to make that mistake
less likely in the future, add comments as placeholders for unused inputs
and outputs in inline assembly.
2014-11-11 10:25:22 +02:00
Heikki Linnakangas baf7b3a503 Fix compilation failure on MIPS.
Rémi Zara
2014-11-11 01:06:06 +02:00
Alvaro Herrera 7516f52594 BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes.  They work by maintaining "summary" data about
block ranges.  Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not.  Normal index scans are not supported
because these indexes do not store TIDs.

As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.

For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range.  This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results.  In this commit I only include minmax.

Catalog version bumped due to new builtin catalog entries.

There's more that could be done here, but this is a good step forwards.

Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.

Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.

PS:
  The research leading to these results has received funding from the
  European Union's Seventh Framework Programme (FP7/2007-2013) under
  grant agreement n° 318633.
2014-11-07 16:38:14 -03:00
Heikki Linnakangas 2076db2aea Move the backup-block logic from XLogInsert to a new file, xloginsert.c.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.

Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.

Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-11-06 13:55:36 +02:00
Robert Haas f7102b0463 Extend dsm API with a new function dsm_unpin_mapping.
This reassociates a dynamic shared memory handle previous passed to
dsm_pin_mapping with the current resource owner, so that it will be
cleaned up at the end of the current query.

Patch by me.  Review of the function name by Andres Freund, Amit
Kapila, Jim Nasby, Petr Jelinek, and Álvaro Herrera.
2014-10-30 14:55:23 -04:00
Robert Haas 6057c212f3 "Pin", rather than "keep", dynamic shared memory mappings and segments.
Nobody seemed concerned about this naming when it originally went in,
but there's a pending patch that implements the opposite of
dsm_keep_mapping, and the term "unkeep" was judged unpalatable.
"unpin" has existing precedent in the PostgreSQL code base, and the
English language, so use this terminology instead.

Per discussion, back-patch to 9.4.
2014-10-30 11:35:55 -04:00
Bruce Momjian a4da35a0d2 Add variable names to two LWLock C prototypes
Previously only the variable types appeared.
2014-10-27 04:45:57 -04:00
Robert Haas 7bb0e97407 Extend shm_mq API with new functions shm_mq_sendv, shm_mq_set_handle.
shm_mq_sendv sends a message to the queue assembled from multiple
locations.  This is expected to be used by forthcoming patches to
allow frontend/backend protocol messages to be sent via shm_mq, but
might be useful for other purposes as well.

shm_mq_set_handle associates a BackgroundWorkerHandle with an
already-existing shm_mq_handle.  This solves a timing problem when
creating a shm_mq to communicate with a newly-launched background
worker: if you attach to the queue first, and the background worker
fails to start, you might block forever trying to do I/O on the queue;
but if you start the background worker first, but then die before
attaching to the queue, the background worrker might block forever
trying to do I/O on the queue.  This lets you attach before starting
the worker (so that the worker is protected) and then associate the
BackgroundWorkerHandle later (so that you are also protected).

Patch by me, reviewed by Stephen Frost.
2014-10-08 14:38:31 -04:00
Robert Haas 9019264f2e Still another typo fix for 0709b7ee72.
Buildfarm member anole caught this one.
2014-10-03 11:39:38 -04:00
Robert Haas 3acc10c997 Increase the number of buffer mapping partitions to 128.
Testing by Amit Kapila, Andres Freund, and myself, with and without
other patches that also aim to improve scalability, seems to indicate
that this change is a significant win over the current value and over
smaller values such as 64.  It's not clear how high we can push this
value before it starts to have negative side-effects elsewhere, but
going this far looks OK.
2014-10-02 13:58:50 -04:00
Andres Freund b64d92f1a5 Add a basic atomic ops API abstracting away platform/architecture details.
Several upcoming performance/scalability improvements require atomic
operations. This new API avoids the need to splatter compiler and
architecture dependent code over all the locations employing atomic
ops.

For several of the potential usages it'd be problematic to maintain
both, a atomics using implementation and one using spinlocks or
similar. In all likelihood one of the implementations would not get
tested regularly under concurrency. To avoid that scenario the new API
provides a automatic fallback of atomic operations to spinlocks. All
properties of atomic operations are maintained. This fallback -
obviously - isn't as fast as just using atomic ops, but it's not bad
either. For one of the future users the atomics ontop spinlocks
implementation was actually slightly faster than the old purely
spinlock using implementation. That's important because it reduces the
fear of regressing older platforms when improving the scalability for
new ones.

The API, loosely modeled after the C11 atomics support, currently
provides 'atomic flags' and 32 bit unsigned integers. If the platform
efficiently supports atomic 64 bit unsigned integers those are also
provided.

To implement atomics support for a platform/architecture/compiler for
a type of atomics 32bit compare and exchange needs to be
implemented. If available and more efficient native support for flags,
32 bit atomic addition, and corresponding 64 bit operations may also
be provided. Additional useful atomic operations are implemented
generically ontop of these.

The implementation for various versions of gcc, msvc and sun studio have
been tested. Additional existing stub implementations for
* Intel icc
* HUPX acc
* IBM xlc
are included but have never been tested. These will likely require
fixes based on buildfarm and user feedback.

As atomic operations also require barriers for some operations the
existing barrier support has been moved into the atomics code.

Author: Andres Freund with contributions from Oskari Saarenmaa
Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera
Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com,
    20131015123303.GH5300@awork2.anarazel.de,
    20131028205522.GI20248@awork2.anarazel.de
2014-09-25 23:49:05 +02:00
Robert Haas 5d7962c679 Change locking regimen around buffer replacement.
Previously, we used an lwlock that was held from the time we began
seeking a candidate buffer until the time when we found and pinned
one, which is disastrous for concurrency.  Instead, use a spinlock
which is held just long enough to pop the freelist or advance the
clock sweep hand, and then released.  If we need to advance the clock
sweep further, we reacquire the spinlock once per buffer.

This represents a significant increase in atomic operations around
buffer eviction, but it still wins on many workloads.  On others, it
may result in no gain, or even cause a regression, unless the number
of buffer mapping locks is also increased.  However, that seems like
material for a separate commit.  We may also need to consider other
methods of mitigating contention on this spinlock, such as splitting
it into multiple locks or jumping the clock sweep hand more than one
buffer at a time, but those, too, seem like separate improvements.

Patch by me, inspired by a much larger patch from Amit Kapila.
Reviewed by Andres Freund.
2014-09-25 10:43:24 -04:00
Andres Freund e5603a2f35 Mark x86's memory barrier inline assembly as clobbering the cpu flags.
x86's memory barrier assembly was marked as clobbering "memory" but
not "cc" even though 'addl' sets various flags. As it turns out gcc on
x86 implicitly assumes "cc" on every inline assembler statement, so
it's not a bug. But as that's poorly documented and might get copied
to architectures or compilers where that's not the case, it seems
better to be precise.

Discussion: 20140919100016.GH4277@alap3.anarazel.de

To keep the code common, backpatch to 9.2 where explicit memory
barriers were introduced.
2014-09-19 17:04:00 +02:00
Andres Freund 728f152e07 Add rmgr callback to name xlog record types for display purposes.
This is primarily useful for the upcoming pg_xlogdump --stats feature,
but also allows to remove some duplicated code in the rmgr_desc
routines.

Due to the separation and harmonization, the output of dipsplayed
records changes somewhat. But since this isn't enduser oriented
content that's ok.

It's potentially desirable to further change pg_xlogdump's display of
records. It previously wasn't possible to show the record type
separately from the description forcing it to be in the last
column. But that's better done in a separate commit.

Author: Abhijit Menon-Sen, slightly editorialized by me
Reviewed-By: Álvaro Herrera, Andres Freund, and Heikki Linnakangas
Discussion: 20140604104716.GA3989@toroid.org
2014-09-19 16:20:29 +02:00
Robert Haas 5b26278822 Fix thinko in 0709b7ee72.
Buildfarm member castoroides is unhappy with this, for entirely
understandable reasons.
2014-09-10 14:40:21 -04:00
Andres Freund b4c28d1b92 Fix typo in 07c8651dd9 causing WIN32_ONLY_COMPILER builds to fail. 2014-09-10 02:44:39 +02:00
Robert Haas 0709b7ee72 Change the spinlock primitives to function as compiler barriers.
Previously, they functioned as barriers against CPU reordering but not
compiler reordering, an odd API that required extensive use of volatile
everywhere that spinlocks are used.  That's error-prone and has negative
implications for performance, so change it.

In theory, this makes it safe to remove many of the uses of volatile
that we currently have in our code base, but we may find that there are
some bugs in this effort when we do.  In the long run, though, this
should make for much more maintainable code.

Patch by me.  Review by Andres Freund.
2014-09-09 17:48:50 -04:00
Andres Freund 50881036b1 Fix typo in solaris spinlock fix.
07968dbfaa missed part of the S_UNLOCK define when building for
sparcv8+.
2014-09-09 13:57:38 +02:00
Andres Freund 07968dbfaa Fix spinlock implementation for some !solaris sparc platforms.
Some Sparc CPUs can be run in various coherence models, ranging from
RMO (relaxed) over PSO (partial) to TSO (total). Solaris has always
run CPUs in TSO mode while in userland, but linux didn't use to and
the various *BSDs still don't. Unfortunately the sparc TAS/S_UNLOCK
were only correct under TSO. Fix that by adding the necessary memory
barrier instructions. On sparcv8+, which should be all relevant CPUs,
these are treated as NOPs if the current consistency model doesn't
require the barriers.

Discussion: 20140630222854.GW26930@awork2.anarazel.de

Will be backpatched to all released branches once a few buildfarm
cycles haven't shown up problems. As I've no access to sparc, this is
blindly written.
2014-09-09 00:47:32 +02:00
Andres Freund 4b4b680c3d Make backend local tracking of buffer pins memory efficient.
Since the dawn of time (aka Postgres95) multiple pins of the same
buffer by one backend have been optimized not to modify the shared
refcount more than once. This optimization has always used a NBuffer
sized array in each backend keeping track of a backend's pins.

That array (PrivateRefCount) was one of the biggest per-backend memory
allocations, depending on the shared_buffers setting. Besides the
waste of memory it also has proven to be a performance bottleneck when
assertions are enabled as we make sure that there's no remaining pins
left at the end of transactions. Also, on servers with lots of memory
and a correspondingly high shared_buffers setting the amount of random
memory accesses can also lead to poor cpu cache efficiency.

Because of these reasons a backend's buffers pins are now kept track
of in a small statically sized array that overflows into a hash table
when necessary. Benchmarks have shown neutral to positive performance
results with considerably lower memory usage.

Patch by me, review by Robert Haas.

Discussion: 20140321182231.GA17111@alap3.anarazel.de
2014-08-30 14:03:21 +02:00
Robert Haas 4893ccd024 Remove swpb-based spinlock implementation for ARMv5 and earlier.
Per recent analysis by Andres Freund, this implementation is in fact
unsafe, because ARMv5 has weak memory ordering, which means tha the
CPU could move loads or stores across the volatile store performed by
the default S_UNLOCK.  We could try to fix this, but have no ARMv5
hardware to test on, so removing support seems better.  We can still
support ARMv5 systems on GCC versions new enough to have built-in
atomics support for this platform, and can also re-add support for
the old way if someone has hardware that can be used to test a fix.
However, since the requirement to use a relatively-new GCC hasn't
been an issue for ARMv6 or ARMv7, which lack the swpb instruction
altogether, perhaps it won't be an issue for ARMv5 either.
2014-07-06 14:56:36 -04:00
Heikki Linnakangas 1c6821be31 Fix and enhance the assertion of no palloc's in a critical section.
The assertion failed if WAL_DEBUG or LWLOCK_STATS was enabled; fix that by
using separate memory contexts for the allocations made within those code
blocks.

This patch introduces a mechanism for marking any memory context as allowed
in a critical section. Previously ErrorContext was exempt as a special case.

Instead of a blanket exception of the checkpointer process, only exempt the
memory context used for the pending ops hash table.
2014-06-30 10:26:00 +03:00
Andres Freund a6d488cb53 Remove Alpha and Tru64 support.
Support for running postgres on Alpha hasn't been tested for a long
while. Due to Alpha's uniquely lax cache coherency model it's a hard
to develop for platform (especially blindly!) and thought to be
unlikely to currently work correctly.

As Alpha is the only supported architecture for Tru64 drop support for
it as well. Tru64's support has ended 2012 and it has been in
maintenance-only mode for much longer.

Also remove stray references to __ksr__ and ultrix defines.
2014-06-28 21:46:15 +02:00
Kevin Grittner bfaa8c665f Fix calculation of PREDICATELOCK_MANAGER_LWLOCK_OFFSET.
Commit ea9df812d8 failed to include
NUM_BUFFER_PARTITIONS in this offset, resulting in a bad offset.
Ultimately this threw off NUM_FIXED_LWLOCKS which is based on
earlier offsets, leading to memory allocation problems.  It seems
likely to have also caused increased LWLOCK contention when
serializable transactions were used, because lightweight locks used
for that overlapped others.

Reported by Amit Kapila with analysis and fix.
Backpatch to 9.4, where the bug was introduced.
2014-06-19 08:40:37 -05:00
Heikki Linnakangas 0ef0b6784c Change the signature of rm_desc so that it's passed a XLogRecord.
Just feels more natural, and is more consistent with rm_redo.
2014-06-14 10:46:48 +03:00
Tom Lane 5f93c37805 Add defenses against running with a wrong selection of LOBLKSIZE.
It's critical that the backend's idea of LOBLKSIZE match the way data has
actually been divided up in pg_largeobject.  While we don't provide any
direct way to adjust that value, doing so is a one-line source code change
and various people have expressed interest recently in changing it.  So,
just as with TOAST_MAX_CHUNK_SIZE, it seems prudent to record the value in
pg_control and cross-check that the backend's compiled-in setting matches
the on-disk data.

Also tweak the code in inv_api.c so that fetches from pg_largeobject
explicitly verify that the length of the data field is not more than
LOBLKSIZE.  Formerly we just had Asserts() for that, which is no protection
at all in production builds.  In some of the call sites an overlength data
value would translate directly to a security-relevant stack clobber, so it
seems worth one extra runtime comparison to be sure.

In the back branches, we can't change the contents of pg_control; but we
can still make the extra checks in inv_api.c, which will offer some amount
of protection against running with the wrong value of LOBLKSIZE.
2014-06-05 11:31:06 -04:00
Bruce Momjian 0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not effect backpatching.
2014-05-06 12:12:18 -04:00
Tom Lane 2d00190495 Rationalize common/relpath.[hc].
Commit a730183926 created rather a mess by
putting dependencies on backend-only include files into include/common.
We really shouldn't do that.  To clean it up:

* Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
catalog/catalog.h.  We won't consider this symbol part of the FE/BE API.

* Push enum ForkNumber from relfilenode.h into relpath.h.  We'll consider
relpath.h as the source of truth for fork numbers, since relpath.c was
already partially serving that function, and anyway relfilenode.h was
kind of a random place for that enum.

* So, relfilenode.h now includes relpath.h rather than vice-versa.  This
direction of dependency is fine.  (That allows most, but not quite all,
of the existing explicit #includes of relpath.h to go away again.)

* Push forkname_to_number from catalog.c to relpath.c, just to centralize
fork number stuff a bit better.

* Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
that the previous commit didn't keep this together with relpath().

* To avoid needing relfilenode.h in common/, redefine the underlying
function (now called GetRelationPath) as taking separate OID arguments,
and make the APIs using RelFileNode or RelFileNodeBackend into macro
wrappers.  (The macros have a potential multiple-eval risk, but none of
the existing call sites have an issue with that; one of them had such a
risk already anyway.)

* Fix failure to follow the directions when "init" fork type was added;
specifically, the errhint in forkname_to_number wasn't updated, and neither
was the SGML documentation for pg_relation_size().

* Fix tablespace-path-too-long check in CreateTableSpace() to account for
fork-name component of maximum-length pathnames.  This requires putting
FORKNAMECHARS into a header file, but it was rather useless (and
actually unreferenced) where it was.

The last couple of items are potentially back-patchable bug fixes,
if anyone is sufficiently excited about them; but personally I'm not.

Per a gripe from Christoph Berg about how include/common wasn't
self-contained.
2014-04-30 17:30:50 -04:00
Robert Haas 11a65eed16 Get rid of the dynamic shared memory state file.
Instead of storing the ID of the dynamic shared memory control
segment in a file within the data directory, store it in the main
control segment.  This avoids a number of nasty corner cases,
most seriously that doing an online backup and then using it on
the same machine (e.g. to fire up a standby) would result in the
standby clobbering all of the master's dynamic shared memory
segments.

Per complaints from Heikki Linnakangas, Fujii Masao, and Tom
Lane.
2014-04-08 11:39:55 -04:00
Heikki Linnakangas 68a2e52bba Replace the XLogInsert slots with regular LWLocks.
The special feature the XLogInsert slots had over regular LWLocks is the
insertingAt value that was updated atomically with releasing backends
waiting on it. Add new functions to the LWLock API to do that, and replace
the slots with LWLocks. This reduces the amount of duplicated code.
(There's still some duplication, but at least it's all in lwlock.c now.)

Reviewed by Andres Freund.
2014-03-21 15:10:48 +01:00
Alvaro Herrera f88d4cfc9d Setup error context callback for transaction lock waits
With this in place, a session blocking behind another one because of
tuple locks will get a context line mentioning the relation name, tuple
TID, and operation being done on tuple.  For example:

LOG:  process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms
DETAIL:  Process holding the lock: 11366. Wait queue: 11367.
CONTEXT:  while updating tuple (0,2) in relation "foo"
STATEMENT:  UPDATE foo SET value = 3;

Most usefully, the new line is displayed by log entries due to
log_lock_waits, although of course it will be printed by any other log
message as well.

Author: Christian Kruse, some tweaks by Álvaro Herrera
Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
2014-03-19 15:10:36 -03:00
Robert Haas 3bd261ca18 Improve shm_mq portability around MAXIMUM_ALIGNOF and sizeof(Size).
Revise the original decision to expose a uint64-based interface and
use Size everywhere possible.  Avoid assuming that MAXIMUM_ALIGNOF is
8, or making any assumption about the relationship between that value
and sizeof(Size).  If MAXIMUM_ALIGNOF is bigger, we'll now insert
padding after the length word; if it's smaller, we are now prepared
to read and write the length word in chunks.

Per discussion with Tom Lane.
2014-03-18 11:23:13 -04:00
Robert Haas 79a4d24f31 Make it easy to detach completely from shared memory.
The new function dsm_detach_all() can be used either by postmaster
children that don't wish to take any risk of accidentally corrupting
shared memory; or by forked children of regular backends with
the same need.  This patch also updates the postmaster children that
already do PGSharedMemoryDetach() to do dsm_detach_all() as well.

Per discussion with Tom Lane.
2014-03-18 07:58:53 -04:00
Robert Haas 8722017bbc Allow dynamic shared memory segments to be kept until shutdown.
Amit Kapila, reviewed by Kyotaro Horiguchi, with some further
changes by me.
2014-03-10 14:04:47 -04:00
Robert Haas cb9a0c7987 Teach on_exit_reset() to discard pending cleanups for dsm.
If a postmaster child invokes fork() and then calls on_exit_reset, that
should be sufficient to let it exit() without breaking anything, but
dynamic shared memory broke that by not updating on_exit_reset() to
discard callbacks registered with dynamic shared memory segments.

Per investigation of a complaint from Tom Lane.
2014-03-10 10:17:19 -04:00
Bruce Momjian 5024044a20 C comments: improve description of relfilenode uniqueness
Report by Antonin Houska
2014-03-08 12:20:30 -05:00
Heikki Linnakangas 55566c9a74 Fix dangling smgr_owner pointer when a fake relcache entry is freed.
A fake relcache entry can "own" a SmgrRelation object, like a regular
relcache entry. But when it was free'd, the owner field in SmgrRelation
was not cleared, so it was left pointing to free'd memory.

Amazingly this apparently hasn't caused crashes in practice, or we would've
heard about it earlier. Andres found this with Valgrind.

Report and fix by Andres Freund, with minor modifications by me. Backpatch
to all supported versions.
2014-03-07 13:28:52 +02:00
Bruce Momjian 0024a3a3b6 C comment update: relfilenode is only unique with a tablespace
Report from Antonin Houska
2014-03-05 20:52:34 -05:00
Robert Haas b89e151054 Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables.  The output format is controlled by a
so-called "output plugin"; an example is included.  To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.

Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.

Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 16:32:18 -05:00
Heikki Linnakangas f8ce16d0d2 Rename huge_tlb_pages to huge_pages, and improve docs.
Christian Kruse
2014-03-03 20:52:48 +02:00
Robert Haas dd1a3bccca Show xid and xmin in pg_stat_activity and pg_stat_replication.
Christian Kruse, reviewed by Andres Freund and myself, with further
minor adjustments by me.
2014-02-25 12:34:04 -05:00
Tom Lane fa1f0d7859 PGDLLIMPORT-ify MainLWLockArray, ProcDiePending, proc_exit_inprogress.
These are needed in HEAD to make assorted contrib modules build on Windows.
Now that all the MSVC and Mingw buildfarm members seem to be on the same
page about the need for them, we can have some confidence that future
problems of this ilk will be detected promptly; there seems nothing more
to be learned by delaying this fix further.

I chose to mark QueryCancelPending as well, since it's easy to imagine code
that wants to touch ProcDiePending also caring about QueryCancelPending.
2014-02-16 20:12:43 -05:00
Peter Eisentraut 66c04c981d Mark some more variables as static or include the appropriate header
Detected by clang's -Wmissing-variable-declarations.

From: Andres Freund <andres@anarazel.de>
2014-02-08 21:21:46 -05:00
Robert Haas 858ec11858 Introduce replication slots.
Replication slots are a crash-safe data structure which can be created
on either a master or a standby to prevent premature removal of
write-ahead log segments needed by a standby, as well as (with
hot_standby_feedback=on) pruning of tuples whose removal would cause
replication conflicts.  Slots have some advantages over existing
techniques, as explained in the documentation.

In a few places, we refer to the type of replication slots introduced
by this patch as "physical" slots, because forthcoming patches for
logical decoding will also have slots, but with somewhat different
properties.

Andres Freund and Robert Haas
2014-01-31 22:45:36 -05:00
Heikki Linnakangas 1a3458b6d8 Allow using huge TLB pages on Linux (MAP_HUGETLB)
This patch adds an option, huge_tlb_pages, which allows requesting the
shared memory segment to be allocated using huge pages, by using the
MAP_HUGETLB flag in mmap(). This can improve performance.

The default is 'try', which means that we will attempt using huge pages,
and fall back to non-huge pages if it doesn't work. Currently, only Linux
has MAP_HUGETLB. On other platforms, the default 'try' behaves the same as
'off'.

In the passing, don't try to round the mmap() size to a multiple of
pagesize. mmap() doesn't require that, and there's no particular reason for
PostgreSQL to do that either. When using MAP_HUGETLB, however, round the
request size up to nearest 2MB boundary. This is to work around a bug in
some Linux kernel versions, but also to avoid wasting memory, because the
kernel will round the size up anyway.

Many people were involved in writing this patch, including Christian Kruse,
Richard Poole, Abhijit Menon-Sen, reviewed by Peter Geoghegan, Andres Freund
and me.
2014-01-29 14:08:30 +02:00
Robert Haas ea9df812d8 Relax the requirement that all lwlocks be stored in a single array.
This makes it possible to store lwlocks as part of some other data
structure in the main shared memory segment, or in a dynamic shared
memory segment.  There is still a main LWLock array and this patch does
not move anything out of it, but it provides necessary infrastructure
for doing that in the future.

This change is likely to increase the size of LWLockPadded on some
platforms, especially 32-bit platforms where it was previously only
16 bytes.

Patch by me.  Review by Andres Freund and KaiGai Kohei.
2014-01-27 11:07:44 -05:00
Andrew Dunstan 7d7eee8bb7 Export a few more symbols required for test_shm_mq module.
Patch from Amit Kapila.
2014-01-18 15:29:45 -05:00
Andrew Dunstan 708c529c7f Export set_latch_on_sigusr1 symbol for Windows.
Per buildfarm currawong and grip from David Rowley.
2014-01-17 12:48:23 -05:00
Robert Haas ed46758381 Logging running transactions every 15 seconds.
Previously, we did this just once per checkpoint, but that could make
Hot Standby take a long time to initialize.  To avoid busying an
otherwise-idle system, we don't do this if no WAL has been written
since we did it last.

Andres Freund
2014-01-15 12:41:20 -05:00
Tom Lane 061b079f89 Fix multiple bugs in index page locking during hot-standby WAL replay.
In ordinary operation, VACUUM must be careful to take a cleanup lock on
each leaf page of a btree index; this ensures that no indexscans could
still be "in flight" to heap tuples due to be deleted.  (Because of
possible index-tuple motion due to concurrent page splits, it's not enough
to lock only the pages we're deleting index tuples from.)  In Hot Standby,
the WAL replay process must likewise lock every leaf page.  There were
several bugs in the code for that:

* The replay scan might come across unused, all-zero pages in the index.
While btree_xlog_vacuum itself did the right thing (ie, nothing) with
such pages, xlogutils.c supposed that such pages must be corrupt and
would throw an error.  This accounts for various reports of replication
failures with "PANIC: WAL contains references to invalid pages".  To
fix, add a ReadBufferMode value that instructs XLogReadBufferExtended
not to complain when we're doing this.

* btree_xlog_vacuum performed the extra locking if standbyState ==
STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up
for hot standby queries until the database has reached consistency, and
we don't want to do the extra locking till then either, for fear of reading
corrupted pages (which bufmgr.c would complain about).  Fix by exporting a
new function from xlog.c that will report whether we're actually in hot
standby replay mode.

* To ensure full coverage of the index in the replay scan, btvacuumscan
would emit a dummy WAL record for the last page of the index, if no
vacuuming work had been done on that page.  However, if the last page
of the index is all-zero, that would result in corruption of said page,
since the functions called on it weren't prepared to handle that case.
There's no need to lock any such pages, so change the logic to target
the last normal leaf page instead.

The first two of these bugs were diagnosed by Andres Freund, the other one
by me.  Fixes based on ideas from Heikki Linnakangas and myself.

This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
2014-01-14 17:35:21 -05:00
Robert Haas ec9037df26 Single-reader, single-writer, lightweight shared message queue.
This code provides infrastructure for user backends to communicate
relatively easily with background workers.  The message queue is
structured as a ring buffer and allows messages of arbitary length
to be sent and received.

Patch by me.  Review by KaiGai Kohei and Andres Freund.
2014-01-14 12:23:22 -05:00
Robert Haas 6ddd5137b2 Simple table of contents for a shared memory segment.
This interface is intended to make it simple to divide a dynamic shared
memory segment into different regions with distinct purposes.  It
therefore serves much the same purpose that ShmemIndex accomplishes for
the main shared memory segment, but it is intended to be more
lightweight.

Patch by me.  Review by Andres Freund.
2014-01-14 12:18:58 -05:00
Tom Lane 220b34331f We don't need to include pg_sema.h in s_lock.h anymore.
Minor improvement to commit daa7527afc2274432094ebe7ceb03aa41f916607:
s_lock.h no longer has any need to mention PGSemaphoreData, so we can
rip out the #include that supplies that.  In a non-HAVE_SPINLOCKS
build, this doesn't really buy much since we still need the #include
in spin.h --- but everywhere else, this reduces #include footprint by
some trifle, and helps keep the different locking facilities separate.
2014-01-08 20:58:22 -05:00
Robert Haas daa7527afc Reduce the number of semaphores used under --disable-spinlocks.
Instead of allocating a semaphore from the operating system for every
spinlock, allocate a fixed number of semaphores (by default, 1024)
from the operating system and multiplex all the spinlocks that get
created onto them.  This could self-deadlock if a process attempted
to acquire more than one spinlock at a time, but since processes
aren't supposed to execute anything other than short stretches of
straight-line code while holding a spinlock, that shouldn't happen.

One motivation for this change is that, with the introduction of
dynamic shared memory, it may be desirable to create spinlocks that
last for less than the lifetime of the server.  Without this change,
attempting to use such facilities under --disable-spinlocks would
quickly exhaust any supply of available semaphores.  Quite apart
from that, it's desirable to contain the quantity of semaphores
needed to run the server simply on convenience grounds, since using
too many may make it harder to get PostgreSQL running on a new
platform, which is mostly the point of --disable-spinlocks in the
first place.

Patch by me; review by Tom Lane.
2014-01-08 18:58:00 -05:00
Bruce Momjian 7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Robert Haas 001a573a20 Allow on-detach callbacks for dynamic shared memory segments.
Just as backends must clean up their shared memory state (releasing
lwlocks, buffer pins, etc.) before exiting, they must also perform
any similar cleanups related to dynamic shared memory segments they
have mapped before unmapping those segments.  So add a mechanism to
ensure that.

Existing on_shmem_exit hooks include both "user level" cleanup such
as transaction abort and removal of leftover temporary relations and
also "low level" cleanup that forcibly released leftover shared
memory resources.  On-detach callbacks should run after the first
group but before the second group, so create a new before_shmem_exit
function for registering the early callbacks and keep on_shmem_exit
for the regular callbacks.  (An earlier draft of this patch added an
additional argument to on_shmem_exit, but that had a much larger
footprint and probably a substantially higher risk of breaking third
party code for no real gain.)

Patch by me, reviewed by KaiGai Kohei and Andres Freund.
2013-12-18 13:09:09 -05:00
Tatsuo Ishii 65d6e4cb5c Add ALTER SYSTEM command to edit the server configuration file.
Patch contributed by Amit Kapila. Reviewed by Hari Babu, Masao Fujii,
Boszormenyi Zoltan, Andres Freund, Greg Smith and others.
2013-12-18 23:42:44 +09:00
Tom Lane 7db285afc9 Fix stale-pointer problem in fast-path locking logic.
When acquiring a lock in fast-path mode, we must reset the locallock
object's lock and proclock fields to NULL.  They are not necessarily that
way to start with, because the locallock could be left over from a failed
lock acquisition attempt earlier in the transaction.  Failure to do this
led to all sorts of interesting misbehaviors when LockRelease tried to
clean up no-longer-related lock and proclock objects in shared memory.
Per report from Dan Wood.

In passing, modify LockRelease to elog not just Assert if it doesn't find
lock and proclock objects for a formerly fast-path lock, matching the code
in FastPathGetRelationLockEntry and LockRefindAndRelease.  This isn't a
bug but it will help in diagnosing any future bugs in this area.

Also, modify FastPathTransferRelationLocks and FastPathGetRelationLockEntry
to break out of their loops over the fastpath array once they've found the
sole matching entry.  This was inconsistently done in some search loops
and not others.

Improve assorted related comments, too.

Back-patch to 9.2 where the fast-path mechanism was introduced.
2013-11-27 18:10:00 -05:00
Robert Haas d2aecaea15 Modify dynamic shared memory code to use Size rather than uint64.
This is more consistent with what we do elsewhere.
2013-10-28 12:12:06 -04:00
Robert Haas ea91a6be89 Remove IRIX port.
Development of IRIX has been discontinued, and support is scheduled
to end in December of 2013.  Therefore, there will be no supported
versions of this operating system by the time PostgreSQL 9.4 is
released.  Furthermore, we have no maintainer for this platform.
2013-10-18 08:14:21 -04:00
Robert Haas 81051a86bc Remove spinlock support for SINIX, Sun3, and NS32K.
All of these platforms are very much obsolete.

As far as I can determine, the last version of SINIX, later renamed
Reliant, occurred some time between 2002 and 2005.

The last release of SunOS that would run on a sun3 was released in
November of 1991; the last release of OpenBSD which supported that
platform was in 2001.  The highest clock speed of any processor in
the family was 25MHz.

The NS32K (national semiconductor 320xx) architecture was retired
in 1990.

Support can be re-added if a maintainer emerges for any of these
platforms, but it seems unlikely.

Reviewed by Andres Freund.
2013-10-17 12:02:05 -04:00
Robert Haas 0ac5e5a7e1 Allow dynamic allocation of shared memory segments.
Patch by myself and Amit Kapila.  Design help from Noah Misch.  Review
by Andres Freund.
2013-10-09 21:05:02 -04:00
Kevin Grittner c01262a824 Eliminate xmin from hash tag for predicate locks on heap tuples.
If a tuple was frozen while its predicate locks mattered,
read-write dependencies could be missed, resulting in failure to
detect conflicts which could lead to anomalies in committed
serializable transactions.

This field was added to the tag when we still thought that it was
necessary to carry locks forward to a new version of an updated
row.  That was later proven to be unnecessary, which allowed
simplification of the code, but elimination of xmin from the tag
was missed at the time.

Per report and analysis by Heikki Linnakangas.
Backpatch to 9.1.
2013-10-07 14:16:54 -05:00
Alvaro Herrera 15732b34e8 Add WaitForLockers in lmgr, refactoring index.c code
This is in support of a future REINDEX CONCURRENTLY feature.

Michael Paquier
2013-10-01 17:57:01 -03:00
Alvaro Herrera 1247ea28cb Remove `proc` argument from LockCheckConflicts
This has been unused since commit 8563ccae2c.

Noted by Antonin Houska
2013-09-16 22:14:14 -03:00
Robert Haas cc52d5b33f Expose fsync_fname as a public API.
Andres Freund
2013-09-04 11:15:00 -04:00
Alvaro Herrera 8b290f3115 Update obsolete comment 2013-09-03 16:53:16 -04:00
Heikki Linnakangas b03d196be0 Use a non-locking initial test in TAS_SPIN on x86_64.
Testing done in 2011 by Tom Lane concluded that this is a win on Intel Xeons
and AMD Opterons, but it was not changed back then, because of an old
comment in tas() that suggested that it's a huge loss on older Opterons.
However, didn't have separate TAS() and TAS_SPIN() macros back then, so the
comment referred to doing a non-locked initial test even on the first
access, in uncontended case. I don't have access to older Opterons, but I'm
pretty sure that doing an initial unlocked test is unlikely to be a loss
while spinning, even though it might be for the first access.

We probably should do the same on 32-bit x86, but I'm afraid of changing it
without any testing. Hence just add a note to the x86 implementation
suggesting that we probably should do the same there.
2013-08-29 14:04:37 +03:00
Robert Haas 090d0f2050 Allow discovery of whether a dynamic background worker is running.
Using the infrastructure provided by this patch, it's possible either
to wait for the startup of a dynamically-registered background worker,
or to poll the status of such a worker without waiting.  In either
case, the current PID of the worker process can also be obtained.
As usual, worker_spi is updated to demonstrate the new functionality.

Patch by me.  Review by Andres Freund.
2013-08-28 14:08:13 -04:00
Tom Lane 89779bf2c8 Fix a few problems in barrier.h.
On HPPA, implement pg_memory_barrier() as pg_compiler_barrier(), which
should be correct since this arch doesn't do memory access reordering,
and is anyway better than the completely-nonfunctional-on-this-arch
dummy_spinlock code.  (But note this patch only fixes things for gcc,
not for builds with HP's compiler.)

Also, fix incorrect default definition of pg_memory_barrier as a macro
requiring an argument.

Also, fix incorrect spelling of "#elif" as "#else if" in icc code path
(spotted by pgindent).

This doesn't come close to fixing all of the functional and stylistic
deficiencies in barrier.h, but at least it un-breaks my personal build.
Now that we're actually using barriers in the code, this file is going
to need some serious attention.
2013-07-17 18:38:20 -04:00
Robert Haas 7f7485a0cd Allow background workers to be started dynamically.
There is a new API, RegisterDynamicBackgroundWorker, which allows
an ordinary user backend to register a new background writer during
normal running.  This means that it's no longer necessary for all
background workers to be registered during processing of
shared_preload_libraries, although the option of registering workers
at that time remains available.

When a background worker exits and will not be restarted, the
slot previously used by that background worker is automatically
released and becomes available for reuse.  Slots used by background
workers that are configured for automatic restart can't (yet) be
released without shutting down the system.

This commit adds a new source file, bgworker.c, and moves some
of the existing control logic for background workers there.
Previously, there was little enough logic that it made sense to
keep everything in postmaster.c, but not any more.

This commit also makes the worker_spi contrib module into an
extension and adds a new function, worker_spi_launch, which can
be used to demonstrate the new facility.
2013-07-16 13:02:15 -04:00
Heikki Linnakangas e5592c61ad Fix memory barrier support on icc on ia64, 2nd attempt.
Itanium doesn't have the mfence instruction - that's a 386 thing. Use the
"mf" instruction instead.

This reverts the previous commit to add "#include <emmintrinsic.h>"; the
problem was not with a missing #include.
2013-07-09 11:34:18 +03:00
Heikki Linnakangas 6052bceba5 Add #include needed for _mm_mfence() intrinsic on ia64.
Hopefully this fixes the build failure on buildfarm member dugong.
2013-07-09 10:29:43 +03:00
Heikki Linnakangas 9a20a9b21b Improve scalability of WAL insertions.
This patch replaces WALInsertLock with a number of WAL insertion slots,
allowing multiple backends to insert WAL records to the WAL buffers
concurrently. This is particularly useful for parallel loading large amounts
of data on a system with many CPUs.

This has one user-visible change: switching to a new WAL segment with
pg_switch_xlog() now fills the remaining unused portion of the segment with
zeros. This potentially adds some overhead, but it has been a very common
practice by DBA's to clear the "tail" of the segment with an external
pg_clearxlogtail utility anyway, to make the WAL files compress better.
With this patch, it's no longer necessary to do that.

This patch adds a new GUC, xloginsert_slots, to tune the number of WAL
insertion slots. Performance testing suggests that the default, 8, works
pretty well for all kinds of worklods, but I left the GUC in place to allow
others with different hardware to test that easily. We might want to remove
that before release.

Reviewed by Andres Freund.
2013-07-08 11:23:56 +03:00
Robert Haas 568d4138c6 Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row.  In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result.  This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.

The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow.  However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads.  To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed.  The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all.  Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.

Patch by me.  Review by Michael Paquier and Andres Freund.
2013-07-02 09:47:01 -04:00
Peter Eisentraut 129759d6a5 Fix cpluspluscheck in checksum code
C++ is more picky about comparing signed and unsigned integers.
2013-06-30 10:25:43 -04:00
Simon Riggs 1f09121b4e Ensure no xid gaps during Hot Standby startup
In some cases with higher numbers of subtransactions
it was possible for us to incorrectly initialize
subtrans leading to complaints of missing pages.

Bug report by Sergey Konoplev
Analysis and fix by Andres Freund
2013-06-23 11:05:02 +01:00
Jeff Davis b8fd1a09f3 Add buffer_std flag to MarkBufferDirtyHint().
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.

In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.

Back-patch to 9.3.
2013-06-17 08:02:12 -07:00
Tom Lane f04216341d Refactor checksumming code to make it easier to use externally.
pg_filedump and other external utility programs are likely to want to be
able to check Postgres page checksums.  To avoid messy duplication of code,
move the checksumming functionality into an exported header file, much as
we did awhile back for the CRC code.

In passing, get rid of an unportable assumption that a static char[] array
will be word-aligned, and do some other minor code beautification.
2013-06-13 22:35:56 -04:00
Tom Lane 5c7603c318 Add ARM64 (aarch64) support to s_lock.h.
Use the same gcc atomic functions as we do on newer ARM chips.
(Basically this is a copy and paste of the __arm__ code block,
but omitting the SWPB option since that definitely won't work.)

Back-patch to 9.2.  The patch would work further back, but we'd also
need to update config.guess/config.sub in older branches to make them
build out-of-the-box, and there hasn't been demand for it.

Mark Salter
2013-06-04 15:42:02 -04:00
Bruce Momjian 9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Simon Riggs 443951748c Record data_checksum_version in control file.
The value is not used anywhere in code, but will
allow future changes to the checksum version
should that become necessary in the future.
2013-04-30 12:27:12 +01:00
Simon Riggs 43e7a66849 Introduce new page checksum algorithm and module.
Isolate checksum calculation to its own module, so that bufpage
knows little if anything about the details of the calculation.

This implementation is a modified FNV-1a hash checksum, details
of which are given in the new checksum.c header comments.

Basic implementation only, so we fix the output value.

Later related commits will add version numbers to pg_control,
compiler optimization flags and memory barriers.

Ants Aasma, reviewed by Jeff Davis and Simon Riggs
2013-04-29 09:05:27 +01:00
Simon Riggs 96ef3b8ff1 Allow I/O reliability checks using 16-bit checksums
Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again. Hint bit setting will
require full page write when block is dirtied, which causes various
infrastructure changes. Extensive comments, docs and README.

WARNING message thrown if checksum fails on non-all zeroes page;
ERROR thrown but can be disabled with ignore_checksum_failure = on.

Feature enabled by an initdb option, since transition from option off
to option on is long and complex and has not yet been implemented.
Default is not to use checksums.

Checksum used is WAL CRC-32 truncated to 16-bits.

Simon Riggs, Jeff Davis, Greg Smith
Wide input and assistance from many community members. Thank you.
2013-03-22 13:54:07 +00:00
Simon Riggs bb7cc2623f Remove PageSetTLI and rename pd_tli to pd_checksum
Remove use of PageSetTLI() from all page manipulation functions
and adjust README to indicate change in the way we make changes
to pages. Repurpose those bytes into the pd_checksum field and
explain how that works in comments about page header.

Refactoring ahead of actual feature patch which would make use
of the checksum field, arriving later.

Jeff Davis, with comments and doc changes by Simon Riggs
Direction suggested by Robert Haas; many others providing
review comments.
2013-03-18 13:46:42 +00:00
Tom Lane d43837d030 Add lock_timeout configuration parameter.
This GUC allows limiting the time spent waiting to acquire any one
heavyweight lock.

In support of this, improve the recently-added timeout infrastructure
to permit efficiently enabling or disabling multiple timeouts at once.
That reduces the performance hit from turning on lock_timeout, though
it's still not zero.

Zoltán Böszörményi, reviewed by Tom Lane,
Stephen Frost, and Hari Babu
2013-03-16 23:22:57 -04:00
Heikki Linnakangas 3d009e45bd Add support for piping COPY to/from an external program.
This includes backend "COPY TO/FROM PROGRAM '...'" syntax, and corresponding
psql \copy syntax. Like with reading/writing files, the backend version is
superuser-only, and in the psql version, the program is run in the client.

In the passing, the psql \copy STDIN/STDOUT syntax is subtly changed: if you
the stdin/stdout is quoted, it's now interpreted as a filename. For example,
"\copy foo from 'stdin'" now reads from a file called 'stdin', not from
standard input. Before this, there was no way to specify a filename called
stdin, stdout, pstdin or pstdout.

This creates a new function in pgport, wait_result_to_str(), which can
be used to convert the exit status of a process, as returned by wait(3),
to a human-readable string.

Etsuro Fujita, reviewed by Amit Kapila.
2013-02-27 18:22:31 +02:00
Peter Eisentraut 0cb1fac3b1 Add noreturn attributes to some error reporting functions 2013-02-12 07:13:22 -05:00
Alvaro Herrera 0ac5ad5134 Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE".  These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE".  UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.

Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.

The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid.  Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates.  This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed.  pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.

Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header.  This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.

Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)

With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.

As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.

Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane.  There's probably room for several more tests.

There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it.  Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.

This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
	AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
	1290721684-sup-3951@alvh.no-ip.org
	1294953201-sup-2099@alvh.no-ip.org
	1320343602-sup-2290@alvh.no-ip.org
	1339690386-sup-8927@alvh.no-ip.org
	4FE5FF020200002500048A3D@gw.wicourts.gov
	4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 12:04:59 -03:00
Alvaro Herrera 279628a0a7 Accelerate end-of-transaction dropping of relations
When relations are dropped, at end of transaction we need to remove the
files and clean the buffer pool of buffers containing pages of those
relations.  Previously we would scan the buffer pool once per relation
to clean up buffers.  When there are many relations to drop, the
repeated scans make this process slow; so we now instead pass a list of
relations to drop and scan the pool once, checking each buffer against
the passed list.  When the number of relations is larger than a
threshold (which as of this patch is being set to 20 relations) we sort
the array before starting, and bsearch the array; when it's smaller, we
simply scan the array linearly each time, because that's faster.  The
exact optimal threshold value depends on many factors, but the
difference is not likely to be significant enough to justify making it
user-settable.

This has been measured to be a significant win (a 15x win when dropping
100,000 relations; an extreme case, but reportedly a real one).

Author: Tomas Vondra, some tweaks by me
Reviewed by: Robert Haas, Shigeru Hanada, Andres Freund, Álvaro Herrera
2013-01-17 16:13:17 -03:00
Heikki Linnakangas 9ee4d06f3f Make GiST indexes on-disk compatible with 9.2 again.
The patch that turned XLogRecPtr into a uint64 inadvertently changed the
on-disk format of GiST indexes, because the NSN field in the GiST page
opaque is an XLogRecPtr. That breaks pg_upgrade. Revert the format of that
field back to the two-field struct that XLogRecPtr was before. This is the
same we did to LSNs in the page header to avoid changing on-disk format.

Bump catversion, as this invalidates any existing GiST indexes built on
9.3devel.
2013-01-17 16:46:16 +02:00
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Kevin Grittner b19e4250b4 Fix performance problems with autovacuum truncation in busy workloads.
In situations where there are over 8MB of empty pages at the end of
a table, the truncation work for trailing empty pages takes longer
than deadlock_timeout, and there is frequent access to the table by
processes other than autovacuum, there was a problem with the
autovacuum worker process being canceled by the deadlock checking
code. The truncation work done by autovacuum up that point was
lost, and the attempt tried again by a later autovacuum worker. The
attempts could continue indefinitely without making progress,
consuming resources and blocking other processes for up to
deadlock_timeout each time.

This patch has the autovacuum worker checking whether it is
blocking any other thread at 20ms intervals. If such a condition
develops, the autovacuum worker will persist the work it has done
so far, release its lock on the table, and sleep in 50ms intervals
for up to 5 seconds, hoping to be able to re-acquire the lock and
try again. If it is unable to get the lock in that time, it moves
on and a worker will try to continue later from the point this one
left off.

While this patch doesn't change the rules about when and what to
truncate, it does cause the truncation to occur sooner, with less
blocking, and with the consumption of fewer resources when there is
contention for the table's lock.

The only user-visible change other than improved performance is
that the table size during truncation may change incrementally
instead of just once.

This problem exists in all supported versions but is infrequently
reported, although some reports of performance problems when
autovacuum runs might be caused by this. Initial commit is just the
master branch, but this should probably be backpatched once the
build farm and general developer usage confirm that there are no
surprising effects.

Jan Wieck
2012-12-11 14:33:08 -06:00
Alvaro Herrera da07a1e856 Background worker processes
Background workers are postmaster subprocesses that run arbitrary
user-specified code.  They can request shared memory access as well as
backend database connections; or they can just use plain libpq frontend
database connections.

Modules listed in shared_preload_libraries can register background
workers in their _PG_init() function; this is early enough that it's not
necessary to provide an extra GUC option, because the necessary extra
resources can be allocated early on.  Modules can install more than one
bgworker, if necessary.

Care is taken that these extra processes do not interfere with other
postmaster tasks: only one such process is started on each ServerLoop
iteration.  This means a large number of them could be waiting to be
started up and postmaster is still able to quickly service external
connection requests.  Also, shutdown sequence should not be impacted by
a worker process that's reasonably well behaved (i.e. promptly responds
to termination signals.)

The current implementation lets worker processes specify their start
time, i.e. at what point in the server startup process they are to be
started: right after postmaster start (in which case they mustn't ask
for shared memory access), when consistent state has been reached
(useful during recovery in a HOT standby server), or when recovery has
terminated (i.e. when normal backends are allowed).

In case of a bgworker crash, actions to take depend on registration
data: if shared memory was requested, then all other connections are
taken down (as well as other bgworkers), just like it were a regular
backend crashing.  The bgworker itself is restarted, too, within a
configurable timeframe (which can be configured to be never).

More features to add to this framework can be imagined without much
effort, and have been discussed, but this seems good enough as a useful
unit already.

An elementary sample module is supplied.

Author: Álvaro Herrera

This patch is loosely based on prior patches submitted by KaiGai Kohei,
and unsubmitted code by Simon Riggs.

Reviewed by: KaiGai Kohei, Markus Wanner, Andres Freund,
Heikki Linnakangas, Simon Riggs, Amit Kapila
2012-12-06 17:47:30 -03:00
Simon Riggs f21bb9cfb5 Refactor inCommit flag into generic delayChkpt flag.
Rename PGXACT->inCommit flag into delayChkpt flag,
and generalise comments to allow use in other situations,
such as the forthcoming potential use in checksum patch.
Replace wait loop to look for VXIDs with delayChkpt set.
No user visible changes, not behaviour changes at present.

Simon Riggs, reviewed and rebased by Jeff Davis
2012-12-03 13:13:53 +00:00
Tom Lane 3114cb60a1 Don't advance checkPoint.nextXid near the end of a checkpoint sequence.
This reverts commit c11130690d in favor of
actually fixing the problem: namely, that we should never have been
modifying the checkpoint record's nextXid at this point to begin with.
The nextXid should match the state as of the checkpoint's logical WAL
position (ie the redo point), not the state as of its physical position.
It's especially bogus to advance it in some wal_levels and not others.
In any case there is no need for the checkpoint record to carry the
same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by
LogStandbySnapshot, as any replay operation will already have adopted
that value as current.

This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug
#6291 from Daniel Farina, in that if a checkpoint were in progress at the
instant of XID wraparound, the epoch bump would be lost as reported.
(And, of course, these days there's at least a 50-50 chance of a checkpoint
being in progress at any given instant.)

Diagnosed by me and independently by Andres Freund.  Back-patch to all
branches supporting hot standby.
2012-12-02 15:20:41 -05:00
Simon Riggs 5c11725867 Rearrange storage of data in xl_running_xacts.
Previously we stored all xids mixed together.
Now we store top-level xids first, followed
by all subxids. Also skip logging any subxids
if the snapshot is suboverflowed, since there
are potentially large numbers of them and they
are not useful in that case anyway. Has value
in the envisaged design for decoding of WAL.
No planned effect on Hot Standby.

Andres Freund, reviewed by me
2012-12-02 19:39:37 +00:00
Simon Riggs f1e57a4ec9 Cleanup VirtualXact at end of Hot Standby. 2012-11-29 21:59:11 +00:00
Heikki Linnakangas 1f67078ea3 Add OpenTransientFile, with automatic cleanup at end-of-xact.
Files opened with BasicOpenFile or PathNameOpenFile are not automatically
cleaned up on error. That puts unnecessary burden on callers that only want
to keep the file open for a short time. There is AllocateFile, but that
returns a buffered FILE * stream, which in many cases is not the nicest API
to work with. So add function called OpenTransientFile, which returns a
unbuffered fd that's cleaned up like the FILE* returned by AllocateFile().

This plugs a few rare fd leaks in error cases:

1. copy_file() - fixed by by using OpenTransientFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't
   use OpenTransientFile here because the fd is supposed to persist over
   transaction boundaries.
3. lo_import/lo_export - fixed by using OpenTransientFile instead of
   PathNameOpenFile.

In addition to plugging those leaks, this replaces many BasicOpenFile() calls
with OpenTransientFile() that were not leaking, because the code meticulously
closed the file on error. That wasn't strictly necessary, but IMHO it's good
for robustness.

The same leaks exist in older versions, but given the rarity of the issues,
I'm not backpatching this. Not yet, anyway - it might be good to backpatch
later, after this mechanism has had some more testing in master branch.
2012-11-27 10:25:50 +02:00
Tom Lane 3e7fdcffd6 Fix WaitLatch() to return promptly when the requested timeout expires.
If the sleep is interrupted by a signal, we must recompute the remaining
time to wait; otherwise, a steady stream of non-wait-terminating interrupts
could delay return from WaitLatch indefinitely.  This has been shown to be
a problem for the autovacuum launcher, and there may well be other places
now or in the future with similar issues.  So we'd better make the function
robust, even though this'll add at least one gettimeofday call per wait.

Back-patch to 9.2.  We might eventually need to fix 9.1 as well, but the
code is quite different there, and the usage of WaitLatch in 9.1 is so
limited that it's not clearly important to do so.

Reported and diagnosed by Jeff Janes, though I rewrote his patch rather
heavily.
2012-11-08 20:04:48 -05:00
Tom Lane ff3f9c8de5 Close un-owned SMgrRelations at transaction end.
If an SMgrRelation is not "owned" by a relcache entry, don't allow it to
live past transaction end.  This design allows the same SMgrRelation to be
used for blind writes of multiple blocks during a transaction, but ensures
that we don't hold onto such an SMgrRelation indefinitely.  Because an
SMgrRelation typically corresponds to open file descriptors at the fd.c
level, leaving it open when there's no corresponding relcache entry can
mean that we prevent the kernel from reclaiming deleted disk space.
(While CacheInvalidateSmgr messages usually fix that, there are cases
where they're not issued, such as DROP DATABASE.  We might want to add
some more sinval messaging for that, but I'd be inclined to keep this
type of logic anyway, since allowing VFDs to accumulate indefinitely
for blind-written relations doesn't seem like a good idea.)

This code replaces a previous attempt towards the same goal that proved
to be unreliable.  Back-patch to 9.1 where the previous patch was added.
2012-10-17 12:38:21 -04:00
Tom Lane 9bacf0e373 Revert "Use "transient" files for blind writes, take 2".
This reverts commit fba105b109.
That approach had problems with the smgr-level state not tracking what
we really want to happen, and with the VFD-level state not tracking the
smgr-level state very well either.  In consequence, it was still possible
to hold kernel file descriptors open for long-gone tables (as in recent
report from Tore Halset), and yet there were also cases of FDs being closed
undesirably soon.  A replacement implementation will follow.
2012-10-17 12:37:08 -04:00
Tom Lane e81e8f9342 Split up process latch initialization for more-fail-soft behavior.
In the previous coding, new backend processes would attempt to create their
self-pipe during the OwnLatch call in InitProcess.  However, pipe creation
could fail if the kernel is short of resources; and the system does not
recover gracefully from a FATAL error right there, since we have armed the
dead-man switch for this process and not yet set up the on_shmem_exit
callback that would disarm it.  The postmaster then forces an unnecessary
database-wide crash and restart, as reported by Sean Chittenden.

There are various ways we could rearrange the code to fix this, but the
simplest and sanest seems to be to split out creation of the self-pipe into
a new function InitializeLatchSupport, which must be called from a place
where failure is allowed.  For most processes that gets called in
InitProcess or InitAuxiliaryProcess, but processes that don't call either
but still use latches need their own calls.

Back-patch to 9.1, which has only a part of the latch logic that 9.2 and
HEAD have, but nonetheless includes this bug.
2012-10-14 22:59:56 -04:00
Tom Lane 7e0cce0265 Remove unnecessary overhead in backend's large-object operations.
Do read/write permissions checks at most once per large object descriptor,
not once per lo_read or lo_write call as before.  The repeated tests were
quite useless in the read case since the snapshot-based tests were
guaranteed to produce the same answer every time.  In the write case,
the extra tests could in principle detect revocation of write privileges
after a series of writes has started --- but there's a race condition there
anyway, since we'd check privileges before performing and certainly before
committing the write.  So there's no real advantage to checking every
single time, and we might as well redefine it as "only check the first
time".

On the same reasoning, remove the LargeObjectExists checks in inv_write
and inv_truncate.  We already checked existence when the descriptor was
opened, and checking again doesn't provide any real increment of safety
that would justify the cost.
2012-10-09 16:38:00 -04:00
Tom Lane 95d035e66d Autoconfiscate selection of 64-bit int type for 64-bit large object API.
Get rid of the fundamentally indefensible assumption that "long long int"
exists and is exactly 64 bits wide on every platform Postgres runs on.
Instead let the configure script select the type to use for "pg_int64".

This is a bit of a pain in the rear since we do not want to pollute client
namespace with all the random symbols that pg_config.h defines; instead
we have to create a separate generated header file, "pg_config_ext.h".
But now that the infrastructure is there, we might have the ability to
add some other stuff that's long been wanting in this area.
2012-10-07 21:52:43 -04:00
Tatsuo Ishii 7e2f8ed2b0 Fix compiling errors on Windows platform. Fix wrong usage of
INT64CONST macro. Fix lo_hton64 and lo_ntoh64 not to use int32_t and
uint32_t.
2012-10-07 23:30:31 +09:00
Tatsuo Ishii 461ef73f09 Add API for 64-bit large object access. Now users can access up to
4TB large objects (standard 8KB BLCKSZ case).  For this purpose new
libpq API lo_lseek64, lo_tell64 and lo_truncate64 are added.  Also
corresponding new backend functions lo_lseek64, lo_tell64 and
lo_truncate64 are added. inv_api.c is changed to handle 64-bit
offsets.

Patch contributed by Nozomi Anzai (backend side) and Yugo Nagata
(frontend side, docs, regression tests and example program). Reviewed
by Kohei Kaigai. Committed by Tatsuo Ishii with minor editings.
2012-10-07 08:36:48 +09:00
Tom Lane 73b796a52c Improve coding around the fsync request queue.
In all branches back to 8.3, this patch fixes a questionable assumption in
CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are
no uninitialized pad bytes in the request queue structs.  This would only
cause trouble if (a) there were such pad bytes, which could happen in 8.4
and up if the compiler makes enum ForkNumber narrower than 32 bits, but
otherwise would require not-currently-planned changes in the widths of
other typedefs; and (b) the kernel has not uniformly initialized the
contents of shared memory to zeroes.  Still, it seems a tad risky, and we
can easily remove any risk by pre-zeroing the request array for ourselves.
In addition to that, we need to establish a coding rule that struct
RelFileNode can't contain any padding bytes, since such structs are copied
into the request array verbatim.  (There are other places that are assuming
this anyway, it turns out.)

In 9.1 and up, the risk was a bit larger because we were also effectively
assuming that struct RelFileNodeBackend contained no pad bytes, and with
fields of different types in there, that would be much easier to break.
However, there is no good reason to ever transmit fsync or delete requests
for temp files to the bgwriter/checkpointer, so we can revert the request
structs to plain RelFileNode, getting rid of the padding risk and saving
some marginal number of bytes and cycles in fsync queue manipulation while
we are at it.  The savings might be more than marginal during deletion of
a temp relation, because the old code transmitted an entirely useless but
nonetheless expensive-to-process ForgetRelationFsync request to the
background process, and also had the background process perform the file
deletion even though that can safely be done immediately.

In addition, make some cleanup of nearby comments and small improvements to
the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.
2012-07-17 16:56:54 -04:00
Alvaro Herrera f34c68f096 Introduce timeout handling framework
Management of timeouts was getting a little cumbersome; what we
originally had was more than enough back when we were only concerned
about deadlocks and query cancel; however, when we added timeouts for
standby processes, the code got considerably messier.  Since there are
plans to add more complex timeouts, this seems a good time to introduce
a central timeout handling module.

External modules register their timeout handlers during process
initialization, and later enable and disable them as they see fit using
a simple API; timeout.c is in charge of keeping track of which timeouts
are in effect at any time, installing a common SIGALRM signal handler,
and calling setitimer() as appropriate to ensure timely firing of
external handlers.

timeout.c additionally supports pluggable modules to add their own
timeouts, though this capability isn't exercised anywhere yet.

Additionally, as of this commit, walsender processes are aware of
timeouts; we had a preexisting bug there that made those ignore SIGALRM,
thus being subject to unhandled deadlocks, particularly during the
authentication phase.  This has already been fixed in back branches in
commit 0bf8eb2a, which see for more details.

Main author: Zoltán Böszörményi
Some review and cleanup by Álvaro Herrera
Extensive reworking by Tom Lane
2012-07-16 22:55:33 -04:00
Robert Haas b79ab00144 When LWLOCK_STATS is defined, count spindelays.
When LWLOCK_STATS is *not* defined, the only change is that
SpinLockAcquire now returns the number of delays.

Patch by me, review by Jeff Janes.
2012-06-26 16:06:07 -04:00
Alvaro Herrera 77ed0c6950 Tighten up includes in sinvaladt.h, twophase.h, proc.h
Remove proc.h from sinvaladt.h and twophase.h; also replace xlog.h in
proc.h with xlogdefs.h.
2012-06-25 18:40:40 -04:00
Peter Eisentraut eeece9e609 Unify calling conventions for postgres/postmaster sub-main functions
There was a wild mix of calling conventions: Some were declared to
return void and didn't return, some returned an int exit code, some
claimed to return an exit code, which the callers checked, but
actually never returned, and so on.

Now all of these functions are declared to return void and decorated
with attribute noreturn and don't return.  That's easiest, and most
code already worked that way.
2012-06-25 21:30:12 +03:00
Heikki Linnakangas 0ab9d1c4b3 Replace XLogRecPtr struct with a 64-bit integer.
This simplifies code that needs to do arithmetic on XLogRecPtrs.

To avoid changing on-disk format of data pages, the LSN on data pages is
still stored in the old format. That should keep pg_upgrade happy. However,
we have XLogRecPtrs embedded in the control file, and in the structs that
are sent over the replication protocol, so this changes breaks compatibility
of pg_basebackup and server. I didn't do anything about this in this patch,
per discussion on -hackers, the right thing to do would to be to change the
replication protocol to be architecture-independent, so that you could use
a newer version of pg_receivexlog, for example, against an older server
version.
2012-06-24 19:19:45 +03:00
Heikki Linnakangas eeb6f37d89 Add a small cache of locks owned by a resource owner in ResourceOwner.
This speeds up reassigning locks to the parent owner, when the transaction
holds a lot of locks, but only a few of them belong to the current resource
owner. This is particularly helps pg_dump when dumping a large number of
objects.

The cache can hold up to 15 locks in each resource owner. After that, the
cache is marked as overflowed, and we fall back to the old method of
scanning the whole local lock table. The tradeoff here is that the cache has
to be scanned whenever a lock is released, so if the cache is too large,
lock release becomes more expensive. 15 seems enough to cover pg_dump, and
doesn't have much impact on lock release.

Jeff Janes, reviewed by Amit Kapila and Heikki Linnakangas.
2012-06-21 15:30:26 +03:00
Bruce Momjian 927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Tom Lane ece01aae47 Scan the buffer pool just once, not once per fork, during relation drop.
This provides a speedup of about 4X when NBuffers is large enough.
There is also a useful reduction in sinval traffic, since we
only do CacheInvalidateSmgr() once not once per fork.

Simon Riggs, reviewed and somewhat revised by Tom Lane
2012-06-07 17:43:11 -04:00
Heikki Linnakangas 9e4637bf89 Update comments that became out-of-date with the PGXACT struct.
When the "hot" members of PGPROC were split off to separate PGXACT structs,
many PGPROC fields referred to in comments were moved to PGXACT, but the
comments were neglected in the commit. Mostly this is just a search/replace
of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
prepared transactions changed more, making some of the comments totally
bogus.

Noah Misch
2012-05-14 10:28:55 +03:00
Tom Lane 6308ba05a7 Improve control logic for bgwriter hibernation mode.
Commit 6d90eaaa89 added a hibernation mode
to the bgwriter to reduce the server's idle-power consumption.  However,
its interaction with the detailed behavior of BgBufferSync's feedback
control loop wasn't very well thought out.  That control loop depends
primarily on the rate of buffer allocation, not the rate of buffer
dirtying, so the hibernation mode has to be designed to operate only when
no new buffer allocations are happening.  Also, the check for whether the
system is effectively idle was not quite right and would fail to detect
a constant low level of activity, thus allowing the bgwriter to go into
hibernation mode in a way that would let the cycle time vary quite a bit,
possibly further confusing the feedback loop.  To fix, move the wakeup
support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into
StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode
unless no buffer allocations have happened recently.

In addition, fix the delaying logic to remove the problem of possibly not
responding to signals promptly, which was basically caused by trying to use
the process latch's is_set flag for multiple purposes.  I can't prove it
but I'm suspicious that that hack was responsible for the intermittent
"postmaster does not shut down" failures we've been seeing in the buildfarm
lately.  In any case it did nothing to improve the readability or
robustness of the code.

In passing, express the hibernation sleep time as a multiplier on
BgWriterDelay, not a constant.  I'm not sure whether there's any value in
exposing the longer sleep time as an independently configurable setting,
but we can at least make it act like this for little extra code.
2012-05-09 23:37:10 -04:00
Simon Riggs bbd3ec9dce Rename BgWriterCommLock to CheckpointerCommLock 2012-05-09 14:11:48 +01:00
Tom Lane 5461564a9d Reduce idle power consumption of walwriter and checkpointer processes.
This patch modifies the walwriter process so that, when it has not found
anything useful to do for many consecutive wakeup cycles, it extends its
sleep time to reduce the server's idle power consumption.  It reverts to
normal as soon as it's done any successful flushes.  It's still true that
during any async commit, backends check for completed, unflushed pages of
WAL and signal the walwriter if there are any; so that in practice the
walwriter can get awakened and returned to normal operation sooner than the
sleep time might suggest.

Also, improve the checkpointer so that it uses a latch and a computed delay
time to not wake up at all except when it has something to do, replacing a
previous hardcoded 0.5 sec wakeup cycle.  This also is primarily useful for
reducing the server's power consumption when idle.

In passing, get rid of the dedicated latch for signaling the walwriter in
favor of using its procLatch, since that comports better with possible
generic signal handlers using that latch.  Also, fix a pre-existing bug
with failure to save/restore errno in walwriter's signal handlers.

Peter Geoghegan, somewhat simplified by Tom
2012-05-08 20:03:26 -04:00
Tom Lane 71b9549d05 Overdue code review for transaction-level advisory locks patch.
Commit 62c7bd31c8 had assorted problems, most
visibly that it broke PREPARE TRANSACTION in the presence of session-level
advisory locks (which should be ignored by PREPARE), as per a recent
complaint from Stephen Rees.  More abstractly, the patch made the
LockMethodData.transactional flag not merely useless but outright
dangerous, because in point of fact that flag no longer tells you anything
at all about whether a lock is held transactionally.  This fix therefore
removes that flag altogether.  We now rely entirely on the convention
already in use in lock.c that transactional lock holds must be owned by
some ResourceOwner, while session holds are never so owned.  Setting the
locallock struct's owner link to NULL thus denotes a session hold, and
there is no redundant marker for that.

PREPARE TRANSACTION now works again when there are session-level advisory
locks, and it is also able to transfer transactional advisory locks to the
prepared transaction, but for implementation reasons it throws an error if
we hold both types of lock on a single lockable object.  Perhaps it will be
worth improving that someday.

Assorted other minor cleanup and documentation editing, as well.

Back-patch to 9.1, except that in the 9.1 branch I did not remove the
LockMethodData.transactional flag for fear of causing an ABI break for
any external code that might be examining those structs.
2012-05-04 17:44:31 -04:00
Robert Haas 8e0c5195df Add missing parenthesis in comment. 2012-05-02 14:30:58 -04:00
Peter Eisentraut f2f9439fbf Remove dead ports
Remove the following ports:

- dgux
- nextstep
- sunos4
- svr4
- ultrix4
- univel

These are obsolete and not worth rescuing.  In most cases, there is
circumstantial evidence that they wouldn't work anymore anyway.
2012-05-01 22:11:12 +03:00
Tom Lane 309c64745e Rename track_iotiming GUC to track_io_timing.
This spelling seems significantly more readable to me.
2012-04-29 16:23:54 -04:00
Robert Haas ca1e1a8da1 Remove prototype for nonexistent function. 2012-04-25 15:32:15 -04:00
Robert Haas 4a6fab03f2 Finish rename of FastPathStrongLocks to FastPathStrongRelationLocks.
Commit 8e5ac74c12 tried to do this renaming,
but I relied on gcc to tell me where I needed to make changes, instead of
grep.

Noted by Jeff Davis.
2012-04-18 11:29:34 -04:00
Robert Haas 53c5b869b4 Tighten up error recovery for fast-path locking.
The previous code could cause a backend crash after BEGIN; SAVEPOINT a;
LOCK TABLE foo (interrupted by ^C or statement timeout); ROLLBACK TO
SAVEPOINT a; LOCK TABLE foo, and might have leaked strong-lock counts
in other situations.

Report by Zoltán Böszörményi; patch review by Jeff Davis.
2012-04-18 11:17:30 -04:00
Heikki Linnakangas 5762a4d909 Inherit max_safe_fds to child processes in EXEC_BACKEND mode.
Postmaster sets max_safe_fds by testing how many open file descriptors it
can open, and that is normally inherited by all child processes at fork().
Not so on EXEC_BACKEND, ie. Windows, however. Because of that, we
effectively ignored max_files_per_process on Windows, and always assumed
a conservative default of 32 simultaneous open files. That could have an
impact on performance, if you need to access a lot of different files
in a query. After this patch, the value is passed to child processes by
save/restore_backend_variables() among many other global variables.

It has been like this forever, but given the lack of complaints about it,
I'm not backpatching this.
2012-03-29 08:19:11 +03:00
Robert Haas 40b9b95769 New GUC, track_iotiming, to track I/O timings.
Currently, the only way to see the numbers this gathers is via
EXPLAIN (ANALYZE, BUFFERS), but the plan is to add visibility through
the stats collector and pg_stat_statements in subsequent patches.

Ants Aasma, reviewed by Greg Smith, with some further changes by me.
2012-03-27 14:55:02 -04:00
Heikki Linnakangas aef5fe7efe Add comments explaining why our Itanium spinlock implementation is safe. 2012-03-16 10:14:45 +02:00
Heikki Linnakangas 1a01560cbb Rename LWLockWaitUntilFree to LWLockAcquireOrWait.
LWLockAcquireOrWait makes it more clear that the lock is acquired if it's
free.
2012-02-08 09:17:13 +02:00
Heikki Linnakangas 9b38d46d9f Make group commit more effective.
When a backend needs to flush the WAL, and someone else is already flushing
the WAL, wait until it releases the WALInsertLock and check if we still need
to do the flush or if the other backend already did the work for us, before
acquiring WALInsertLock. This helps group commit, because when the WAL flush
finishes, all the backends that were waiting for it can be woken up in one
go, and the can all concurrently observe that they're done, rather than
waking them up one by one in a cascading fashion.

This is based on a new LWLock function, LWLockWaitUntilFree(), which has
peculiar semantics. If the lock is immediately free, it grabs the lock and
returns true. If it's not free, it waits until it is released, but then
returns false without grabbing the lock. This is used in XLogFlush(), so
that when the lock is acquired, the backend flushes the WAL, but if it's
not, the backend first checks the current flush location before retrying.

Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
this patch as committed ended up being very different from that.
2012-01-30 16:53:48 +02:00
Tom Lane dd243b3e40 Fix typo in comment.
Peter Geoghegan
2012-01-29 18:56:35 -05:00
Heikki Linnakangas 6d90eaaa89 Make bgwriter sleep longer when it has no work to do, to save electricity.
To make it wake up promptly when activity starts again, backends nudge it
by setting a latch in MarkBufferDirty(). The latch is kept set while
bgwriter is active, so there is very little overhead from that when the
system is busy. It is only armed before going into longer sleep.

Peter Geoghegan, with some changes by me.
2012-01-26 18:39:13 +02:00
Simon Riggs c172b7b02e Resolve timing issue with logging locks for Hot Standby.
We log AccessExclusiveLocks for replay onto standby nodes,
but because of timing issues on ProcArray it is possible to
log a lock that is still held by a just committed transaction
that is very soon to be removed. To avoid any timing issue we
avoid applying locks made by transactions with InvalidXid.

Simon Riggs, bug report Tom Lane, diagnosis Pavan Deolasee
2012-01-23 23:37:32 +00:00
Tom Lane 0a41e86584 Use __sync_lock_test_and_set() for spinlocks on ARM, if available.
Historically we've used the SWPB instruction for TAS() on ARM, but this
is deprecated and not available on ARMv6 and later.  Instead, make use
of a GCC builtin if available.  We'll still fall back to SWPB if not,
so as not to break existing ports using older GCC versions.

Eventually we might want to try using __sync_lock_test_and_set() on some
other architectures too, but for now that seems to present only risk and
not reward.

Back-patch to all supported versions, since people might want to use any
of them on more recent ARM chips.

Martin Pitt
2012-01-07 15:38:52 -05:00
Tom Lane bc2a050d40 Use a non-locking initial test in TAS_SPIN on PPC.
Further testing convinces me that this is helpful at sufficiently high
contention levels, though it's still worrisome that it loses slightly
at lower contention levels.

Per Manabu Ori.
2012-01-03 16:00:06 -05:00
Tom Lane 631beeac35 Use LWSYNC in place of SYNC/ISYNC in PPC spinlocks, where possible.
This is allegedly a win, at least on some PPC implementations, according
to the PPC ISA documents.  However, as with LWARX hints, some PPC
platforms give an illegal-instruction failure.  Use the same trick as
before of assuming that PPC64 platforms will accept it; we might need to
refine that based on experience, but there are other projects doing
likewise according to google.

I did not add an assembler compatibility test because LWSYNC has been
around much longer than hint bits, and it seems unlikely that any
toolchains currently in use don't recognize it.
2012-01-02 00:02:02 -05:00
Tom Lane 8496c6cd77 Use 4-byte slock_t on both PPC and PPC64.
Previously we defined slock_t as 8 bytes on PPC64, but the TAS assembly
code uses word-wide operations regardless, so that the second word was
just wasted space.  There doesn't appear to be any performance benefit
in adding the second word, so get rid of it to simplify the code.
2012-01-02 00:02:01 -05:00
Tom Lane 5cfa8dd300 Use mutex hint bit in PPC LWARX instructions, where possible.
The hint bit makes for a small but measurable performance improvement
in access to contended spinlocks.

On the other hand, some PPC chips give an illegal-instruction failure.
There doesn't seem to be a completely bulletproof way to tell whether the
hint bit will cause an illegal-instruction failure other than by trying
it; but most if not all 64-bit PPC machines should accept it, so follow
the Linux kernel's lead and assume it's okay to use it in 64-bit builds.
Of course we must also check whether the assembler accepts the command,
since even with a recent CPU the toolchain could be old.

Patch by Manabu Ori, significantly modified by me.
2012-01-02 00:02:00 -05:00
Bruce Momjian e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Robert Haas ed0b409d22 Move "hot" members of PGPROC into a separate PGXACT array.
This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order.  These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.

Pavan Deolasee, Heikki Linnakangas, Robert Haas
2011-11-25 08:02:10 -05:00
Robert Haas 71b2b657c0 Revert removal of trace_userlocks, because userlocks aren't gone.
This reverts commit 0180bd6180.
contrib/userlock is gone, but user-level locking still exists,
and is exposed via the pg_advisory* family of functions.
2011-11-10 17:54:27 -05:00
Simon Riggs 86e3364899 Derive oldestActiveXid at correct time for Hot Standby.
There was a timing window between when oldestActiveXid was derived
and when it should have been derived that only shows itself under
heavy load. Move code around to ensure correct timing of derivation.
No change to StartupSUBTRANS() code, which is where this failed.

Bug report by Chris Redekop
2011-11-02 08:54:56 +00:00
Simon Riggs 806a2aee37 Split work of bgwriter between 2 processes: bgwriter and checkpointer.
bgwriter is now a much less important process, responsible for page
cleaning duties only. checkpointer is now responsible for checkpoints
and so has a key role in shutdown. Later patches will correct doc
references to the now old idea that bgwriter performs checkpoints.
Has beneficial effect on performance at high write rates, but mainly
refactoring to more easily allow changes for power reduction by
simplifying previously tortuous code around required to allow page
cleaning and checkpointing to time slice in the same process.

Patch by me, Review by Dickson Guedes
2011-11-01 17:14:47 +00:00
Robert Haas 53f1ca59b5 Allow hint bits to be set sooner for temporary and unlogged tables.
We need not wait until the commit record is durably on disk, because
in the event of a crash the page we're updating with hint bits will
be gone anyway.  Per off-list report from Heikki Linnakangas, this
can significantly degrade the performance of unlogged tables; I was
able to show a 2x speedup from this patch on a pgbench run with scale
factor 15.  In practice, this will mostly help small, heavily updated
tables, because on larger tables you're unlikely to run into the same
row again before the commit record makes it out to disk.
2011-10-28 17:08:09 -04:00
Robert Haas b6335a3f1b Demote some sanity checks in BufferIsValid() to assertions.
Testing reveals that this macro is a hot-spot for index-only-scans.
Per discussion with Tom Lane.
2011-10-28 17:04:22 -04:00
Tom Lane bb446b689b Support synchronization of snapshots through an export/import procedure.
A transaction can export a snapshot with pg_export_snapshot(), and then
others can import it with SET TRANSACTION SNAPSHOT.  The data does not
leave the server so there are not security issues.  A snapshot can only
be imported while the exporting transaction is still running, and there
are some other restrictions.

I'm not totally convinced that we've covered all the bases for SSI (true
serializable) mode, but it works fine for lesser isolation modes.

Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified
by Tom Lane
2011-10-22 18:23:30 -04:00
Bruce Momjian 0180bd6180 Remove all "traces" of trace_userlocks, because userlocks were removed
in PG 8.2.
2011-10-13 19:59:57 -04:00
Heikki Linnakangas 041dceb259 Fix typo. 2011-10-08 11:04:07 +03:00
Robert Haas 6a6082c27c Try to fix memory barriers on x86_64.
%esp is no good; must use %rsp there.
2011-10-07 23:42:57 -04:00
Tom Lane 57eb009092 Allow snapshot references to still work during transaction abort.
In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do
GetTransactionSnapshot() between AbortTransaction and CleanupTransaction
failed, because GetTransactionSnapshot would recompute the transaction
snapshot (which is already wrong, given the isolation mode) and then
re-register it in the TopTransactionResourceOwner, leading to an Assert
because the TopTransactionResourceOwner should be empty of resources after
AbortTransaction.  This is the root cause of bug #6218 from Yamamoto
Takashi.  While changing plancache.c to avoid requesting a snapshot when
handling a ROLLBACK masks the problem, I think this is really a snapmgr.c
bug: it's lower-level than the resource manager mechanism and should not be
shutting itself down before we unwind resource manager resources.  However,
just postponing the release of the transaction snapshot until cleanup time
didn't work because of the circular dependency with
TopTransactionResourceOwner.  Fix by managing the internal reference to
that snapshot manually instead of depending on TopTransactionResourceOwner.
This saves a few cycles as well as making the module layering more
straightforward.  predicate.c's dependencies on TopTransactionResourceOwner
go away too.

I think this is a longstanding bug, but there's no evidence that it's more
than a latent bug, so it doesn't seem worth any risk of back-patching.
2011-09-26 22:25:28 -04:00
Robert Haas 0c8eda6258 Memory barrier support for PostgreSQL.
This is not actually used anywhere yet, but it gets the basic
infrastructure in place.  It is fairly likely that there are bugs, and
support for some important platforms may be missing, so we'll need to
refine this as we go along.
2011-09-23 17:52:43 -04:00
Tom Lane a7801b62f2 Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h.
As per my recent proposal, this refactors things so that these typedefs and
macros are available in a header that can be included in frontend-ish code.
I also changed various headers that were undesirably including
utils/timestamp.h to include datatype/timestamp.h instead.  Unsurprisingly,
this showed that half the system was getting utils/timestamp.h by way of
xlog.h.

No actual code changes here, just header refactoring.
2011-09-09 13:23:41 -04:00
Tom Lane 1609797c25 Clean up the #include mess a little.
walsender.h should depend on xlog.h, not vice versa.  (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.)  Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h.  Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.

This episode reinforces my feeling that pgrminclude should not be run
without adult supervision.  Inclusion changes in header files in particular
need to be reviewed with great care.  More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
2011-09-04 01:13:16 -04:00
Tom Lane f116b1f5b8 Remove unnecessary and circular #include.
storage/proc.h should not include replication/syncrep.h, especially not
when the latter includes storage/proc.h; but in any case this was a pretty
poor thing from a modular layering standpoint.
2011-09-03 22:14:45 -04:00
Bruce Momjian 6416a82a62 Remove unnecessary #include references, per pgrminclude script. 2011-09-01 10:04:27 -04:00
Tom Lane be1e8053f4 Use a non-locking test in TAS_SPIN() on all IA64 platforms.
Per my testing, this works just as well with gcc as it does with HP's
compiler; and there is no reason to think that the effect doesn't occur
with icc, either.

Also, rewrite the header comment about enforcing sequencing around spinlock
operations, per Robert's gripe that it was misleading.
2011-08-29 13:18:44 -04:00
Robert Haas c01c25fbe5 Improve spinlock performance for HP-UX, ia64, non-gcc.
At least on this architecture, it's very important to spin on a
non-atomic instruction and only retry the atomic once it appears
that it will succeed.  To fix this, split TAS() into two macros:
TAS(), for trying to grab the lock the first time, and TAS_SPIN(),
for spinning until we get it.  TAS_SPIN() defaults to same as TAS(),
but we can override it when we know there's a better way.

It's likely that some of the other cases in s_lock.h require
similar treatment, but this is the only one we've got conclusive
evidence for at present.
2011-08-29 10:05:48 -04:00
Tom Lane b5282aa893 Revise sinval code to remove no-longer-used tuple TID from inval messages.
This requires adjusting the API for syscache callback functions: they now
get a hash value, not a TID, to identify the target tuple.  Most of them
weren't paying any attention to that argument anyway, but plancache did
require a small amount of fixing.

Also, improve performance a trifle by avoiding sending duplicate inval
messages when a heap_update isn't changing the catcache lookup columns.
2011-08-16 19:27:46 -04:00
Tom Lane 4dab3d5ae1 Change the autovacuum launcher to use WaitLatch instead of a poll loop.
In pursuit of this (and with the expectation that WaitLatch will be needed
in more places), convert the latch field that was already added to PGPROC
for sync rep into a generic latch that is activated for all PGPROC-owning
processes, and change many of the standard backend signal handlers to set
that latch when a signal happens.  This will allow WaitLatch callers to be
wakened properly by these signals.

In passing, fix a whole bunch of signal handlers that had been hacked to do
things that might change errno, without adding the necessary save/restore
logic for errno.  Also make some minor fixes in unix_latch.c, and clean
up bizarre and unsafe scheme for disowning the process's latch.  Much of
this has to be back-patched into 9.1.

Peter Geoghegan, with additional work by Tom
2011-08-10 12:22:21 -04:00
Tom Lane 4e15a4db5e Documentation improvement and minor code cleanups for the latch facility.
Improve the documentation around weak-memory-ordering risks, and do a pass
of general editorialization on the comments in the latch code.  Make the
Windows latch code more like the Unix latch code where feasible; in
particular provide the same Assert checks in both implementations.
Fix poorly-placed WaitLatch call in syncrep.c.

This patch resolves, for the moment, concerns around weak-memory-ordering
bugs in latch-related code: we have documented the restrictions and checked
that existing calls meet them.  In 9.2 I hope that we will install suitable
memory barrier instructions in SetLatch/ResetLatch, so that their callers
don't need to be quite so careful.
2011-08-09 15:30:45 -04:00
Robert Haas 84e3712677 Create VXID locks "lazily" in the main lock table.
Instead of entering them on transaction startup, we materialize them
only when someone wants to wait, which will occur only during CREATE
INDEX CONCURRENTLY.  In Hot Standby mode, the startup process must also
be able to probe for conflicting VXID locks, but the lock need never be
fully materialized, because the startup process does not use the normal
lock wait mechanism.  Since most VXID locks never need to touch the
lock manager partition locks, this can significantly reduce blocking
contention on read-heavy workloads.

Patch by me.  Review by Jeff Davis.
2011-08-04 12:38:33 -04:00
Tom Lane ac36e6f71f Move CheckRecoveryConflictDeadlock() call to a safer place.
This kluge was inserted in a spot apparently chosen at random: the lock
manager's state is not yet fully set up for the wait, and in particular
LockWaitCancel hasn't been armed by setting lockAwaited, so the ProcLock
will not get cleaned up if the ereport is thrown.  This seems to not cause
any observable problem in trivial test cases, because LockReleaseAll will
silently clean up the debris; but I was able to cause failures with tests
involving subtransactions.

Fixes breakage induced by commit c85c941470.
Back-patch to all affected branches.
2011-08-02 15:16:29 -04:00
Tom Lane 2e53bd5517 Fix incorrect initialization of ProcGlobal->startupBufferPinWaitBufId.
It was initialized in the wrong place and to the wrong value.  With bad
luck this could result in incorrect query-cancellation failures in hot
standby sessions, should a HS backend be holding pin on buffer number 1
while trying to acquire a lock.
2011-08-02 13:23:52 -04:00
Robert Haas 3cba8999b3 Create a "fast path" for acquiring weak relation locks.
When an AccessShareLock, RowShareLock, or RowExclusiveLock is requested
on an unshared database relation, and we can verify that no conflicting
locks can possibly be present, record the lock in a per-backend queue,
stored within the PGPROC, rather than in the primary lock table.  This
eliminates a great deal of contention on the lock manager LWLocks.

This patch also refactors the interface between GetLockStatusData() and
pg_lock_status() to be a bit more abstract, so that we don't rely so
heavily on the lock manager's internal representation details.  The new
fast path lock structures don't have a LOCK or PROCLOCK structure to
return, so we mustn't depend on that for purposes of listing outstanding
locks.

Review by Jeff Davis.
2011-07-18 00:49:28 -04:00
Robert Haas 4240e429d0 Try to acquire relation locks in RangeVarGetRelid.
In the previous coding, we would look up a relation in RangeVarGetRelid,
lock the resulting OID, and then AcceptInvalidationMessages().  While
this was sufficient to ensure that we noticed any changes to the
relation definition before building the relcache entry, it didn't
handle the possibility that the name we looked up no longer referenced
the same OID.  This was particularly problematic in the case where a
table had been dropped and recreated: we'd latch on to the entry for
the old relation and fail later on.  Now, we acquire the relation lock
inside RangeVarGetRelid, and retry the name lookup if we notice that
invalidation messages have been processed meanwhile.  Many operations
that would previously have failed with an error in the presence of
concurrent DDL will now succeed.

There is a good deal of work remaining to be done here: many callers
of RangeVarGetRelid still pass NoLock for one reason or another.  In
addition, nothing in this patch guards against the possibility that
the meaning of an unqualified name might change due to the creation
of a relation in a schema earlier in the user's search path than the
one where it was previously found.  Furthermore, there's nothing at
all here to guard against similar race conditions for non-relations.
For all that, it's a start.

Noah Misch and Robert Haas
2011-07-08 22:19:30 -04:00
Heikki Linnakangas 89fd72cbf2 Introduce a pipe between postmaster and each backend, which can be used to
detect postmaster death. Postmaster keeps the write-end of the pipe open,
so when it dies, children get EOF in the read-end. That can conveniently
be waited for in select(), which allows eliminating some of the polling
loops that check for postmaster death. This patch doesn't yet change all
the loops to use the new mechanism, expect a follow-on patch to do that.

This changes the interface to WaitLatch, so that it takes as argument a
bitmask of events that it waits for. Possible events are latch set, timeout,
postmaster death, and socket becoming readable or writeable.

The pipe method behaves slightly differently from the kill() method
previously used in PostmasterIsAlive() in the case that postmaster has died,
but its parent has not yet read its exit code with waitpid(). The pipe
returns EOF as soon as the process dies, but kill() continues to return
true until waitpid() has been called (IOW while the process is a zombie).
Because of that, change PostmasterIsAlive() to use the pipe too, otherwise
WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while
PostmasterIsAlive() would claim it's still alive. That could easily lead to
busy-waiting while postmaster is in zombie state.

Peter Geoghegan with further changes by me, reviewed by Fujii Masao and
Florian Pflug.
2011-07-08 18:44:07 +03:00
Heikki Linnakangas 406d61835b SSI has a race condition, where the order of commit sequence numbers of
transactions might not match the order the work done in those transactions
become visible to others. The logic in SSI, however, assumed that it does.
Fix that by having two sequence numbers for each serializable transaction,
one taken before a transaction becomes visible to others, and one after it.
This is easier than trying to make the the transition totally atomic, which
would require holding ProcArrayLock and SerializableXactHashLock at the same
time. By using prepareSeqNo instead of commitSeqNo in a few places where
commit sequence numbers are compared, we can make those comparisons err on
the safe side when we don't know for sure which committed first.

Per analysis by Kevin Grittner and Dan Ports, but this approach to fix it
is different from the original patch.
2011-07-07 23:26:34 +03:00
Heikki Linnakangas 5da417f7c4 Remove pointless const qualifiers from function arguments in the SSI code.
As Tom Lane pointed out, "const Relation foo" doesn't guarantee that you
can't modify the data the "foo" pointer points to. It just means that you
can't change the pointer to point to something else within the function,
which is not very useful.
2011-06-22 12:18:39 +03:00
Heikki Linnakangas 7cb2ff9621 Fix bug introduced by recent SSI patch to merge ROLLED_BACK and
MARKED_FOR_DEATH flags into one. We still need the ROLLED_BACK flag to
mark transactions that are in the process of being rolled back. To be
precise, ROLLED_BACK now means that a transaction has already been
discounted from the count of transactions with the oldest xmin, but not
yet removed from the list of active transactions.

Dan Ports
2011-06-21 14:49:50 +03:00
Heikki Linnakangas cb94db91b2 pgindent run of recent SSI changes. Also, remove an unnecessary #include.
Kevin Grittner
2011-06-16 16:17:22 +03:00
Heikki Linnakangas 264a6b127a The rolled-back flag on serializable xacts was pointless and redundant with
the marked-for-death flag. It was only set for a fleeting moment while a
transaction was being cleaned up at rollback. All the places that checked
for the rolled-back flag should also check the marked-for-death flag, as
both flags mean that the transaction will roll back. I also renamed the
marked-for-death into "doomed", which is a lot shorter name.
2011-06-15 13:35:28 +03:00
Heikki Linnakangas 0a0e2b52a5 Make non-MVCC snapshots exempt from predicate locking. Scans with non-MVCC
snapshots, like in REINDEX, are basically non-transactional operations. The
DDL operation itself might participate in SSI, but there's separate
functions for that.

Kevin Grittner and Dan Ports, with some changes by me.
2011-06-15 12:11:18 +03:00
Heikki Linnakangas cb2d158c58 Fix locking while setting flags in MySerializableXact.
Even if a flag is modified only by the backend owning the transaction, it's
not safe to modify it without a lock. Another backend might be setting or
clearing a different flag in the flags field concurrently, and that
operation might be lost because setting or clearing a bit in a word is not
atomic.

Make did-write flag a simple backend-private boolean variable, because it
was only set or tested in the owning backend (except when committing a
prepared transaction, but it's not worthwhile to optimize for the case of a
read-only prepared transaction). This also eliminates the need to add
locking where that flag is set.

Also, set the did-write flag when doing DDL operations like DROP TABLE or
TRUNCATE -- that was missed earlier.
2011-06-10 23:41:10 +03:00
Alvaro Herrera fba105b109 Use "transient" files for blind writes, take 2
"Blind writes" are a mechanism to push buffers down to disk when
evicting them; since they may belong to different databases than the one
a backend is connected to, the backend does not necessarily have a
relation to link them to, and thus no way to blow them away.  We were
keeping those files open indefinitely, which would cause a problem if
the underlying table was deleted, because the operating system would not
be able to reclaim the disk space used by those files.

To fix, have bufmgr mark such files as transient to smgr; the lower
layer is allowed to close the file descriptor when the current
transaction ends.  We must be careful to have any other access of the
file to remove the transient markings, to prevent unnecessary expensive
system calls when evicting buffers belonging to our own database (which
files we're likely to require again soon.)

This commit fixes a bug in the previous one, which neglected to cleanly
handle the LRU ring that fd.c uses to manage open files, and caused an
unacceptable failure just before beta2 and was thus reverted.
2011-06-10 13:43:02 -04:00
Heikki Linnakangas c79c570bd8 Small comment fixes and enhancements. 2011-06-10 17:22:46 +03:00
Alvaro Herrera 9261557eb1 Revert "Use "transient" files for blind writes"
This reverts commit 54d9e8c6c1, which
caused a failure on the buildfarm.  Not a good thing to have just before
a beta release.
2011-06-09 16:41:44 -04:00
Alvaro Herrera 54d9e8c6c1 Use "transient" files for blind writes
"Blind writes" are a mechanism to push buffers down to disk when
evicting them; since they may belong to different databases than the one
a backend is connected to, the backend does not necessarily have a
relation to link them to, and thus no way to blow them away.  We were
keeping those files open indefinitely, which would cause a problem if
the underlying table was deleted, because the operating system would not
be able to reclaim the disk space used by those files.

To fix, have bufmgr mark such files as transient to smgr; the lower
layer is allowed to close the file descriptor when the current
transaction ends.  We must be careful to have any other access of the
file to remove the transient markings, to prevent unnecessary expensive
system calls when evicting buffers belonging to our own database (which
files we're likely to require again soon.)
2011-06-09 16:25:49 -04:00
Heikki Linnakangas 8f9622bbb3 Make DDL operations play nicely with Serializable Snapshot Isolation.
Truncating or dropping a table is treated like deletion of all tuples, and
check for conflicts accordingly. If a table is clustered or rewritten by
ALTER TABLE, all predicate locks on the heap are promoted to relation-level
locks, because the tuple or page ids of any existing tuples will change and
won't be valid after rewriting the table. Arguably ALTER TABLE should be
treated like a mass-UPDATE of every row, but if you e.g change the datatype
of a column, you could also argue that it's just a change to the physical
layout, not a logical change. Reindexing promotes all locks on the index to
relation-level lock on the heap.

Kevin Grittner, with a lot of cosmetic changes by me.
2011-06-08 14:02:43 +03:00
Heikki Linnakangas c8630919e0 SSI comment fixes and enhancements. Notably, document that the conflict-out
flag actually means that the transaction has a conflict out to a transaction
that committed before the flagged transaction.

Kevin Grittner
2011-06-03 12:45:42 +03:00
Heikki Linnakangas 3103f9a77d The row-version chaining in Serializable Snapshot Isolation was still wrong.
On further analysis, it turns out that it is not needed to duplicate predicate
locks to the new row version at update, the lock on the version that the
transaction saw as visible is enough. However, there was a different bug in
the code that checks for dangerous structures when a new rw-conflict happens.
Fix that bug, and remove all the row-version chaining related code.

Kevin Grittner & Dan Ports, with some comment editorialization by me.
2011-05-30 20:47:17 +03:00
Bruce Momjian bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Robert Haas 38b27792ea Avoid possible hang during smart shutdown.
If a smart shutdown occurs just as a child is starting up, and the
child subsequently becomes a walsender, there is a race condition:
the postmaster might count the exstant backends, determine that there
is one normal backend, and wait for it to die off.  Had the walsender
transition already occurred before the postmaster counted, it would
have proceeded with the shutdown.

To fix this, have each child that transforms into a walsender kick
the postmaster just after doing so, so that the state machine is
certain to advance.

Fujii Masao
2011-04-03 19:42:00 -04:00
Heikki Linnakangas c8ae318cbe Increase SHMEM_INDEX_SIZE from 32 to 64. We're currently at 40 entries in
ShmemIndex, so 64 leaves some headroom.

Kevin Grittner
2011-03-31 13:37:01 +03:00
Heikki Linnakangas 754baa21f7 Automatically terminate replication connections that are idle for more
than replication_timeout (a new GUC) milliseconds. The TCP timeout is often
too long, you want the master to notice a dead connection much sooner.
People complained about that in 9.0 too, but with synchronous replication
it's even more important to notice dead connections promptly.

Fujii Masao and Heikki Linnakangas
2011-03-30 10:20:37 +03:00
Robert Haas 2e019c8611 More synchronous replication typo fixes.
Fujii Masao
2011-03-10 15:56:18 -05:00
Heikki Linnakangas 4cd3fb6e12 Truncate predicate lock manager's SLRU lazily at checkpoint. That's safer
than doing it aggressively whenever the tail-XID pointer is advanced, because
this way we don't need to do it while holding SerializableXactHashLock.

This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an
obsolete comment spotted by Kevin Grittner.
2011-03-08 12:12:54 +02:00
Simon Riggs a8a8a3e096 Efficient transaction-controlled synchronous replication.
If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to takeover in case of standby failure.

This version has very strict behaviour; more relaxed options
may be added at a later date.

Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.
2011-03-06 22:49:16 +00:00
Heikki Linnakangas 47ad79122b Fix bugs in Serializable Snapshot Isolation.
Change the way UPDATEs are handled. Instead of maintaining a chain of
tuple-level locks in shared memory, copy any existing locks on the old
tuple to the new tuple at UPDATE. Any existing page-level lock needs to
be duplicated too, as a lock on the new tuple. That was neglected
previously.

Store xmin on tuple-level predicate locks, to distinguish a lock on an old
already-recycled tuple from a new tuple at the same physical location.
Failure to distinguish them caused loops in the tuple-lock chains, as
reported by YAMAMOTO Takashi. Although we don't use the chain representation
of UPDATEs anymore, it seems like a good idea to store the xmin to avoid
some false positives if no other reason.

CheckSingleTargetForConflictsIn now correctly handles the case where a lock
that's being held is not reflected in the local lock table. That happens
if another backend acquires a lock on our behalf due to an UPDATE or a page
split.

PredicateLockPageCombine now retains locks for the page that is being
removed, rather than removing them. This prevents a potentially dangerous
false-positive inconsistency where the local lock table believes that a lock
is held, but it is actually not.

Dan Ports and Kevin Grittner
2011-03-01 19:05:16 +02:00
Itagaki Takahiro 62c7bd31c8 Add transaction-level advisory locks.
They share the same locking namespace with the existing session-level
advisory locks, but they are automatically released at the end of the
current transaction and cannot be released explicitly via unlock
functions.

Marko Tiikkaja, reviewed by me.
2011-02-18 14:05:12 +09:00
Heikki Linnakangas f9f9d696a9 UINT64_MAX isn't defined on MSVC. 2011-02-08 18:15:53 +02:00
Heikki Linnakangas dafaa3efb7 Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.

To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.

A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.

Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.

We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.

Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.

Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-08 00:09:08 +02:00
Heikki Linnakangas 8f5d65e916 Treat a WAL sender process that hasn't started streaming yet as a regular
backend, as far as the postmaster shutdown logic is concerned. That means,
fast shutdown will wait for WAL sender processes to exit before signaling
bgwriter to finish. This avoids race conditions between a base backup stopping
or starting, and bgwriter writing the shutdown checkpoint WAL record. We don't
want e.g the end-of-backup WAL record to be written after the shutdown
checkpoint.
2011-01-15 16:38:21 +02:00
Bruce Momjian 5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Robert Haas 53dbc27c62 Support unlogged tables.
The contents of an unlogged table are WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery.  Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
2010-12-29 06:48:53 -05:00
Simon Riggs e620ee35b2 Optimize commit_siblings in two ways to improve group commit.
First, avoid scanning the whole ProcArray once we know there
are at least commit_siblings active; second, skip the check
altogether if commit_siblings = 0.

Greg Smith
2010-12-08 18:48:03 +00:00
Heikki Linnakangas 5a031a5556 Fix bugs in the hot standby known-assigned-xids tracking logic. If there's
an old transaction running in the master, and a lot of transactions have
started and finished since, and a WAL-record is written in the gap between
the creating the running-xacts snapshot and WAL-logging it, recovery will fail
with "too many KnownAssignedXids" error. This bug was reported by
Joachim Wieland on Nov 19th.

In the same scenario, when fewer transactions have started so that all the
xids fit in KnownAssignedXids despite the first bug, a more serious bug
arises. We incorrectly initialize the clog code with the oldest still running
transaction, and when we see the WAL record belonging to a transaction with
an XID larger than one that committed already before the checkpoint we're
recovering from, we zero the clog page containing the already committed
transaction, leading to data loss.

In hindsight, trying to track xids in the known-assigned-xids array before
seeing the running-xacts record was too complicated. To fix that, hold
XidGenLock while the running-xacts snapshot is taken and WAL-logged. That
ensures that no transaction can begin or end in that gap, so that in recvoery
we know that the snapshot contains all transactions running at that point in
WAL.
2010-12-07 09:23:30 +01:00
Simon Riggs ed78384acd Move call to GetTopTransactionId() earlier in LockAcquire(),
removing an infrequently occurring race condition in Hot Standby.
An xid must be assigned before a lock appears in shared memory,
rather than immediately after, else GetRunningTransactionLocks()
may see InvalidTransactionId, causing assertion failures during
lock processing on standby.

Bug report and diagnosis by Fujii Masao, fix by me.
2010-11-29 01:08:02 +00:00
Peter Eisentraut fc946c39ae Remove useless whitespace at end of lines 2010-11-23 22:34:55 +02:00
Robert Haas 20cf8ae478 Fix copy-and-pasteo a little more completely.
copydir.c is no longer in src/port
2010-11-15 10:10:58 -05:00
Alvaro Herrera ae4b17edee Fix copy-and-pasteo. 2010-11-15 11:52:56 -03:00
Robert Haas 11e482c350 Move copydir() prototype into its own header file.
Having this in src/include/port.h makes no sense, now that copydir.c lives
in src/backend/strorage rather than src/port.  Along the way, remove an
obsolete comment from contrib/pg_upgrade that makes reference to the old
location.
2010-11-12 16:39:53 -05:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Heikki Linnakangas 236b6bc29e Simplify Windows implementation of latches. There's no need to keep a
dynamic pool of event handles, we can permanently assign one for each
shared latch. Thanks to that, we no longer need a separate shared memory
block for latches, and we don't need to know in advance how many shared
latches there is, so you no longer need to remember to update
NumSharedLatches when you introduce a new latch to the system.
2010-09-15 10:06:21 +00:00
Heikki Linnakangas 2746e5f21d Introduce latches. A latch is a boolean variable, with the capability to
wait until it is set. Latches can be used to reliably wait until a signal
arrives, which is hard otherwise because signals don't interrupt select()
on some platforms, and even when they do, there's race conditions.

On Unix, latches use the so called self-pipe trick under the covers to
implement the sleep until the latch is set, without race conditions. On
Windows, Windows events are used.

Use the new latch abstraction to sleep in walsender, so that as soon as
a transaction finishes, walsender is woken up to immediately send the WAL
to the standby. This reduces the latency between master and standby, which
is good.

Preliminary work by Fujii Masao. The latch implementation is by me, with
helpful comments from many people.
2010-09-11 15:48:04 +00:00
Tom Lane b9defe0405 Marginal code cleanup for streaming replication.
There is no reason that proc.c should have to get involved in this dirty hack
for letting the postmaster know which children are walsenders.  Revert that
file to the way it was, and confine the kluge to pmsignal.c and postmaster.c.
2010-08-23 17:20:01 +00:00
Tom Lane 946b078044 MyBackendId now needs to be PGDLLIMPORT, so that contrib modules can
access it on Windows.  Per buildfarm.
2010-08-14 13:37:21 +00:00
Robert Haas debcec7dc3 Include the backend ID in the relpath of temporary relations.
This allows us to reliably remove all leftover temporary relation
files on cluster startup without reference to system catalogs or WAL;
therefore, we no longer include temporary relations in XLOG_XACT_COMMIT
and XLOG_XACT_ABORT WAL records.

Since these changes require including a backend ID in each
SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id
field has been reduced from two bytes to one, and the maximum number
of connections has been reduced from INT_MAX / 4 to 2^23-1.  It would
be possible to remove these restrictions by increasing the size of
SharedInvalidationMessage by 4 bytes, but right now that doesn't seem
like a good trade-off.

Review by Jaime Casanova and Tom Lane.
2010-08-13 20:10:54 +00:00
Bruce Momjian 239d769e7e pgindent run for 9.0, second run 2010-07-06 19:19:02 +00:00
Tom Lane e76c1a0f4d Replace max_standby_delay with two parameters, max_standby_archive_delay and
max_standby_streaming_delay, and revise the implementation to avoid assuming
that timestamps found in WAL records can meaningfully be compared to clock
time on the standby server.  Instead, the delay limits are compared to the
elapsed time since we last obtained a new WAL segment from archive or since
we were last "caught up" to WAL data arriving via streaming replication.
This avoids problems with clock skew between primary and standby, as well
as other corner cases that the original coding would misbehave in, such
as the primary server having significant idle time between transactions.
Per my complaint some time ago and considerable ensuing discussion.

Do some desultory editing on the hot standby documentation, too.
2010-07-03 20:43:58 +00:00
Simon Riggs f9dbac9476 HS Defer buffer pin deadlock check until deadlock_timeout has expired.
During Hot Standby we need to check for buffer pin deadlocks when the
Startup process begins to wait, in case it never wakes up again. We
previously made the deadlock check immediately on the basis it was
cheap, though clearer thinking and prima facie evidence shows that
was too simple. Refactor existing code to make it easy to add in
deferral of deadlock check until deadlock_timeout allowing a good
reduction in deadlock checks since far few buffer pins are held for
that duration. It's worth doing anyway, though major goal is to
prevent further reports of context switching with high numbers of
users on occasional tests.
2010-05-26 19:52:52 +00:00
Robert Haas ea9968c331 Rename PM_RECOVERY_CONSISTENT and PMSIGNAL_RECOVERY_CONSISTENT.
The new names PM_HOT_STANDBY and PMSIGNAL_BEGIN_HOT_STANDBY more accurately
reflect their actual function.
2010-05-15 20:01:32 +00:00
Simon Riggs 8431e296ea Cleanup initialization of Hot Standby. Clarify working with reanalysis
of requirements and documentation on LogStandbySnapshot(). Fixes
two minor bugs reported by Tom Lane that would lead to an incorrect
snapshot after transaction wraparound. Also fix two other problems
discovered that would give incorrect snapshots in certain cases.
ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor
refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().
2010-05-13 11:15:38 +00:00
Bruce Momjian 65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Tom Lane d1e027221d Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue.
In addition, add support for a "payload" string to be passed along with
each notify event.

This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage.  There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.

Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
2010-02-16 22:34:57 +00:00
Greg Stark f8c183a1ac Speed up CREATE DATABASE by deferring the fsyncs until after copying
all the data and using posix_fadvise to nudge the OS into flushing it
earlier. This also hopefully makes CREATE DATABASE avoid spamming the
cache.

Tests show a big speedup on Linux at least on some filesystems.

Idea and patch from Andres Freund.
2010-02-15 00:50:57 +00:00
Simon Riggs dd428c79a4 Fix relcache init file invalidation during Hot Standby for the case
where a database has a non-default tablespaceid. Pass thru MyDatabaseId
and MyDatabaseTableSpace to allow file path to be re-created in
standby and correct invalidation to take place in all cases.
Update and rework xact_commit_desc() debug messages.
Bug report from Tom by code inspection. Fix by me.
2010-02-13 16:15:48 +00:00
Simon Riggs b95a720a48 Re-enable max_standby_delay = -1 using deadlock detection on startup
process. If startup waits on a buffer pin we send a request to all
backends to cancel themselves if they are holding the buffer pin
required and they are also waiting on a lock. If not, startup waits
until max_standby_delay before cancelling any backend waiting for
the requested buffer pin.
2010-02-13 01:32:20 +00:00
Tom Lane cbe9d6beb4 Fix up rickety handling of relation-truncation interlocks.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens.  Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation.  We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation.  This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday.  We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.

In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.
2010-02-09 21:43:30 +00:00
Tom Lane b9b8831ad6 Create a "relation mapping" infrastructure to support changing the relfilenodes
of shared or nailed system catalogs.  This has two key benefits:

* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.

* We no longer have to use an unsafe reindex-in-place approach for reindexing
  shared catalogs.

CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.

Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.

This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch.  As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
2010-02-07 20:48:13 +00:00
Simon Riggs c85c941470 Detect early deadlock in Hot Standby when Startup is already waiting. First
stage of required deadlock detection to allow re-enabling max_standby_delay
setting of -1, which is now essential in the absence of improved relation-
specific conflict resoluton. Requested by Greg Stark et al.
2010-01-31 19:01:11 +00:00
Simon Riggs 76be0c81cc Filter recovery conflicts based upon dboid from relfilenode of WAL
records for heap and btree. Minor change, mostly API changes to
pass through the required values. This is a simple change though
also provides the refactoring required for further enhancements
to conflict processing using the relOid. Changes only have effect
during Hot Standby.
2010-01-29 17:10:05 +00:00
Heikki Linnakangas 1bb2558046 Make standby server continuously retry restoring the next WAL segment with
restore_command, if the connection to the primary server is lost. This
ensures that the standby can recover automatically, if the connection is
lost for a long time and standby falls behind so much that the required
WAL segments have been archived and deleted in the master.

This also makes standby_mode useful without streaming replication; the
server will keep retrying restore_command every few seconds until the
trigger file is found. That's the same basic functionality pg_standby
offers, but without the bells and whistles.

To implement that, refactor the ReadRecord/FetchRecord functions. The
FetchRecord() function introduced in the original streaming replication
patch is removed, and all the retry logic is now in a new function called
XLogReadPage(). XLogReadPage() is now responsible for executing
restore_command, launching walreceiver, and waiting for new WAL to arrive
from primary, as required.

This also changes the life cycle of walreceiver. When launched, it now only
tries to connect to the master once, and exits if the connection fails, or
is lost during streaming for any reason. The startup process detects the
death, and re-launches walreceiver if necessary.
2010-01-27 15:27:51 +00:00
Simon Riggs 959ac58c04 In HS, Startup process sets SIGALRM when waiting for buffer pin. If
woken by alarm we send SIGUSR1 to all backends requesting that they
check to see if they are blocking Startup process. If so, they throw
ERROR/FATAL as for other conflict resolutions. Deadlock stop gap
removed. max_standby_delay = -1 option removed to prevent deadlock.
2010-01-23 16:37:12 +00:00
Heikki Linnakangas 47ce95a7b9 Now that much of walreceiver has been pulled back into the postgres
binary, revert PGDLLIMPORT decoration of global variables. I'm not sure
if there's any real harm from unnecessary PGDLLIMPORTs, but these are all
internal variables that external modules really shouldn't be messing
with. ThisTimeLineID still needs PGDLLIMPORT.
2010-01-20 18:54:27 +00:00
Simon Riggs a8ce974cdd Teach standby conflict resolution to use SIGUSR1
Conflict reason is passed through directly to the backend, so we can
take decisions about the effect of the conflict based upon the local
state. No specific changes, as yet, though this prepares for later work.
CancelVirtualTransaction() sends signals while holding ProcArrayLock.
Introduce errdetail_abort() to give message detail explaining that the
abort was caused by conflict processing. Remove CONFLICT_MODE states
in favour of using PROCSIG_RECOVERY_CONFLICT states directly, for clarity.
2010-01-16 10:05:59 +00:00
Tom Lane 47a09eda89 PGDLLIMPORT-ize the remaining variables needed by walreceiver. 2010-01-16 00:04:41 +00:00
Heikki Linnakangas 40f908bdcd Introduce Streaming Replication.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.

Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.

Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.

Fujii Masao, with additional hacking by me
2010-01-15 09:19:10 +00:00
Simon Riggs e99767bc28 First part of refactoring of code for ResolveRecoveryConflict. Purposes
of this are to centralise the conflict code to allow further change,
as well as to allow passing through the full reason for the conflict
through to the conflicting backends. Backend state alters how we
can handle different types of conflict so this is now required.
As originally suggested by Heikki, no longer optional.
2010-01-14 11:08:02 +00:00