Commit Graph

2307 Commits

Author SHA1 Message Date
Robert Haas 66abc2608c Add a new reloption, user_catalog_table.
When this reloption is set and wal_level=logical is configured,
we'll record the CIDs stamped by inserts, updates, and deletes to
the table just as we would for an actual catalog table.  This will
allow logical decoding to use historical MVCC snapshots to access
such tables just as they access ordinary catalog tables.

Replication solutions built around the logical decoding machinery
will likely need to set this operation for their configuration
tables; it might also be needed by extensions which perform table
access in their output functions.

Andres Freund, reviewed by myself and others.
2013-12-10 19:17:34 -05:00
Robert Haas e55704d8b2 Add new wal_level, logical, sufficient for logical decoding.
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983.  This makes it possible
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all.  Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.

Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values of stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.

Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.
2013-12-10 19:01:40 -05:00
Alvaro Herrera 312bde3d40 Fix improper abort during update chain locking
In 247c76a989, I added some code to do fine-grained checking of
MultiXact status of locking/updating transactions when traversing an
update chain.  There was a thinko in that patch which would have the
traversing abort, that is return HeapTupleUpdated, when the other
transaction is a committed lock-only.  In this case we should ignore it
and return success instead.  Of course, in the case where there is a
committed update, HeapTupleUpdated is the correct return value.

A user-visible symptom of this bug is that in REPEATABLE READ and
SERIALIZABLE transaction isolation modes spurious serializability errors
can occur:
  ERROR:  could not serialize access due to concurrent update

In order for this to happen, there needs to be a tuple that's key-share-
locked and also updated, and the update must abort; a subsequent
transaction trying to acquire a new lock on that tuple would abort with
the above error.  The reason is that the initial FOR KEY SHARE is seen
as committed by the new locking transaction, which triggers this bug.
(If the UPDATE commits, then the serialization error is correctly
reported.)

When running a query in READ COMMITTED mode, what happens is that the
locking is aborted by the HeapTupleUpdated return value, then
EvalPlanQual fetches the newest version of the tuple, which is then the
only version that gets locked.  (The second time the tuple is checked
there is no misbehavior on the committed lock-only, because it's not
checked by the code that traverses update chains; so no bug.) Only the
newest version of the tuple is locked, not older ones, but this is
harmless.

The isolation test added by this commit illustrates the desired
behavior, including the proper serialization errors that get thrown.

Backpatch to 9.3.
2013-12-05 17:47:51 -03:00
Heikki Linnakangas 9e857436ef Don't include unused space in LOG_NEWPAGE records.
This is the same trick we use when taking a full page image of a buffer
passed to XLogInsert.
2013-12-04 00:10:47 +02:00
Heikki Linnakangas 22122c83f1 Fix full-page writes of internal GIN pages.
Insertion to a non-leaf GIN page didn't make a full-page image of the page,
which is wrong. The code used to do it correctly, but was changed (commit
853d1c3103) because the redo-routine didn't
track incomplete splits correctly when the page was restored from a full
page image. Of course, that was not right way to fix it, the redo routine
should've been fixed instead. The redo-routine was surreptitiously fixed
in 2010 (commit 4016bdef8a), so all we need
to do now is revert the code that creates the record to its original form.

This doesn't change the format of the WAL record.

Backpatch to all supported versions.
2013-12-03 23:16:01 +02:00
Peter Eisentraut fef88b3fda Report exit code from external recovery commands properly
When an external recovery command such as restore_command or
archive_cleanup_command fails, report the exit code properly,
distinguishing signals and normal exists, using the existing
wait_result_to_str() facility, instead of just reporting the return
value from system().

Reviewed-by: Peter Geoghegan <pg@heroku.com>
2013-12-02 22:31:05 -05:00
Alvaro Herrera 2393c7d102 Fix a couple of bugs in MultiXactId freezing
Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look
into a multixact to check the members against cutoff_xid.  This means
that a very old Xid could survive hidden within a multi, possibly
outliving its CLOG storage.  In the distant future, this would cause
clog lookup failures:
ERROR:  could not access status of transaction 3883960912
DETAIL:  Could not open file "pg_clog/0E78": No such file or directory.

This mostly was problematic when the updating transaction aborted, since
in that case the row wouldn't get pruned away earlier in vacuum and the
multixact could possibly survive for a long time.  In many cases, data
that is inaccessible for this reason way can be brought back
heuristically.

As a second bug, heap_freeze_tuple() didn't properly handle multixacts
that need to be frozen according to cutoff_multi, but whose updater xid
is still alive.  Instead of preserving the update Xid, it just set Xmax
invalid, which leads to both old and new tuple versions becoming
visible.  This is pretty rare in practice, but a real threat
nonetheless.  Existing corrupted rows, unfortunately, cannot be repaired
in an automated fashion.

Existing physical replicas might have already incorrectly frozen tuples
because of different behavior than in master, which might only become
apparent in the future once pg_multixact/ is truncated; it is
recommended that all clones be rebuilt after upgrading.

Following code analysis caused by bug report by J Smith in message
CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
and privately by F-Secure.

Backpatch to 9.3, where freezing of MultiXactIds was introduced.

Analysis and patch by Andres Freund, with some tweaks by Álvaro.
2013-11-29 21:47:25 -03:00
Alvaro Herrera 1ce150b7bb Don't TransactionIdDidAbort in HeapTupleGetUpdateXid
It is dangerous to do so, because some code expects to be able to see what's
the true Xmax even if it is aborted (particularly while traversing HOT
chains).  So don't do it, and instead rely on the callers to verify for
abortedness, if necessary.

Several race conditions and bugs fixed in the process.  One isolation test
changes the expected output due to these.

This also reverts commit c235a6a589, which is no longer necessary.

Backpatch to 9.3, where this function was introduced.

Andres Freund
2013-11-29 21:47:21 -03:00
Alvaro Herrera 1df0122daa Truncate pg_multixact/'s contents during crash recovery
Commit 9dc842f08 of 8.2 era prevented MultiXact truncation during crash
recovery, because there was no guarantee that enough state had been
setup, and because it wasn't deemed to be a good idea to remove data
during crash recovery anyway.  Since then, due to Hot-Standby, streaming
replication and PITR, the amount of time a cluster can spend doing crash
recovery has increased significantly, to the point that a cluster may
even never come out of it.  This has made not truncating the content of
pg_multixact/ not defensible anymore.

To fix, take care to setup enough state for multixact truncation before
crash recovery starts (easy since checkpoints contain the required
information), and move the current end-of-recovery actions to a new
TrimMultiXact() function, analogous to TrimCLOG().

At some later point, this should probably done similarly to the way
clog.c is doing it, which is to just WAL log truncations, but we can't
do that for the back branches.

Back-patch to 9.0.  8.4 also has the problem, but since there's no hot
standby there, it's much less pressing.  In 9.2 and earlier, this patch
is simpler than in newer branches, because multixact access during
recovery isn't required.  Add appropriate checks to make sure that's not
happening.

Andres Freund
2013-11-29 21:47:15 -03:00
Alvaro Herrera f54106f77e Fix full-table-vacuum request mechanism for MultiXactIds
While autovacuum dutifully launched anti-multixact-wraparound vacuums
when the multixact "age" was reached, the vacuum code was not aware that
it needed to make them be full table vacuums.  As the resulting
partial-table vacuums aren't capable of actually increasing relminmxid,
autovacuum continued to launch anti-wraparound vacuums that didn't have
the intended effect, until age of relfrozenxid caused the vacuum to
finally be a full table one via vacuum_freeze_table_age.

To fix, introduce logic for multixacts similar to that for plain
TransactionIds, using the same GUCs.

Backpatch to 9.3, where permanent MultiXactIds were introduced.

Andres Freund, some cleanup by Álvaro
2013-11-29 21:47:13 -03:00
Alvaro Herrera 76a31c689c Replace hardcoded 200000000 with autovacuum_freeze_max_age
Parts of the code used autovacuum_freeze_max_age to determine whether
anti-multixact-wraparound vacuums are necessary, while others used a
hardcoded 200000000 value.  This leads to problems when
autovacuum_freeze_max_age is set to a non-default value.  Use the latter
everywhere.

Backpatch to 9.3, where vacuuming of multixacts was introduced.

Andres Freund
2013-11-29 21:47:09 -03:00
Tom Lane 16e1b7a1b7 Fix assorted race conditions in the new timeout infrastructure.
Prevent handle_sig_alarm from losing control partway through due to a query
cancel (either an asynchronous SIGINT, or a cancel triggered by one of the
timeout handler functions).  That would at least result in failure to
schedule any required future interrupt, and might result in actual
corruption of timeout.c's data structures, if the interrupt happened while
we were updating those.

We could still lose control if an asynchronous SIGINT arrives just as the
function is entered.  This wouldn't break any data structures, but it would
have the same effect as if the SIGALRM interrupt had been silently lost:
we'd not fire any currently-due handlers, nor schedule any new interrupt.
To forestall that scenario, forcibly reschedule any pending timer interrupt
during AbortTransaction and AbortSubTransaction.  We can avoid any extra
kernel call in most cases by not doing that until we've allowed
LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events.

Another hazard is that some platforms (at least Linux and *BSD) block a
signal before calling its handler and then unblock it on return.  When we
longjmp out of the handler, the unblock doesn't happen, and the signal is
left blocked indefinitely.  Again, we can fix that by forcibly unblocking
signals during AbortTransaction and AbortSubTransaction.

These latter two problems do not manifest when the longjmp reaches
postgres.c, because the error recovery code there kills all pending timeout
events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal
mask is restored.  So errors thrown outside any transaction should be OK
already, and cleaning up in AbortTransaction and AbortSubTransaction should
be enough to fix these issues.  (We're assuming that any code that catches
a query cancel error and doesn't re-throw it will do at least a
subtransaction abort to clean up; but that was pretty much required already
by other subsystems.)

Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when
disabling that event: if a lock timeout interrupt happened after the lock
was granted, the ensuing query cancel is still going to happen at the next
CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user
cancel.

Per reports from Dan Wood.

Back-patch to 9.3 where the new timeout handling infrastructure was
introduced.  We may at some point decide to back-patch the signal
unblocking changes further, but I'll desist from that until we hear
actual field complaints about it.
2013-11-29 16:41:00 -05:00
Robert Haas 8e18d04d4d Refine our definition of what constitutes a system relation.
Although user-defined relations can't be directly created in
pg_catalog, it's possible for them to end up there, because you can
create them in some other schema and then use ALTER TABLE .. SET SCHEMA
to move them there.  Previously, such relations couldn't afterwards
be manipulated, because IsSystemRelation()/IsSystemClass() rejected
all attempts to modify objects in the pg_catalog schema, regardless
of their origin.  With this patch, they now reject only those
objects in pg_catalog which were created at initdb-time, allowing
most operations on user-created tables in pg_catalog to proceed
normally.

This patch also adds new functions IsCatalogRelation() and
IsCatalogClass(), which is similar to IsSystemRelation() and
IsSystemClass() but with a slightly narrower definition: only TOAST
tables of system catalogs are included, rather than *all* TOAST tables.
This is currently used only for making decisions about when
invalidation messages need to be sent, but upcoming logical decoding
patches will find other uses for this information.

Andres Freund, with some modifications by me.
2013-11-28 20:57:20 -05:00
Heikki Linnakangas 2fe69cacff Another gin_desc fix.
The number of items inserted was incorrectly printed as if it was a boolean.
2013-11-28 23:35:50 +02:00
Heikki Linnakangas 97c19e6c38 Fix gin_desc routine to match the WAL format.
In the GIN incomplete-splits patch, I used BlockIdDatas to store the block
number of left and right children, when inserting a downlink after a split
to an internal page posting list page. But gin_desc thought they were stored
as BlockNumbers.
2013-11-28 21:57:42 +02:00
Alvaro Herrera d51a8c52ba Unbreak buildfarm
I removed an intermediate commit before pushing and forgot to test the
resulting tree :-(
2013-11-28 12:59:45 -03:00
Alvaro Herrera 247c76a989 Use a more granular approach to follow update chains
Instead of simply checking the KEYS_UPDATED bit, we need to check
whether each lock held on the future version of the tuple conflicts with
the lock we're trying to acquire.

Per bug report #8434 by Tomonari Katsumata
2013-11-28 12:00:12 -03:00
Alvaro Herrera e4828e9ccb Compare Xmin to previous Xmax when locking an update chain
Not doing so causes us to traverse an update chain that has been broken
by concurrent page pruning.  All other code that traverses update chains
uses this check as one of the cases in which to stop iterating, so
replicate it here too.  Failure to do so leads to erroneous CLOG,
subtrans or multixact lookups.

Per discussion following the bug report by J Smith in
CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
as diagnosed by Andres Freund.
2013-11-28 12:00:12 -03:00
Alvaro Herrera c235a6a589 Don't try to set InvalidXid as page pruning hint
If a transaction updates/deletes a tuple just before aborting, and a
concurrent transaction tries to prune the page concurrently, the pruner
may see HeapTupleSatisfiesVacuum return HEAPTUPLE_DELETE_IN_PROGRESS,
but a later call to HeapTupleGetUpdateXid() return InvalidXid.  This
would cause an assertion failure in development builds, but would be
otherwise Mostly Harmless.

Fix by checking whether the updater Xid is valid before trying to apply
it as page prune point.

Reported by Andres in 20131124000203.GA4403@alap2.anarazel.de
2013-11-28 12:00:12 -03:00
Alvaro Herrera e518fa7adf Cope with heap_fetch failure while locking an update chain
The reason for the fetch failure is that the tuple was removed because
it was dead; so the failure is innocuous and can be ignored.  Moreover,
there's no need for further work and we can return success to the caller
immediately.  EvalPlanQualFetch is doing something very similar to this
already.

Report and test case from Andres Freund in
20131124000203.GA4403@alap2.anarazel.de
2013-11-28 12:00:12 -03:00
Heikki Linnakangas 631118fe1e Get rid of the post-recovery cleanup step of GIN page splits.
Replace it with an approach similar to what GiST uses: when a page is split,
the left sibling is marked with a flag indicating that the parent hasn't been
updated yet. When the parent is updated, the flag is cleared. If an insertion
steps on a page with the flag set, it will finish split before proceeding
with the insertion.

The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.

This also divides the responsibility of WAL-logging more clearly between
the generic ginbtree.c code, and the parts specific to entry and posting
trees. There is now a common WAL record format for insertions and deletions,
which is written by ginbtree.c, followed by tree-specific payload, which is
returned by the placetopage- and split- callbacks.
2013-11-27 19:21:23 +02:00
Heikki Linnakangas ce5326eed3 More GIN refactoring.
Separate the insertion payload from the more static portions of GinBtree.
GinBtree now only contains information related to searching the tree, and
the information of what to insert is passed separately.

Add root block number to GinBtree, instead of passing it around all the
functions as argument.

Split off ginFinishSplit() from ginInsertValue(). ginFinishSplit is
responsible for finding the parent and inserting the downlink to it.
2013-11-27 15:43:05 +02:00
Bruce Momjian a6542a4b68 Change SET LOCAL/CONSTRAINTS/TRANSACTION and ABORT behavior
Change SET LOCAL/CONSTRAINTS/TRANSACTION behavior outside of a
transaction block from error (post-9.3) to warning.  (Was nothing in <=
9.3.)  Also change ABORT outside of a transaction block from notice to
warning.
2013-11-25 19:19:40 -05:00
Heikki Linnakangas 98f58a30c1 Fix Hot-Standby initialization of clog and subtrans.
These bugs can cause data loss on standbys started with hot_standby=on at
the moment they start to accept read only queries, by marking committed
transactions as uncommited. The likelihood of such corruptions is small
unless the primary has a high transaction rate.

5a031a5556 fixed bugs in HS's startup logic
by maintaining less state until at least STANDBY_SNAPSHOT_PENDING state
was reached, missing the fact that both clog and subtrans are written to
before that. This only failed to fail in common cases because the usage
of ExtendCLOG in procarray.c was superflous since clog extensions are
actually WAL logged.

f44eedc3f0f347a856eea8590730769125964597/I then tried to fix the missing
extensions of pg_subtrans due to the former commit's changes - which are
not WAL logged - by performing the extensions when switching to a state
> STANDBY_INITIALIZED and not performing xid assignments before that -
again missing the fact that ExtendCLOG is unneccessary - but screwed up
twice: Once because latestObservedXid wasn't updated anymore in that
state due to the earlier commit and once by having an off-by-one error in
the loop performing extensions. This means that whenever a
CLOG_XACTS_PER_PAGE (32768 with default settings) boundary was crossed
between the start of the checkpoint recovery started from and the first
xl_running_xact record old transactions commit bits in pg_clog could be
overwritten if they started and committed in that window.

Fix this mess by not performing ExtendCLOG() in HS at all anymore since
it's unneeded and evidently dangerous and by performing subtrans
extensions even before reaching STANDBY_SNAPSHOT_PENDING.

Analysis and patch by Andres Freund. Reported by Christophe Pettus.
Backpatch down to 9.0, like the previous commit that caused this.
2013-11-22 14:45:41 +02:00
Heikki Linnakangas 1a3d104475 Avoid acquiring spinlock when checking if recovery has finished, for speed.
RecoveryIsInProgress() can be called very frequently. During normal
operation, it just checks a backend-local variable and returns quickly,
but during hot standby, it checks a spinlock-protected shared variable.
Those spinlock acquisitions can become a point of contention on a busy
hot standby system.

Replace the spinlock acquisition with a memory barrier.

Per discussion with Andres Freund, Ants Aasma and Merlin Moncure.
2013-11-22 13:07:23 +02:00
Tom Lane 784e762e88 Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.
This patch adds the ability to write TABLE( function1(), function2(), ...)
as a single FROM-clause entry.  The result is the concatenation of the
first row from each function, followed by the second row from each
function, etc; with NULLs inserted if any function produces fewer rows than
others.  This is believed to be a much more useful behavior than what
Postgres currently does with multiple SRFs in a SELECT list.

This syntax also provides a reasonable way to combine use of column
definition lists with WITH ORDINALITY: put the column definition list
inside TABLE(), where it's clear that it doesn't control the ordinality
column as well.

Also implement SQL-compliant multiple-argument UNNEST(), by turning
UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)).

The SQL standard specifies TABLE() with only a single function, not
multiple functions, and it seems to require an implicit UNNEST() which is
not what this patch does.  There may be something wrong with that reading
of the spec, though, because if it's right then the spec's TABLE() is just
a pointless alternative spelling of UNNEST().  After further review of
that, we might choose to adopt a different syntax for what this patch does,
but in any case this functionality seems clearly worthwhile.

Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and
significantly revised by me
2013-11-21 19:37:20 -05:00
Heikki Linnakangas 04eee1fa9e More GIN refactoring.
Split off the portion of ginInsertValue that inserts the tuple to current
level into a separate function, ginPlaceToPage. ginInsertValue's charter
is now to recurse up the tree to insert the downlink, when a page split is
required.

This is in preparation for a patch to change the way incomplete splits are
handled, which will need to do these operations separately. And IMHO makes
the code more readable anyway.
2013-11-20 17:01:33 +02:00
Heikki Linnakangas 501012631e Refactor the internal GIN B-tree interface for forming a downlink.
This creates a new gin-btree callback function for creating a downlink for
a page. Previously, ginxlog.c duplicated the logic used during normal
operation.
2013-11-20 16:57:41 +02:00
Heikki Linnakangas 04965ad40e Further GIN refactoring.
Merge some functions that were always called together. Makes the code
little bit more readable.
2013-11-20 16:09:14 +02:00
Heikki Linnakangas 07fca603b5 Fix bug in GIN posting tree root creation.
The root page is filled with as many items as fit, and the rest are inserted
using normal insertions. However, I fumbled the variable names, and the code
actually memcpy'd all the items on the page, overflowing the buffer. While
at it, rename the variable to make the distinction more clear.

Reported by Teodor Sigaev. This bug was introduced by my recent
refactorings, so no backpatching required.
2013-11-13 13:47:59 +02:00
Heikki Linnakangas ac4ab97ec0 Fix race condition in GIN posting tree page deletion.
If a page is deleted, and reused for something else, just as a search is
following a rightlink to it from its left sibling, the search would continue
scanning whatever the new contents of the page are. That could lead to
incorrect query results, or even something more curious if the page is
reused for a different kind of a page.

To fix, modify the search algorithm to lock the next page before releasing
the previous one, and refrain from deleting pages from the leftmost branch
of the tree.

Add a new Concurrency section to the README, explaining why this works.
There is a lot more one could say about concurrency in GIN, but that's for
another patch.

Backpatch to all supported versions.
2013-11-08 22:21:42 +02:00
Heikki Linnakangas fde7172d93 Fix setting of right bound at GIN page split.
Broken by my refactoring.
2013-11-07 19:45:07 +02:00
Heikki Linnakangas 0ea53256a8 Fix missing argument and function prototypes.
Not sure how I missed these in previous commit.
2013-11-06 11:22:58 +02:00
Heikki Linnakangas ecaa4708e5 Misc GIN refactoring.
Merge the isEnoughSpace and placeToPage functions in the b-tree interface
into one function that tries to put a tuple on page, and returns false if
it doesn't fit.

Move createPostingTree function to gindatapage.c, and change its contract
so that it can be passed more items than fit on the root page. It's in a
better position than the callers to know how many items fit.

Move ginMergeItemPointers out of gindatapage.c, into a separate file.

These changes make no difference now, but reduce the footprint of Alexander
Korotkov's upcoming patch to pack item pointers more tightly.
2013-11-06 10:32:09 +02:00
Tom Lane b006f4ddb9 Prevent memory leaks from accumulating across printtup() calls.
Historically, printtup() has assumed that it could prevent memory leakage
by pfree'ing the string result of each output function and manually
managing detoasting of toasted values.  This amounts to assuming that
datatype output functions never leak any memory internally; an assumption
we've already decided to be bogus elsewhere, for example in COPY OUT.
range_out in particular is known to leak multiple kilobytes per call, as
noted in bug #8573 from Godfried Vanluffelen.  While we could go in and fix
that leak, it wouldn't be very notationally convenient, and in any case
there have been and undoubtedly will again be other leaks in other output
functions.  So what seems like the best solution is to run the output
functions in a temporary memory context that can be reset after each row,
as we're doing in COPY OUT.  Some quick experimentation suggests this is
actually a tad faster than the retail pfree's anyway.

This patch fixes all the variants of printtup, except for debugtup()
which is used in standalone mode.  It doesn't seem worth worrying
about query-lifespan leaks in standalone mode, and fixing that case
would be a bit tedious since debugtup() doesn't currently have any
startup or shutdown functions.

While at it, remove manual detoast management from several other
output-function call sites that had copied it from printtup().  This
doesn't make a lot of difference right now, but in view of recent
discussions about supporting "non-flattened" Datums, we're going to
want that code gone eventually anyway.

Back-patch to 9.2 where range_out was introduced.  We might eventually
decide to back-patch this further, but in the absence of known major
leaks in older output functions, I'll refrain for now.
2013-11-03 11:33:05 -05:00
Tom Lane 24ace4053d Retry after buffer locking failure during SPGiST index creation.
The original coding thought this case was impossible, but it can happen
if the bgwriter or checkpointer processes decide to write out an index
page while creation is still proceeding, leading to a bogus "unexpected
spgdoinsert() failure" error.  Problem reported by Jonathan S. Katz.

Teodor Sigaev
2013-11-02 16:45:42 -04:00
Robert Haas cacbdd7810 Use appendStringInfoString instead of appendStringInfo where possible.
This shaves a few cycles, and generally seems like good programming
practice.

David Rowley
2013-10-31 10:55:59 -04:00
Tom Lane 9a9473f3cc Prevent using strncpy with src == dest in TupleDescInitEntry.
The C and POSIX standards state that strncpy's behavior is undefined when
source and destination areas overlap.  While it remains dubious whether any
implementations really misbehave when the pointers are exactly equal, some
platforms are now starting to force the issue by complaining when an
undefined call occurs.  (In particular OS X 10.9 has been seen to dump core
here, though the exact set of circumstances needed to trigger that remain
elusive.  Similar behavior can be expected to be optional on Linux and
other platforms in the near future.)  So tweak the code to explicitly do
nothing when nothing need be done.

Back-patch to all active branches.  In HEAD, this also lets us get rid of
an exception in valgrind.supp.

Per discussion of a report from Matthias Schmitt.
2013-10-28 20:49:24 -04:00
Heikki Linnakangas 4d6d425ab8 Fix typos in comments. 2013-10-24 11:50:02 +03:00
Noah Misch 709170b790 Consistently use unsigned arithmetic for alignment calculations.
This avoids an assumption about the signed number representation.  It is
anticipated to have no functional changes on supported configurations;
many two's complement assumptions remain elsewhere.

Per a suggestion from Andres Freund.
2013-10-20 21:04:52 -04:00
Heikki Linnakangas 5962519b36 TYPEALIGN doesn't work on int64 on 32-bit platforms.
The TYPEALIGN macro, and the related ones like MAXALIGN, don't work with
values larger than intptr_t, because TYPEALIGN casts the argument to
intptr_t to do the arithmetic. That's not a problem when dealing with
pointers or lengths or offsets related to pointers, but the XLogInsert
scaling patch added a call to MAXALIGN with an XLogRecPtr argument.

To fix, add wider variants of the macros, called TYPEALIGN64 and MAXALIGN64,
which are just like the existing variants but work with uint64 instead of
intptr_t.

Report and patch by David Rowley, analysis by Andres Freund.
2013-10-08 01:59:57 +03:00
Heikki Linnakangas 81fbbfe335 Fix bugs in SSI tuple locking.
1. In heap_hot_search_buffer(), the PredicateLockTuple() call is passed
wrong offset number. heapTuple->t_self is set to the tid of the first
tuple in the chain that's visited, not the one actually being read.

2. CheckForSerializableConflictIn() uses the tuple's t_ctid field
instead of t_self to check for exiting predicate locks on the tuple. If
the tuple was updated, but the updater rolled back, t_ctid points to the
aborted dead tuple.

Reported by Hannu Krosing. Backpatch to 9.1.
2013-10-08 00:18:43 +03:00
Heikki Linnakangas c2b175b948 Minor GIN code refactoring.
It makes for cleaner code to have separate Get/Add functions for PostingItems
and ItemPointers.  A few callsites that have to deal with both types need to
be duplicated because of this, but all the callers have to know which one
they're dealing with anyway. Overall, this reduces the amount of casting
required.

Extracted from Alexander Korotkov's larger patch to change the data page
format.
2013-10-03 11:51:31 +03:00
Alvaro Herrera b2fc4d6142 Fix pgindent comment breakage 2013-09-24 18:20:37 -03:00
Robert Haas 86a174bff0 Typo fix.
Etsuro Fujita
2013-09-18 08:57:44 -04:00
Alvaro Herrera dd778e9d88 Rename various "freeze multixact" variables
It seems to make more sense to use "cutoff multixact" terminology
throughout the backend code; "freeze" is associated with replacing of an
Xid with FrozenTransactionId, which is not what we do for MultiXactIds.

Andres Freund
Some adjustments by Álvaro Herrera
2013-09-16 15:47:31 -03:00
Heikki Linnakangas 0892ecbc01 Add a GUC to report whether data page checksums are enabled.
Bernd Helmle
2013-09-16 14:36:01 +03:00
Robert Haas 71901ab6da Introduce InvalidCommandId.
This allows a 32-bit field to represent an *optional* command ID
without a separate flag bit.

Andres Freund
2013-09-09 16:25:29 -04:00
Jeff Davis b1892aaeaa Revert WAL posix_fallocate() patches.
This reverts commit 269e780822
and commit 5b571bb8c8.

Unfortunately, the initial patch had insufficient performance testing,
and resulted in a regression.

Per report by Thom Brown.
2013-09-04 23:43:41 -07:00
Heikki Linnakangas 375d8526f2 Keep heavily-contended fields in XLogCtlInsert on different cache lines.
Performance testing shows that if the insertpos_lck spinlock and the fields
that it protects are on the same cache line with other variables that are
frequently accessed, the false sharing can hurt performance a lot. Keep
them apart by adding some padding.
2013-09-04 23:14:33 +03:00
Heikki Linnakangas 3619a20d33 Rename the "fast_promote" file to just "promote".
This keeps the usual trigger file name unchanged from 9.2, avoiding nasty
issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa.
The fallback behavior of creating a full checkpoint before starting up is now
triggered by a file called "fallback_promote". That can be useful for
debugging purposes, but we don't expect any users to have to resort to that
and we might want to remove that in the future, which is why the fallback
mechanism is undocumented.
2013-08-19 20:59:51 +03:00
Alvaro Herrera 78e1220104 Fix pg_upgrade failure from servers older than 9.3
When upgrading from servers of versions 9.2 and older, and MultiXactIds
have been used in the old server beyond the first page (that is, 2048
multis or more in the default 8kB-page build), pg_upgrade would set the
next multixact offset to use beyond what has been allocated in the new
cluster.  This would cause a failure the first time the new cluster
needs to use this value, because the pg_multixact/offsets/ file wouldn't
exist or wouldn't be large enough.  To fix, ensure that the transient
server instances launched by pg_upgrade extend the file as necessary.

Per report from Jesse Denardo in
CANiVXAj4c88YqipsyFQPboqMudnjcNTdB3pqe8ReXqAFQ=HXyA@mail.gmail.com
2013-08-19 12:56:18 -04:00
Peter Eisentraut 072457b360 Message punctuation and pluralization fixes 2013-08-09 08:02:44 -04:00
Greg Stark c62736cc37 Add SQL Standard WITH ORDINALITY support for UNNEST (and any other SRF)
Author: Andrew Gierth, David Fetter
Reviewers: Dean Rasheed, Jeevan Chalke, Stephen Frost
2013-07-29 16:38:01 +01:00
Peter Eisentraut 626092a2e1 Message style improvements 2013-07-28 07:01:13 -04:00
Robert Haas 765ad89be3 Use InvalidSnapshot, now SnapshotNow, as the default snapshot.
As far as I can determine, there's no code in the core distribution
that fails to explicitly set the snapshot of a scan or executor
state.  If there is any such code, this will probably cause it to
seg fault; friendlier suggestions were discussed on pgsql-hackers,
but there was no consensus that anything more than this was
needed.

This is another step towards the hoped-for complete removal of
SnapshotNow.
2013-07-23 10:58:32 -04:00
Robert Haas 0518eceec3 Adjust HeapTupleSatisfies* routines to take a HeapTuple.
Previously, these functions took a HeapTupleHeader, but upcoming
patches for logical replication will introduce new a new snapshot
type under which the tuple's TID will be used to lookup (CMIN, CMAX)
for visibility determination purposes.  This makes that information
available.  Code churn is minimal since HeapTupleSatisfiesVisibility
took the HeapTuple anyway, and deferenced it before calling the
satisfies function.

Independently of logical replication, this allows t_tableOid and
t_self to be cross-checked via assertions in tqual.c.  This seems
like a useful way to make sure that all callers are setting these
values properly, which has been previously put forward as
desirable.

Andres Freund, reviewed by Álvaro Herrera
2013-07-22 13:38:44 -04:00
Stephen Frost 4cbe3ac3e8 WITH CHECK OPTION support for auto-updatable VIEWs
For simple views which are automatically updatable, this patch allows
the user to specify what level of checking should be done on records
being inserted or updated.  For 'LOCAL CHECK', new tuples are validated
against the conditionals of the view they are being inserted into, while
for 'CASCADED CHECK' the new tuples are validated against the
conditionals for all views involved (from the top down).

This option is part of the SQL specification.

Dean Rasheed, reviewed by Pavel Stehule
2013-07-18 17:10:16 -04:00
Heikki Linnakangas 107cbc90a7 Fix variable names mentioned in comment to match the code.
Also, in another comment, explain why holding an insertion slot is a
critical section.

Per review by Amit Kapila.
2013-07-17 23:32:32 +03:00
Heikki Linnakangas 59c02a36f0 Fix assert failure at end of recovery, broken by XLogInsert scaling patch.
Initialization of the first XLOG buffer at end-of-recovery was broken for
the case that the last read WAL record ended at a page boundary. Instead of
trying to copy the last full xlog page to the buffer cache in that case,
just set shared state so that the next page is initialized when the first
WAL record after startup is inserted. (that's what we did in earlier
version, too)

To make the shared state required for that case less surprising, replace the
XLogCtl->curridx variable, which was the index of the latest initialized
buffer, with an XLogRecPtr of how far the buffers have been initialized.
That also allows us to get rid of the XLogRecEndPtrToBufIdx macro.

While we're at it, make a similar change for XLogCtl->Write.curridx, getting
rid of that variable and calculating the next buffer to write from
XLogCtl->LogwrtResult instead.
2013-07-17 23:12:22 +03:00
Noah Misch ffcf654547 Fix systable_recheck_tuple() for MVCC scan snapshots.
Since this function assumed non-MVCC snapshots, it broke when commit
568d4138c6 switched its one caller from
SnapshotNow scans to MVCC-snapshot scans.

Reviewed by Robert Haas, Tom Lane and Andres Freund.
2013-07-16 20:16:32 -04:00
Heikki Linnakangas f489470f8a Fix Windows build.
Was broken by my xloginsert scaling patch. XLogCtl global variable needs
to be initialized in each process, as it's not inherited by fork() on
Windows.
2013-07-08 17:28:48 +03:00
Heikki Linnakangas 9a20a9b21b Improve scalability of WAL insertions.
This patch replaces WALInsertLock with a number of WAL insertion slots,
allowing multiple backends to insert WAL records to the WAL buffers
concurrently. This is particularly useful for parallel loading large amounts
of data on a system with many CPUs.

This has one user-visible change: switching to a new WAL segment with
pg_switch_xlog() now fills the remaining unused portion of the segment with
zeros. This potentially adds some overhead, but it has been a very common
practice by DBA's to clear the "tail" of the segment with an external
pg_clearxlogtail utility anyway, to make the WAL files compress better.
With this patch, it's no longer necessary to do that.

This patch adds a new GUC, xloginsert_slots, to tune the number of WAL
insertion slots. Performance testing suggests that the default, 8, works
pretty well for all kinds of worklods, but I left the GUC in place to allow
others with different hardware to test that easily. We might want to remove
that before release.

Reviewed by Andres Freund.
2013-07-08 11:23:56 +03:00
Jeff Davis 5b571bb8c8 Handle posix_fallocate() errors.
On some platforms, posix_fallocate() is available but may still return
EINVAL if the underlying filesystem does not support it.  So, in case
of an error, fall through to the alternate implementation that just
writes zeros.

Per buildfarm failure and analysis by Tom Lane.
2013-07-06 13:46:04 -07:00
Noah Misch 02d2b694ee Update messages, comments and documentation for materialized views.
All instances of the verbiage lagging the code.  Back-patch to 9.3,
where materialized views were introduced.
2013-07-05 15:37:51 -04:00
Jeff Davis 269e780822 Use posix_fallocate() for new WAL files, where available.
This function is more efficient than actually writing out zeroes to
the new file, per microbenchmarks by Jon Nelson. Also, it may reduce
the likelihood of WAL file fragmentation.

Jon Nelson, with review by Andres Freund, Greg Smith and me.
2013-07-05 12:30:29 -07:00
Fujii Masao 7842d41df5 Fix typo in comment.
Michael Paquier
2013-07-05 02:47:49 +09:00
Robert Haas 6bc8ef0b7f Add new GUC, max_worker_processes, limiting number of bgworkers.
In 9.3, there's no particular limit on the number of bgworkers;
instead, we just count up the number that are actually registered,
and use that to set MaxBackends.  However, that approach causes
problems for Hot Standby, which needs both MaxBackends and the
size of the lock table to be the same on the standby as on the
master, yet it may not be desirable to run the same bgworkers in
both places.  9.3 handles that by failing to notice the problem,
which will probably work fine in nearly all cases anyway, but is
not theoretically sound.

A further problem with simply counting the number of registered
workers is that new workers can't be registered without a
postmaster restart.  This is inconvenient for administrators,
since bouncing the postmaster causes an interruption of service.
Moreover, there are a number of applications for background
processes where, by necessity, the background process must be
started on the fly (e.g. parallel query).  While this patch
doesn't actually make it possible to register new background
workers after startup time, it's a necessary prerequisite.

Patch by me.  Review by Michael Paquier.
2013-07-04 11:24:24 -04:00
Fujii Masao 2ef085d0e6 Get rid of pg_class.reltoastidxid.
Treat TOAST index just the same as normal one and get the OID
of TOAST index from pg_index but not pg_class.reltoastidxid.
This change allows us to handle multiple TOAST indexes, and
which is required infrastructure for upcoming
REINDEX CONCURRENTLY feature.

Patch by Michael Paquier, reviewed by Andres Freund and me.
2013-07-04 03:24:09 +09:00
Robert Haas 3682025015 Add support for multiple kinds of external toast datums.
To that end, support tags rather than lengths for external datums.
As an example of how this can be used, add support or "indirect"
tuples which point to some externally allocated memory containing
a toast tuple.  Similar infrastructure could be used for other
purposes, including, perhaps, support for alternative compression
algorithms.

Andres Freund, reviewed by Hitoshi Harada and myself
2013-07-02 13:38:55 -04:00
Robert Haas 568d4138c6 Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row.  In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result.  This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.

The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow.  However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads.  To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed.  The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all.  Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.

Patch by me.  Review by Michael Paquier and Andres Freund.
2013-07-02 09:47:01 -04:00
Heikki Linnakangas 79ce29c734 Retry short writes when flushing WAL.
We don't normally bother retrying when the number of bytes written by
write() is short of what was requested. It is generally assumed that a
write() to disk doesn't return short, unless you run out of disk space.
While writing the WAL, however, it seems prudent to try a bit harder,
because a failure leads to PANIC. The write() is also much larger than most
write()s in the backend (up to wal_buffers), so there's more room for
surprises.

Also retry on EINTR. All signals used in the backend are flagged SA_RESTART
nowadays, so it shouldn't happen, but better to be defensive.
2013-07-01 09:36:00 +03:00
Heikki Linnakangas ee6556555b Inline ginCompareItemPointers function for speed.
ginCompareItemPointers function is called heavily in gin index scans -
inlining it speeds up some kind of queries a lot.
2013-06-29 12:55:34 +03:00
Noah Misch 19085116ee Cooperate with the Valgrind instrumentation framework.
Valgrind "client requests" in aset.c and mcxt.c teach Valgrind and its
Memcheck tool about the PostgreSQL allocator.  This makes Valgrind
roughly as sensitive to memory errors involving palloc chunks as it is
to memory errors involving malloc chunks.  Further client requests in
PageAddItem() and printtup() verify that all bits being added to a
buffer page or furnished to an output function are predictably-defined.
Those tests catch failures of C-language functions to fully initialize
the bits of a Datum, which in turn stymie optimizations that rely on
_equalConst().  Define the USE_VALGRIND symbol in pg_config_manual.h to
enable these additions.  An included "suppression file" silences nominal
errors we don't plan to fix.

Reviewed in earlier versions by Peter Geoghegan and Korry Douglas.
2013-06-26 20:22:25 -04:00
Noah Misch 1d96bb9602 Initialize pad bytes in GinFormTuple().
Every other core buffer page consumer initializes the bytes it furnishes
to PageAddItem().  For consistency, do the same here.  No back-patch;
regardless, we couldn't count on the fix so long as binary upgrade can
carry forward affected index builds.
2013-06-26 19:55:15 -04:00
Alvaro Herrera 4ca50e0710 Avoid inconsistent type declaration
Clang 3.3 correctly complains that a variable of type enum
MultiXactStatus cannot hold a value of -1, which makes sense.  Change
the declared type of the variable to int instead, and apply casting as
necessary to avoid the warning.

Per notice from Andres Freund
2013-06-25 16:41:47 -04:00
Simon Riggs 1f09121b4e Ensure no xid gaps during Hot Standby startup
In some cases with higher numbers of subtransactions
it was possible for us to incorrectly initialize
subtrans leading to complaints of missing pages.

Bug report by Sergey Konoplev
Analysis and fix by Andres Freund
2013-06-23 11:05:02 +01:00
Peter Eisentraut 7dfd5cd21c Clarify terminology standalone backend vs. single-user mode
Most of the documentation uses "single-user mode", so use that in the
code as well.  Adjust the documentation to match the new error message
wording.  Also add a documentation index entry for "single-user mode".

Based-on-patch-by: Jeff Janes <jeff.janes@gmail.com>
2013-06-20 23:03:18 -04:00
Jeff Davis b8fd1a09f3 Add buffer_std flag to MarkBufferDirtyHint().
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.

In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.

Back-patch to 9.3.
2013-06-17 08:02:12 -07:00
Tom Lane e472b92140 Avoid deadlocks during insertion into SP-GiST indexes.
SP-GiST's original scheme for avoiding deadlocks during concurrent index
insertions doesn't work, as per report from Hailong Li, and there isn't any
evident way to make it work completely.  We could possibly lock individual
inner tuples instead of their whole pages, but preliminary experimentation
suggests that the performance penalty would be huge.  Instead, if we fail
to get a buffer lock while descending the tree, just restart the tree
descent altogether.  We keep the old tuple positioning rules, though, in
hopes of reducing the number of cases where this can happen.

Teodor Sigaev, somewhat edited by Tom Lane
2013-06-14 14:26:43 -04:00
Tom Lane c62866eeaf Remove special-case treatment of LOG severity level in standalone mode.
elog.c has historically treated LOG messages as low-priority during
bootstrap and standalone operation.  This has led to confusion and even
masked a bug, because the normal expectation of code authors is that
elog(LOG) will put something into the postmaster log, and that wasn't
happening during initdb.  So get rid of the special-case rule and make
the priority order the same as it is in normal operation.  To keep from
cluttering initdb's output and the behavior of a standalone backend,
tweak the severity level of three messages routinely issued by xlog.c
during startup and shutdown so that they won't appear in these cases.
Per my proposal back in December.
2013-06-13 23:15:15 -04:00
Noah Misch fb435f40d5 Observe array length in HaveVirtualXIDsDelayingChkpt().
Since commit f21bb9cfb5, this function
ignores the caller-provided length and loops until it finds a
terminator, which GetVirtualXIDsDelayingChkpt() never adds.  Restore the
previous loop control logic.  In passing, revert the addition of an
unused variable by the same commit, presumably a debugging relic.
2013-06-12 19:50:14 -04:00
Heikki Linnakangas f73cb5567c Fix typo in comment. 2013-06-06 18:27:01 +03:00
Stephen Frost f129615fe7 Additional spelling corrections
A few more minor spelling corrections, no functional changes.

Thom Brown
2013-06-03 08:40:27 -04:00
Heikki Linnakangas e1e2bb34f1 Code review of recycling WAL segments in a restartpoint.
Seems cleaner to get the currently-replayed TLI in the same call to
GetXLogReplayRecPtr that we get the WAL position. Make it more clear in the
comment what the code does when recovery has already ended
(RecoveryInProgress() will set ThisTimeLineID in that case). Finally, make
resetting ThisTimeLineID afterwards more explicit.
2013-06-03 09:25:12 +03:00
Stephen Frost c9fc28a7f1 Minor spelling fixes
Fix a few spelling mistakes.

Per bug report #8193 from Lajos Veres.
2013-06-01 10:18:59 -04:00
Stephen Frost 551938ae22 Post-pgindent cleanup
Make slightly better decisions about indentation than what pgindent
is capable of.  Mostly breaking out long function calls into one
line per argument, with a few other minor adjustments.

No functional changes- all whitespace.
pgindent ran cleanly (didn't change anything) after.
Passes all regressions.
2013-06-01 09:38:15 -04:00
Bruce Momjian 9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Simon Riggs 22a27ef113 After fast promotion use CHECKPOINT_FORCE
Not necessary for correctness, just to make
log_checkpoints output look less singular.

Requested by Fujii Masao
2013-05-21 21:27:12 +01:00
Simon Riggs 75a192638f Maintain ThisTimeLineID correctly in checkpointer
checkpointer needs to reset ThisTimeLineID after
a restartpoint to allow installing/recycling new
WAL files. If recovery has already ended this
would leave ThisTimeLineID set incorrectly and
so we must reset it otherwise later checkpoints
do not have the correct timeline.

Bug report by Heikki Linnakangas.
Further investigation by Heikki and myself.
2013-05-21 21:17:04 +01:00
Simon Riggs d4337a0dcb Init crash recovery using the latest available TLI
This simplifies the handling of crashes after fast promotion and various
minor cases that can exist in short timing windows around that case.

Broad fix to bug reported by Michael Paquier on -hackers,
approach prompted by Heikki Linnakangas
2013-05-19 17:31:07 +01:00
Simon Riggs 1781744cfc Emit msg correctly for timeline-crossing crash 2013-05-19 17:00:18 +01:00
Simon Riggs c94dff4c3c Remove single space on end of a line in xlog.c
Michael Paquier
2013-05-19 15:38:47 +01:00
Tom Lane e9c336c786 Fix handling of OID wraparound while in standalone mode.
If OID wraparound should occur while in standalone mode (unlikely but
possible), we want to advance the counter to FirstNormalObjectId not
FirstBootstrapObjectId.  Otherwise, user objects might be created with OIDs
in the system-reserved range.  That isn't immediately harmful but it poses
a risk of conflicts during future pg_upgrade operations.

Noted by Andres Freund.  Back-patch to all supported branches, since all of
them are supported sources for pg_upgrade operations.
2013-05-13 15:40:16 -04:00
Tom Lane 91715e8293 Fix management of fn_extra caching during repeated GiST index scans.
Commit d22a09dc70 introduced official support
for GiST consistentFns that want to cache data using the FmgrInfo fn_extra
pointer: the idea was to preserve the cached values across gistrescan(),
whereas formerly they'd been leaked.  However, there was an oversight in
that, namely that multiple scan keys might reference the same column's
consistentFn; the code would result in propagating the same cache value
into multiple scan keys, resulting in crashes or wrong answers.  Use a
separate array instead to ensure that each scan key keeps its own state.

Per bug #8143 from Joel Roller.  Back-patch to 9.2 where the bug was
introduced.
2013-05-09 23:09:04 -04:00
Heikki Linnakangas 2ffa66f497 Fix walsender failure at promotion.
If a standby server has a cascading standby server connected to it, it's
possible that WAL has already been sent up to the next WAL page boundary,
splitting a WAL record in the middle, when the first standby server is
promoted. Don't throw an assertion failure or error in walsender if that
happens.

Also, fix a variant of the same bug in pg_receivexlog: if it had already
received WAL on previous timeline up to a segment boundary, when the
upstream standby server is promoted so that the timeline switch record falls
on the previous segment, pg_receivexlog would miss the segment containing
the timeline switch. To fix that, have walsender send the position of the
timeline switch at end-of-streaming, in addition to the next timeline's ID.
It was previously assumed that the switch happened exactly where the
streaming stopped.

Note: this is an incompatible change in the streaming protocol. You might
get an error if you try to stream over timeline switches, if the client is
running 9.3beta1 and the server is more recent. It should be fine after a
reconnect, however.

Reported by Fujii Masao.
2013-05-08 20:30:17 +03:00
Heikki Linnakangas cb953d8b1b Use the term "radix tree" instead of "suffix tree" for SP-GiST text opclass.
What we have implemented is a radix tree (or a radix trie or a patricia
trie), but the docs and code comments incorrectly called it a "suffix tree".

Alexander Korotkov
2013-05-08 14:34:26 +03:00
Simon Riggs 443951748c Record data_checksum_version in control file.
The value is not used anywhere in code, but will
allow future changes to the checksum version
should that become necessary in the future.
2013-04-30 12:27:12 +01:00
Simon Riggs 2317a63328 Make fast promotion the default promotion mode.
Continue to allow a request for synchronous
checkpoints as a mechanism in case of problems.
2013-04-24 12:21:18 +01:00
Heikki Linnakangas 87ae9e7265 Remove some unused and seldom used fields from RelationAmInfo.
This saves some memory from each index relcache entry. At least on a 64-bit
machine, it saves just enough to shrink a typical relcache entry's memory
usage from 2k to 1k. That's nice if you have a lot of backends and a lot of
indexes.
2013-04-16 15:07:58 +03:00