Commit Graph

468 Commits

Author SHA1 Message Date
Bruce Momjian 9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Simon Riggs cf8dc9e10c Fix checksums for CLUSTER, VACUUM FULL etc.
In CLUSTER, VACUUM FULL and ALTER TABLE SET TABLESPACE
I erroneously set checksum before log_newpage, which
sets the LSN and invalidates the checksum. So set
checksum immediately *after* log_newpage.

Bug report Fujii Masao, Fix and patch by Jeff Davis
2013-04-07 22:16:51 +01:00
Heikki Linnakangas 3cfb572dde Fix buffer pin leak in heap update redo routine.
In a heap update, if the old and new tuple were on different pages, and the
new page no longer existed (because it was subsequently truncated away by
vacuum), heap_xlog_update forgot to release the pin on the old buffer. This
bug was introduced by the "Fix multiple problems in WAL replay" patch,
commit 3bbf668de9 (on master branch).

With full_page_writes=off, this triggered an "incorrect local pin count"
error later in replay, if the old page was vacuumed.

This fixes bug #7969, reported by Yunong Xiao. Backpatch to 9.0, like the
commit that introduced this bug.
2013-03-27 22:00:01 +02:00
Simon Riggs 96ef3b8ff1 Allow I/O reliability checks using 16-bit checksums
Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again. Hint bit setting will
require full page write when block is dirtied, which causes various
infrastructure changes. Extensive comments, docs and README.

WARNING message thrown if checksum fails on non-all zeroes page;
ERROR thrown but can be disabled with ignore_checksum_failure = on.

Feature enabled by an initdb option, since transition from option off
to option on is long and complex and has not yet been implemented.
Default is not to use checksums.

Checksum used is WAL CRC-32 truncated to 16-bits.

Simon Riggs, Jeff Davis, Greg Smith
Wide input and assistance from many community members. Thank you.
2013-03-22 13:54:07 +00:00
Simon Riggs bb7cc2623f Remove PageSetTLI and rename pd_tli to pd_checksum
Remove use of PageSetTLI() from all page manipulation functions
and adjust README to indicate change in the way we make changes
to pages. Repurpose those bytes into the pd_checksum field and
explain how that works in comments about page header.

Refactoring ahead of actual feature patch which would make use
of the checksum field, arriving later.

Jeff Davis, with comments and doc changes by Simon Riggs
Direction suggested by Robert Haas; many others providing
review comments.
2013-03-18 13:46:42 +00:00
Kevin Grittner 3bf3ab8c56 Add a materialized view relations.
A materialized view has a rule just like a view and a heap and
other physical properties like a table.  The rule is only used to
populate the table, references in queries refer to the
materialized data.

This is a minimal implementation, but should still be useful in
many cases.  Currently data is only populated "on demand" by the
CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW statements.
It is expected that future releases will add incremental updates
with various timings, and that a more refined concept of defining
what is "fresh" data will be developed.  At some point it may even
be possible to have queries use a materialized in place of
references to underlying tables, but that requires the other
above-mentioned features to be working first.

Much of the documentation work by Robert Haas.
Review by Noah Misch, Thom Brown, Robert Haas, Marko Tiikkaja
Security review by KaiGai Kohei, with a decision on how best to
implement sepgsql still pending.
2013-03-03 18:23:31 -06:00
Alvaro Herrera 5766228bc6 Fix Xmax freeze conditions
I broke this in 0ac5ad5134; previously, freezing a tuple marked with an
IS_MULTI xmax was not necessary.

Per brokenness report from Jeff Janes.
2013-02-08 12:50:58 -03:00
Alvaro Herrera 9ee00ef4c7 Fill tuple before HeapSatisfiesHOTAndKeyUpdate
Failing to do this results in almost all updates to system catalogs
being non-HOT updates, because the OID column would differ (not having
been set for the new tuple), which is an indexed column.

While at it, make sure to set the tableoid early in both old and new
tuples as well.  This isn't of much consequence, since that column is
seldom (never?) indexed.

Report and patch from Andres Freund.
2013-02-01 10:43:09 -03:00
Alvaro Herrera b78647a0e6 Restrict infomask bits to set on multixacts
We must only set the bit(s) for the strongest lock held in the tuple;
otherwise, a multixact containing members with exclusive lock and
key-share lock will behave as though only a share lock is held.

This bug was introduced in commit 0ac5ad5134, somewhere along
development, when we allowed a singleton FOR SHARE lock to be
implemented without a MultiXact by using a multi-bit pattern.
I overlooked that GetMultiXactIdHintBits() needed to be tweaked as well.
Previously, we could have the bits for FOR KEY SHARE and FOR UPDATE
simultaneously set and it wouldn't cause a problem.

Per report from digoal@126.com
2013-01-31 19:35:31 -03:00
Simon Riggs 5c54f63fd6 Fix rare missing cancellations in Hot Standby.
The machinery around XLOG_HEAP2_CLEANUP_INFO failed
to correctly pass through the necessary information
on latestRemovedXid, avoiding cancellations in some
infrequent concurrent update/cleanup scenarios.

Backpatchable fix to 9.0

Detailed bug report and fix by Noah Misch,
backpatchable version by me.
2013-01-24 14:19:29 +00:00
Alvaro Herrera 0ac5ad5134 Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE".  These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE".  UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.

Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.

The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid.  Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates.  This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed.  pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.

Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header.  This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.

Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)

With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.

As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.

Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane.  There's probably room for several more tests.

There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it.  Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.

This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
	AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
	1290721684-sup-3951@alvh.no-ip.org
	1294953201-sup-2099@alvh.no-ip.org
	1320343602-sup-2290@alvh.no-ip.org
	1339690386-sup-8927@alvh.no-ip.org
	4FE5FF020200002500048A3D@gw.wicourts.gov
	4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 12:04:59 -03:00
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Alvaro Herrera 5ab3af46dd Remove obsolete XLogRecPtr macros
This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance.
These were useful for brevity when XLogRecPtrs were split in
xlogid/xrecoff; but now that they are simple uint64's, they are just
clutter.  The only downside to making this change would be ease of
backporting patches, but that has been negated by other substantive
changes to the involved code anyway.  The clarity of simpler expressions
makes the change worthwhile.

Most of the changes are mechanical, but in a couple of places, the patch
author chose to invert the operator sense, making the code flow more
logical (and more in line with preceding comments).

Author: Andres Freund
Eyeballed by Dimitri Fontaine and Alvaro Herrera
2012-12-28 13:06:15 -03:00
Robert Haas 75758a6ff0 Update comment in heapgetpage() regarding PD_ALL_VISIBLE vs. Hot Standby.
Pavan Deolasee, slightly modified by me
2012-12-14 15:44:38 -05:00
Heikki Linnakangas 6264cd3d69 In multi-insert, don't go into infinite loop on a huge tuple and fillfactor.
If a tuple is larger than page size minus space reserved for fillfactor,
heap_multi_insert would never find a page that it fits in and repeatedly ask
for a new page from RelationGetBufferForTuple. If a tuple is too large to
fit on any page, taking fillfactor into account, RelationGetBufferForTuple
will always expand the relation. In a normal insert, heap_insert will accept
that and put the tuple on the new page. heap_multi_insert, however, does a
fillfactor check of its own, and doesn't accept the newly-extended page
RelationGetBufferForTuple returns, even though there is no other choice to
make the tuple fit.

Fix that by making the logic in heap_multi_insert more like the heap_insert
logic. The first tuple is always put on the page RelationGetBufferForTuple
gives us, and the fillfactor check is only applied to the subsequent tuples.

Report from David Gould, although I didn't use his patch.
2012-12-12 13:54:42 +02:00
Simon Riggs 5457a130d3 Reduce scope of changes for COPY FREEZE.
Allow support only for freezing tuples by explicit
command. Previous coding mistakenly extended
slightly beyond what was agreed as correct on -hackers.
So essentially a partial revoke of earlier work,
leaving just the COPY FREEZE command.
2012-12-02 20:52:52 +00:00
Simon Riggs 8de72b66a2 COPY FREEZE and mark committed on fresh tables.
When a relfilenode is created in this subtransaction or
a committed child transaction and it cannot otherwise
be seen by our own process, mark tuples committed ahead
of transaction commit for all COPY commands in same
transaction. If FREEZE specified on COPY
and pre-conditions met then rows will also be frozen.
Both options designed to avoid revisiting rows after commit,
increasing performance of subsequent commands after
data load and upgrade. pg_restore changes later.

Simon Riggs, review comments from Heikki Linnakangas, Noah Misch and design
input from Tom Lane, Robert Haas and Kevin Grittner
2012-12-01 12:54:20 +00:00
Tom Lane 3c84046490 Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654db, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation.  The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY.  This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions.  Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.

To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished.  (This
change obviously is only possible in HEAD.  This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)

In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update.  The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption.  This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.

In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate.  These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.

Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation.  Previously we
could have been left with stale values of some fields in an index relcache
entry.  It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.

In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.

This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.

Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-28 21:26:01 -05:00
Alvaro Herrera 1577b46b7c Split out rmgr rm_desc functions into their own files
This is necessary (but not sufficient) to have them compilable outside
of a backend environment.
2012-11-28 13:01:15 -03:00
Tom Lane 3bbf668de9 Fix multiple problems in WAL replay.
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states.  This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.

The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page.  This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates.  Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another.  Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.

To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time.  This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images.  We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods.  A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.

In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images.  In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4.  They are
still there in the header files in previous branches, but are no longer
used by the code.

In addition, fix some other bugs identified in the course of making these
changes:

spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already.  This would result in permanent index corruption,
not just transient failure of concurrent queries.

Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image.  A plain exclusive lock seems sufficient, so change to
that.

Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.

Back-patch to 9.0, where hot standby was introduced.  Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records.  Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
2012-11-12 22:05:53 -05:00
Kevin Grittner 6868ed7491 Throw error if expiring tuple is again updated or deleted.
This prevents surprising behavior when a FOR EACH ROW trigger
BEFORE UPDATE or BEFORE DELETE directly or indirectly updates or
deletes the the old row.  Prior to this patch the requested action
on the row could be silently ignored while all triggered actions
based on the occurence of the requested action could be committed.
One example of how this could happen is if the BEFORE DELETE
trigger for a "parent" row deleted "children" which had trigger
functions to update summary or status data on the parent.

This also prevents similar surprising problems if the query has a
volatile function which updates a target row while it is already
being updated.

There are related issues present in FOR UPDATE cursors and READ
COMMITTED queries which are not handled by this patch.  These
issues need further evalution to determine what change, if any, is
needed.

Where the new error messages are generated, in most cases the best
fix will be to move code from the BEFORE trigger to an AFTER
trigger.  Where this is not feasible, the trigger can avoid the
error by re-issuing the triggering statement and returning NULL.

Documentation changes will be submitted in a separate patch.

Kevin Grittner and Tom Lane with input from Florian Pflug and
Robert Haas, based on problems encountered during conversion of
Wisconsin Circuit Court trigger logic to plpgsql triggers.
2012-10-26 14:55:36 -05:00
Tom Lane 96cc18eef6 Put back AcceptInvalidationMessages calls in heap_openrv(_extended).
These calls were removed in commit 4240e429d0
as part of a general refactoring and improvement of DDL locking.  However,
there's a problem not solved by the rewrite, which is that GRANT/REVOKE
update pg_class.relacl without taking any particular lock on the target
table as such.  If another backend fails to do AcceptInvalidationMessages,
it won't notice a recently-committed change in ACLs.  Bug #7557 from Piotr
Czachur demonstrates that there's at least one code path in 9.2.0 in which
a command fails to do any AcceptInvalidationMessages calls at all, if the
current transaction already holds all the locks it will need.

Since we're hard up against the release deadline for 9.2.1, fix this by
putting back the AcceptInvalidationMessages calls in heap_openrv and
heap_openrv_extended, thereby restoring the historical behavior in this
area.  We ought to look for a more elegant and perhaps more bulletproof
solution, but there's no time for that right now.
2012-09-19 17:10:37 -04:00
Alvaro Herrera c219d9b0a5 Split tuple struct defs from htup.h to htup_details.h
This reduces unnecessary exposure of other headers through htup.h, which
is very widely included by many files.

I have chosen to move the function prototypes to the new file as well,
because that means htup.h no longer needs to include tupdesc.h.  In
itself this doesn't have much effect in indirect inclusion of tupdesc.h
throughout the tree, because it's also required by execnodes.h; but it's
something to explore in the future, and it seemed best to do the htup.h
change now while I'm busy with it.
2012-08-30 16:52:35 -04:00
Alvaro Herrera 21c09e99dc Split heapam_xlog.h from heapam.h
The heapam XLog functions are used by other modules, not all of which
are interested in the rest of the heapam API.  With this, we let them
get just the XLog stuff in which they are interested and not pollute
them with unrelated includes.

Also, since heapam.h no longer requires xlog.h, many files that do
include heapam.h no longer get xlog.h automatically, including a few
headers.  This is useful because heapam.h is getting pulled in by
execnodes.h, which is in turn included by a lot of files.
2012-08-28 19:02:00 -04:00
Bruce Momjian 3325957656 Delete inaccurate C comment about FSM and adding pages, per Robert Haas. 2012-08-16 19:02:58 -04:00
Robert Haas 6cd015bea3 Add new function log_newpage_buffer.
When I implemented the ginbuildempty() function as part of
implementing unlogged tables, I falsified the note in the header
comment for log_newpage.  Although we could fix that up by changing
the comment, it seems cleaner to add a new function which is
specifically intended to handle this case.  So do that.
2012-06-14 10:11:16 -04:00
Bruce Momjian 927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Robert Haas b50991eedb Fix more crash-safe visibility map bugs, and improve comments.
In lazy_scan_heap, we could issue bogus warnings about incorrect
information in the visibility map, because we checked the visibility
map bit before locking the heap page, creating a race condition.  Fix
by rechecking the visibility map bit before we complain.  Rejigger
some related logic so that we rely on the possibly-outdated
all_visible_according_to_vm value as little as possible.

In heap_multi_insert, it's not safe to clear the visibility map bit
before beginning the critical section.  The visibility map is not
crash-safe unless we treat clearing the bit as a critical operation.
Specifically, if the transaction were to error out after we set the
bit and before entering the critical section, we could end up writing
the heap page to disk (with the bit cleared) and crashing before the
visibility map page made it to disk.  That would be bad.  heap_insert
has this correct, but somehow the order of operations got rearranged
when heap_multi_insert was added.

Also, add some more comments to visibilitymap_test, lazy_scan_heap,
and IndexOnlyNext, expounding on concurrency issues.

Per extensive code review by Andres Freund, and further review by Tom
Lane, who also made the original report about the bogus warnings.
2012-06-07 12:48:13 -04:00
Simon Riggs a2b516dab9 Only throw recovery conflicts when InHotStandby. Bug fix to recent
patch to allow Index Only Scans on Hot Standby.

Bug report from Jaime Casanova
2012-05-31 13:11:47 +01:00
Tom Lane ed962fd712 Ensure that seqscans check for interrupts at least once per page.
If a seqscan encounters many consecutive pages containing only dead tuples,
it can remain in the loop in heapgettup for a long time, and there was no
CHECK_FOR_INTERRUPTS anywhere in that loop.  This meant there were
real-world situations where a query would be effectively uncancelable for
long stretches.  Add a check placed to occur once per page, which should be
enough to provide reasonable response time without adding any measurable
overhead.

Report and patch by Merlin Moncure (though I tweaked it a bit).
Back-patch to all supported branches.
2012-05-22 19:42:05 -04:00
Heikki Linnakangas 6593c5b5dc Fix bug in freespace calculation in heap_multi_insert().
If the amount of freespace on page was less than the amount reserved by
fillfactor, the calculation would underflow.

This fixes bug #6643 reported by Tomonari Katsumata.
2012-05-16 14:13:06 +03:00
Robert Haas 0038110421 Avoid repeated CLOG access from heap_hot_search_buffer.
At the time we check whether the tuple is dead to all running
transactions, we've already verified that it isn't visible to our
scan, setting hint bits if appropriate.  So there's no need to
recheck CLOG for the all-dead test we do just a moment later.
So, add HeapTupleIsSurelyDead() to test the appropriate condition
under the assumption that all relevant hit bits are already set.

Review by Tom Lane.
2012-05-02 12:40:07 -04:00
Robert Haas 3424bff90f Prevent index-only scans from returning wrong answers under Hot Standby.
The alternative of disallowing index-only scans in HS operation was
discussed, but the consensus was that it was better to treat marking
a page all-visible as a recovery conflict for snapshots that could still
fail to see XIDs on that page.  We may in the future try to soften this,
so that we simply force index scans to do heap fetches in cases where
this may be an issue, rather than throwing a hard conflict.
2012-04-26 20:00:21 -04:00
Robert Haas 5d4b60f2f2 Lots of doc corrections.
Josh Kupershmidt
2012-04-23 22:43:09 -04:00
Robert Haas 7386089d23 Code cleanup for heap_freeze_tuple.
It used to be case that lazy vacuum could call this function with only
a shared lock on the buffer, but neither lazy vacuum nor any other
code path does that any more.  Simplify the code accordingly and clean
up some related, obsolete comments.
2012-03-26 11:03:06 -04:00
Heikki Linnakangas 21b1634275 Fix heap_multi_insert to set t_self field in the caller's tuples.
If tuples were toasted, heap_multi_insert didn't update the ctid on the
original tuples. This caused a failure if there was an after trigger
(including a foreign key), on the table, and a tuple got toasted.

Per off-list report and test case from Ted Phelps
2012-02-13 10:20:50 +02:00
Robert Haas b4e0741727 Avoid re-checking for visibility map extension too frequently.
When testing bits (but not when setting or clearing them), we now
won't check whether the map has been extended.  This significantly
improves performance in the case where the visibility map doesn't
exist yet, by avoiding an extra system call per tuple.  To make
sure backends notice eventually, send an smgr inval on VM extension.

Dean Rasheed, with minor modifications by me.
2012-02-01 20:35:42 -05:00
Robert Haas 754b8140a1 fastgetattr is in access/htup.h, not access/heapam.h
Noted by Peter Geoghegan
2012-01-16 20:37:01 -05:00
Tom Lane 21b446dd09 Fix CLUSTER/VACUUM FULL for toast values owned by recently-updated rows.
In commit 7b0d0e9356, I made CLUSTER and
VACUUM FULL try to preserve toast value OIDs from the original toast table
to the new one.  However, if we have to copy both live and recently-dead
versions of a row that has a toasted column, those versions may well
reference the same toast value with the same OID.  The patch then led to
duplicate-key failures as we tried to insert the toast value twice with the
same OID.  (The previous behavior was not very desirable either, since it
would have silently inserted the same value twice with different OIDs.
That wastes space, but what's worse is that the toast values inserted for
already-dead heap rows would not be reclaimed by subsequent ordinary
VACUUMs, since they go into the new toast table marked live not deleted.)

To fix, check if the copied OID already exists in the new toast table, and
if so, assume that it stores the desired value.  This is reasonably safe
since the only case where we will copy an OID from a previous toast pointer
is when toast_insert_or_update was given that toast pointer and so we just
pulled the data from the old table; if we got two different values that way
then we have big problems anyway.  We do have to assume that no other
backend is inserting items into the new toast table concurrently, but
that's surely safe for CLUSTER and VACUUM FULL.

Per bug #6393 from Maxim Boguk.  Back-patch to 9.0, same as the previous
patch.
2012-01-12 16:40:14 -05:00
Bruce Momjian e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Robert Haas 2ad36c4e44 Improve table locking behavior in the face of current DDL.
In the previous coding, callers were faced with an awkward choice:
look up the name, do permissions checks, and then lock the table; or
look up the name, lock the table, and then do permissions checks.
The first choice was wrong because the results of the name lookup
and permissions checks might be out-of-date by the time the table
lock was acquired, while the second allowed a user with no privileges
to interfere with access to a table by users who do have privileges
(e.g. if a malicious backend queues up for an AccessExclusiveLock on
a table on which AccessShareLock is already held, further attempts
to access the table will be blocked until the AccessExclusiveLock
is obtained and the malicious backend's transaction rolls back).

To fix, allow callers of RangeVarGetRelid() to pass a callback which
gets executed after performing the name lookup but before acquiring
the relation lock.  If the name lookup is retried (because
invalidation messages are received), the callback will be re-executed
as well, so we get the best of both worlds.  RangeVarGetRelid() is
renamed to RangeVarGetRelidExtended(); callers not wishing to supply
a callback can continue to invoke it as RangeVarGetRelid(), which is
now a macro.  Since the only one caller that uses nowait = true now
passes a callback anyway, the RangeVarGetRelid() macro defaults nowait
as well.  The callback can also be used for supplemental locking - for
example, REINDEX INDEX needs to acquire the table lock before the index
lock to reduce deadlock possibilities.

There's a lot more work to be done here to fix all the cases where this
can be a problem, but this commit provides the general infrastructure
and fixes the following specific cases: REINDEX INDEX, REINDEX TABLE,
LOCK TABLE, and and DROP TABLE/INDEX/SEQUENCE/VIEW/FOREIGN TABLE.

Per discussion with Noah Misch and Alvaro Herrera.
2011-11-30 10:27:00 -05:00
Heikki Linnakangas dea5f6cefe Take fillfactor into account in the new COPY bulk heap insert code.
Jeff Janes
2011-11-26 12:11:00 +02:00
Heikki Linnakangas 2e02280726 Fix another bug in the redo of COPY batches.
I got alignment wrong in the redo routine. Spotted by redoing the log
genereated by copy regression test.
2011-11-10 12:21:43 +02:00
Heikki Linnakangas f81648cb1e Fix bugs in the COPY heap-insert batching patch.
Forgot to call RestoreBkpBlocks() in the redo-function, as pointed out by
Simon Riggs. In redo of a regular heap insert, it's taken care of in
heap_redo(), but this new record type uses the heap2 RM, and heap2_redo()
does not take care of that for you.

Also, failed to reset the vmbuffer and all_visibile_cleared local variables
after switching to a new buffer.
2011-11-09 21:28:25 +02:00
Heikki Linnakangas d326d9e8ea In COPY, insert tuples to the heap in batches.
This greatly reduces the WAL volume, especially when the table is narrow.
The overhead of locking the heap page is also reduced. Reduced WAL traffic
also makes it scale a lot better, if you run multiple COPY processes at
the same time.
2011-11-09 10:54:41 +02:00
Robert Haas bbb6e559c4 Make VACUUM avoid waiting for a cleanup lock, where possible.
In a regular VACUUM, it's OK to skip pages for which a cleanup lock
isn't immediately available; the next VACUUM will deal with them.  If
we're scanning the entire relation to advance relfrozenxid, we might
need to wait, but only if there are tuples on the page that actually
require freezing.  These changes should greatly reduce the incidence
of of vacuum processes getting "stuck".

Simon Riggs and Robert Haas
2011-11-07 21:39:40 -05:00
Tom Lane 039680affb Don't assume that a tuple's header size is unchanged during toasting.
This assumption can be wrong when the toaster is passed a raw on-disk
tuple, because the tuple might pre-date an ALTER TABLE ADD COLUMN operation
that added columns without rewriting the table.  In such a case the tuple's
natts value is smaller than what we expect from the tuple descriptor, and
so its t_hoff value could be smaller too.  In fact, the tuple might not
have a null bitmap at all, and yet our current opinion of it is that it
contains some trailing nulls.

In such a situation, toast_insert_or_update did the wrong thing, because
to save a few lines of code it would use the old t_hoff value as the offset
where heap_fill_tuple should start filling data.  This did not leave enough
room for the new nulls bitmap, with the result that the first few bytes of
data could be overwritten with null flag bits, as in a recent report from
Hubert Depesz Lubaczewski.

The particular case reported requires ALTER TABLE ADD COLUMN followed by
CREATE TABLE AS SELECT * FROM ... or INSERT ... SELECT * FROM ..., and
further requires that there be some out-of-line toasted fields in one of
the tuples to be copied; else we'll not reach the troublesome code.
The problem can only manifest in this form in 8.4 and later, because
before commit a77eaa6a95, CREATE TABLE AS or
INSERT/SELECT wouldn't result in raw disk tuples getting passed directly
to heap_insert --- there would always have been at least a junkfilter in
between, and that would reconstitute the tuple header with an up-to-date
t_natts and hence t_hoff.  But I'm backpatching the tuptoaster change all
the way anyway, because I'm not convinced there are no older code paths
that present a similar risk.
2011-11-04 23:22:50 -04:00
Tom Lane 08e261cbc9 Fix race condition with toast table access from a stale syscache entry.
If a tuple in a syscache contains an out-of-line toasted field, and we
try to fetch that field shortly after some other transaction has committed
an update or deletion of the tuple, there is a race condition: vacuum
could come along and remove the toast tuples before we can fetch them.
This leads to transient failures like "missing chunk number 0 for toast
value NNNNN in pg_toast_2619", as seen in recent reports from Andrew
Hammond and Tim Uckun.

The design idea of syscache is that access to stale syscache entries
should be prevented by relation-level locks, but that fails for at least
two cases where toasted fields are possible: ANALYZE updates pg_statistic
rows without locking out sessions that might want to plan queries on the
same table, and CREATE OR REPLACE FUNCTION updates pg_proc rows without
any meaningful lock at all.

The least risky fix seems to be an idea that Heikki suggested when we
were dealing with a related problem back in August: forcibly detoast any
out-of-line fields before putting a tuple into syscache in the first place.
This avoids the problem because at the time we fetch the parent tuple from
the catalog, we should be holding an MVCC snapshot that will prevent
removal of the toast tuples, even if the parent tuple is outdated
immediately after we fetch it.  (Note: I'm not convinced that this
statement holds true at every instant where we could be fetching a syscache
entry at all, but it does appear to hold true at the times where we could
fetch an entry that could have a toasted field.  We will need to be a bit
wary of adding toast tables to low-level catalogs that don't have them
already.)  An additional benefit is that subsequent uses of the syscache
entry should be faster, since they won't have to detoast the field.

Back-patch to all supported versions.  The problem is significantly harder
to reproduce in pre-9.0 releases, because of their willingness to flush
every entry in a syscache whenever the underlying catalog is vacuumed
(cf CatalogCacheFlushRelation); but there is still a window for trouble.
2011-11-01 19:49:58 -04:00
Robert Haas fae54e4a16 Update visibilitymap.c header comments.
Recent work on index-only scans left this somewhat out of date.
2011-10-29 14:46:59 -04:00
Tom Lane e6858e6657 Measure the number of all-visible pages for use in index-only scan costing.
Add a column pg_class.relallvisible to remember the number of pages that
were all-visible according to the visibility map as of the last VACUUM
(or ANALYZE, or some other operations that update pg_class.relpages).
Use relallvisible/relpages, instead of an arbitrary constant, to estimate
how many heap page fetches can be avoided during an index-only scan.

This is pretty primitive and will no doubt see refinements once we've
acquired more field experience with the index-only scan mechanism, but
it's way better than using a constant.

Note: I had to adjust an underspecified query in the window.sql regression
test, because it was changing answers when the plan changed to use an
index-only scan.  Some of the adjacent tests perhaps should be adjusted
as well, but I didn't do that here.
2011-10-14 17:23:46 -04:00