Commit Graph

176 Commits

Author SHA1 Message Date
Teodor Sigaev 8224de4f42 Indexes with INCLUDE columns and their support in B-tree
This patch introduces INCLUDE clause to index definition.  This clause
specifies a list of columns which will be included as a non-key part in
the index.  The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans.  Also, such columns don't need to have
appropriate operator classes.  Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.

Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine.  For now, only B-tree indexes support INCLUDE clause.

In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys).  Therefore, B-tree indexes
now might have variable number of attributes.  This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid.  Free 13th bit of t_info is used for indicating
that.  This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.

Bump catalog version

Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
			 David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
2018-04-07 23:00:39 +03:00
Teodor Sigaev 0a64b45152 Fix handling of non-upgraded B-tree metapages
857f9c36 bumps B-tree metapage version while upgrade is performed "on the fly"
when needed. However, some asserts fired when old version metapage was
cached to rel->rd_amcache. Despite new metadata fields are never used from
rel->rd_amcache, that needs to be fixed. This patch introduces metadata
upgrade during its caching, which fills unavailable fields with their default
values. contrib/pageinspect is also patched to handle non-upgraded metapages
in the same way.

Author: Alexander Korotkov
2018-04-05 17:56:00 +03:00
Teodor Sigaev 17d8beb4f5 Remove unused vars and mark assert-only vars
Kyotaro HORIGUCHI
2018-04-05 13:16:15 +03:00
Teodor Sigaev 51e6562324 Fix typo
Masahiko Sawada
2018-04-05 13:04:18 +03:00
Teodor Sigaev 857f9c36cd Skip full index scan during cleanup of B-tree indexes when possible
Vacuum of index consists from two stages: multiple (zero of more) ambulkdelete
calls and one amvacuumcleanup call. When workload on particular table
is append-only, then autovacuum isn't intended to touch this table. However,
user may run vacuum manually in order to fill visibility map and get benefits
of index-only scans. Then ambulkdelete wouldn't be called for indexes
of such table (because no heap tuples were deleted), only amvacuumcleanup would
be called In this case, amvacuumcleanup would perform full index scan for
two objectives: put recyclable pages into free space map and update index
statistics.

This patch allows btvacuumclanup to skip full index scan when two conditions
are satisfied: no pages are going to be put into free space map and index
statistics isn't stalled. In order to check first condition, we store
oldest btpo_xact in the meta-page. When it's precedes RecentGlobalXmin, then
there are some recyclable pages. In order to check second condition we store
number of heap tuples observed during previous full index scan by cleanup.
If fraction of newly inserted tuples is less than
vacuum_cleanup_index_scale_factor, then statistics isn't considered to be
stalled. vacuum_cleanup_index_scale_factor can be defined as both reloption and GUC (default).

This patch bumps B-tree meta-page version. Upgrade of meta-page is performed
"on the fly": during VACUUM meta-page is rewritten with new version. No special
handling in pg_upgrade is required.

Author: Masahiko Sawada, Alexander Korotkov
Review by: Peter Geoghegan, Kyotaro Horiguchi, Alexander Korotkov, Yura Sokolov
Discussion: https://www.postgresql.org/message-id/flat/CAD21AoAX+d2oD_nrd9O2YkpzHaFr=uQeGr9s1rKC3O4ENc568g@mail.gmail.com
2018-04-04 19:29:00 +03:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Tom Lane 4c11d2c559 Flag index metapages as standard-format in xlog.c calls.
btree, hash, and bloom indexes all set up their metapages in standard
format (that is, with pd_lower and pd_upper correctly delimiting the
unused area); but they mostly didn't inform the xlog routines of this.
When calling log_newpage[_buffer], this is bad because it loses the
opportunity to compress unused data out of the WAL record.  When
calling XLogRegisterBuffer, it's not such a performance problem because
all of these call sites also use REGBUF_WILL_INIT, preventing an FPI
image from being written.  But it's still a good idea to provide the
flag when relevant, because that aids WAL consistency checking.

This completes the project of getting all the in-core index AMs to
handle their metapage WAL operations similarly.

Amit Kapila, reviewed by Michael Paquier

Discussion: https://postgr.es/m/0d273805-0e9e-ec1a-cb84-d4da400b8f85@lab.ntt.co.jp
2017-11-03 16:31:32 -04:00
Tom Lane eb5c404b17 Minor code-cleanliness improvements for btree.
Make the btree page-flags test macros (P_ISLEAF and friends) return clean
boolean values, rather than values that might not fit in a bool.  Use them
in a few places that were randomly referencing the flag bits directly.

In passing, change access/nbtree/'s only direct use of BUFFER_LOCK_SHARE to
BT_READ.  (Some think we should go the other way, but as long as we have
BT_READ/BT_WRITE, let's use them consistently.)

Masahiko Sawada, reviewed by Doug Doole

Discussion: https://postgr.es/m/CAD21AoBmWPeN=WBB5Jvyz_Nt3rmW1ebUyAnk3ZbJP3RMXALJog@mail.gmail.com
2017-09-18 16:36:28 -04:00
Tom Lane 382ceffdf7 Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.

By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis.  However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent.  That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.

This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:35:54 -04:00
Robert Haas 8da9a22636 Split index xlog headers from other private index headers.
The xlog-specific headers need to be included in both frontend code -
specifically, pg_waldump - and the backend, but the remainder of the
private headers for each index are only needed by the backend.  By
splitting the xlog stuff out into separate headers, pg_waldump pulls
in fewer backend headers, which is a good thing.

Patch by me, reviewed by Michael Paquier and Andres Freund, per a
complaint from Dilip Kumar.

Discussion: http://postgr.es/m/CA+TgmoZ=F=GkxV0YEv-A8tb+AEGy_Qa7GSiJ8deBKFATnzfEug@mail.gmail.com
2017-02-14 15:37:59 -05:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Tom Lane e89526d4f3 In B-tree page deletion, clean up properly after page deletion failure.
In _bt_unlink_halfdead_page(), we might fail to find an immediate left
sibling of the target page, perhaps because of corruption of the page
sibling links.  The code intends to cope with this by just abandoning
the deletion attempt; but what actually happens is that it fails outright
due to releasing the same buffer lock twice.  (And error recovery masks
a second problem, which is possible leakage of a pin on another page.)
Seems to have been introduced by careless refactoring in commit efada2b8e.
Since there are multiple cases to consider, let's make releasing the buffer
lock in the failure case the responsibility of _bt_unlink_halfdead_page()
not its caller.

Also, avoid fetching the leaf page's left-link again after we've dropped
lock on the page.  This is probably harmless, but it's not exactly good
coding practice.

Per report from Kyotaro Horiguchi.  Back-patch to 9.4 where the faulty code
was introduced.

Discussion: <20160803.173116.111915228.horiguchi.kyotaro@lab.ntt.co.jp>
2016-08-06 14:28:37 -04:00
Kevin Grittner a343e223a5 Revert no-op changes to BufferGetPage()
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature.  Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming).  The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.

This change should have little or no effect on generated executable
code.

Motivated by the back-patching pain of Tom Lane and Robert Haas
2016-04-20 08:31:19 -05:00
Kevin Grittner 848ef42bb8 Add the "snapshot too old" feature
This feature is controlled by a new old_snapshot_threshold GUC.  A
value of -1 disables the feature, and that is the default.  The
value of 0 is just intended for testing.  Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect.  The xmin associated with a transaction ID does
still protect dead tuples.  A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.

This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
2016-04-08 14:36:30 -05:00
Kevin Grittner 8b65cf4c5e Modify BufferGetPage() to prepare for "snapshot too old" feature
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in.  It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point.  This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third.  The follow-on patch will change the places where the test
needs to be made.
2016-04-08 14:30:10 -05:00
Teodor Sigaev 8b99edefca Revert CREATE INDEX ... INCLUDING ...
It's not ready yet, revert two commits
690c543550 - unstable test output
386e3d7609 - patch itself
2016-04-08 21:52:13 +03:00
Teodor Sigaev 386e3d7609 CREATE INDEX ... INCLUDING (column[, ...])
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.

Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
2016-04-08 19:45:59 +03:00
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Bruce Momjian 807b9e0dff pgindent run for 9.5 2015-05-23 21:35:49 -04:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Heikki Linnakangas 2c03216d83 Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.

There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.

This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.

For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.

The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.

Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 18:46:41 +02:00
Heikki Linnakangas c73669c0e0 Fix WAL-logging of B-tree "unlink halfdead page" operation.
There was some confusion on how to record the case that the operation
unlinks the last non-leaf page in the branch being deleted.
_bt_unlink_halfdead_page set the "topdead" field in the WAL record to
the leaf page, but the redo routine assumed that it would be an invalid
block number in that case. This commit fixes _bt_unlink_halfdead_page to
do what the redo routine expected.

This code is new in 9.4, so backpatch there.
2014-11-17 18:45:46 +02:00
Heikki Linnakangas 2076db2aea Move the backup-block logic from XLogInsert to a new file, xloginsert.c.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.

Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.

Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-11-06 13:55:36 +02:00
Peter Eisentraut b7a08c8028 Message improvements 2014-10-12 01:06:35 -04:00
Heikki Linnakangas 8da3183780 Fix error when trying to delete page with half-dead left sibling.
The new page deletion code didn't cope with the case the target page's
right sibling was marked half-dead. It failed a sanity check which checked
that the downlinks in the parent page match the lower level, because a
half-dead page has no downlink. To cope, check for that condition, and
just give up on the deletion if it happens. The vacuum will finish the
deletion of the half-dead page when it gets there, and on the next vacuum
after that the empty can be deleted.

Reported by Jeff Janes.
2014-05-25 18:18:09 -04:00
Bruce Momjian 0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not effect backpatching.
2014-05-06 12:12:18 -04:00
Heikki Linnakangas 4fafc4ecd9 Cleanup of new b-tree page deletion code.
When marking a branch as half-dead, a pointer to the top of the branch is
stored in the leaf block's hi-key. During normal operation, the high key
was left in place, and the block number was just stored in the ctid field
of the high key tuple, but in WAL replay, the high key was recreated as a
truncated tuple with zero columns. For the sake of easier debugging, also
truncate the tuple in normal operation, so that the page is identical
after WAL replay. Also, rename the 'downlink' field in the WAL record to
'topparent', as that seems like a more descriptive name. And make sure
it's set to invalid when unlinking the leaf page.
2014-04-23 10:19:54 +03:00
Heikki Linnakangas 77fe2b6d79 Fix bug in new B-tree page deletion code.
When modifying a page, must hold an exclusive lock. A shared lock is
obviously not good enough.
2014-04-22 15:34:54 +03:00
Heikki Linnakangas 40dae7ec53 Make the handling of interrupted B-tree page splits more robust.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.

Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.

I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.

Reviewed by Peter Geoghegan.
2014-03-18 20:50:44 +02:00
Heikki Linnakangas efada2b8e9 Fix race condition in B-tree page deletion.
In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.
2014-03-14 16:07:19 +02:00
Tom Lane ac8bc3b6e4 Remove unnecessary relcache flushes after changing btree metapages.
These flushes were added in my commit d2896a9ed, which added the btree
logic that keeps a cached copy of the index metapage data in index relcache
entries.  The idea was to ensure that other backends would promptly update
their cached copies after a change.  However, this is not really necessary,
since _bt_getroot() has adequate defenses against believing a stale root
page link, and _bt_getrootheight() doesn't have to be 100% right.
Moreover, if it were necessary, a relcache flush would be an unreliable way
to do it, since the sinval mechanism believes that relcache flush requests
represent transactional updates, and therefore discards them on transaction
rollback.  Therefore, we might as well drop these flush requests and save
the time to rebuild the whole relcache entry after a metapage change.

If we ever try to support in-place truncation of btree indexes, it might
be necessary to revisit this issue so that _bt_getroot() can't get caught
by trying to follow a metapage link to a page that no longer exists.
A possible solution to that is to make use of an smgr, rather than
relcache, inval request to force other backends to discard their cached
metapages.  But for the moment this is not worth pursuing.
2014-02-05 13:43:46 -05:00
Bruce Momjian 7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Bruce Momjian 9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Simon Riggs bb7cc2623f Remove PageSetTLI and rename pd_tli to pd_checksum
Remove use of PageSetTLI() from all page manipulation functions
and adjust README to indicate change in the way we make changes
to pages. Repurpose those bytes into the pd_checksum field and
explain how that works in comments about page header.

Refactoring ahead of actual feature patch which would make use
of the checksum field, arriving later.

Jeff Davis, with comments and doc changes by Simon Riggs
Direction suggested by Robert Haas; many others providing
review comments.
2013-03-18 13:46:42 +00:00
Tom Lane 31f38f28b0 Redesign the planner's handling of index-descent cost estimation.
Historically we've used a couple of very ad-hoc fudge factors to try to
get the right results when indexes of different sizes would satisfy a
query with the same number of index leaf tuples being visited.  In
commit 21a39de580 I tweaked one of these
fudge factors, with results that proved disastrous for larger indexes.
Commit bf01e34b55 fudged it some more,
but still with not a lot of principle behind it.

What seems like a better way to address these issues is to explicitly model
index-descent costs, since that's what's really at stake when considering
diferent indexes with similar leaf-page-level costs.  We tried that once
long ago, and found that charging random_page_cost per page descended
through was way too much, because upper btree levels tend to stay in cache
in real-world workloads.  However, there's still CPU costs to think about,
and the previous fudge factors can be seen as a crude attempt to account
for those costs.  So this patch replaces those fudge factors with explicit
charges for the number of tuple comparisons needed to descend the index
tree, plus a small charge per page touched in the descent.  The cost
multipliers are chosen so that the resulting charges are in the vicinity of
the historical (pre-9.2) fudge factors for indexes of up to about a million
tuples, while not ballooning unreasonably beyond that, as the old fudge
factor did (even more so in 9.2).

To make this work accurately for btree indexes, add some code that allows
extraction of the known root-page height from a btree.  There's no
equivalent number readily available for other index types, but we can use
the log of the number of index pages as an approximate substitute.

This seems like too much of a behavioral change to risk back-patching,
but it should improve matters going forward.  In 9.2 I'll just revert
the fudge-factor change.
2013-01-11 12:56:58 -05:00
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Bruce Momjian 927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Simon Riggs d3abbbebe5 Avoid early reuse of btree pages, causing incorrect query results.
When we allowed read-only transactions to skip assigning XIDs
we introduced the possibility that a fully deleted btree page
could be reused. This broke the index link sequence which could
then lead to indexscans silently returning fewer rows than would
have been correct. The actual incidence of silent errors from
this is thought to be very low because of the exact workload
required and locking pre-conditions. Fix is to remove pages only
if index page opaque->btpo.xact precedes RecentGlobalXmin.

Noah Misch, reviewed by Simon Riggs
2012-06-01 12:21:45 +01:00
Tom Lane 593a9631a7 Don't clear btpo_cycleid during _bt_vacuum_one_page.
When "vacuuming" a single btree page by removing LP_DEAD tuples, we are not
actually within a vacuum operation, but rather in an ordinary insertion
process that could well be running concurrently with a vacuum.  So clearing
the cycleid is incorrect, and could cause the concurrent vacuum to miss
removing tuples that it needs to remove.  This is a longstanding bug
introduced by commit e6284649b9 of
2006-07-25.  I believe it explains Maxim Boguk's recent report of index
corruption, and probably some other previously unexplained reports.

In 9.0 and up this is a one-line fix; before that we need to introduce a
flag to tell _bt_delitems what to do.
2012-02-21 15:03:36 -05:00
Tom Lane 9789c99d01 Cosmetic cleanup for commit a760893dbd.
Mostly, fixing overlooked comments.
2012-02-21 14:14:16 -05:00
Bruce Momjian e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Simon Riggs 2d2841a56c Continue to allow VACUUM to mark last block of index dirty
even when there is no work to do. Further analysis required.
Revert of patch c1458cc495
2011-11-22 09:48:06 +00:00
Simon Riggs c1458cc495 Avoid marking buffer dirty when VACUUM has no work to do.
When wal_level = 'hot_standby' we touched the last page of the
relation during a VACUUM, even if nothing else had happened.
That would alter the LSN of the last block and set the mtime
of the relation file unnecessarily. Noted by Thom Brown.
2011-11-18 16:06:53 +00:00
Bruce Momjian 6416a82a62 Remove unnecessary #include references, per pgrminclude script. 2011-09-01 10:04:27 -04:00
Simon Riggs e1cd66f748 Reduce impact of btree page reuse on Hot Standby by fixing off-by-1 error.
WAL records of type XLOG_BTREE_REUSE_PAGE were generated using a
latestRemovedXid one higher than actually needed because xid used was
page opaque->btpo.xact rather than an actually removed xid.
Noticed on an otherwise quiet system by Noah Misch.

Noah Misch and Simon Riggs
2011-06-27 22:12:09 +01:00
Simon Riggs 758bd2a433 Respect Hot Standby controls while recycling btree index pages.
Btree pages were recycled after VACUUM deletes all records on a
page and then a subsequent VACUUM occurs after the RecentXmin
horizon is reached. Using RecentXmin meant that we did not respond
correctly to the user controls provide to avoid Hot Standby
conflicts and so spurious conflicts could be generated in some
workload combinations. We now reuse pages only when we reach
RecentGlobalXmin, which can be much later in the presence of long
running queries and is also controlled by vacuum_defer_cleanup_age
and hot_standby_feedback.

Noah Misch and Simon Riggs
2011-06-16 10:19:10 +01:00
Peter Eisentraut 5caa3479c2 Clean up most -Wunused-but-set-variable warnings from gcc 4.6
This warning is new in gcc 4.6 and part of -Wall.  This patch cleans
up most of the noise, but there are some still warnings that are
trickier to remove.
2011-04-11 22:28:45 +03:00
Bruce Momjian bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Heikki Linnakangas dafaa3efb7 Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.

To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.

A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.

Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.

We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.

Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.

Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-08 00:09:08 +02:00
Bruce Momjian 5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00