Commit Graph

320 Commits

Author SHA1 Message Date
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Peter Geoghegan fcf3b6917b Rename nbtree tuple macros.
Rename two function-style macros, removing the word "inner".  This makes
things more consistent.
2019-12-16 17:49:45 -08:00
Peter Geoghegan 2110f71696 nbtree: Tweak _bt_pgaddtup() comments.
Make it clear that _bt_pgaddtup() truncates the first data item on an
internal page because its key is supposed to be treated as minus
infinity within _bt_compare().
2019-11-18 13:04:53 -08:00
Peter Geoghegan 867d25ccb4 Explain subtlety in nbtree locking protocol.
The Postgres approach to coupling locks during an ascent of the tree is
slightly different to the approach taken by Lehman and Yao.  Add a new
paragraph to the "Differences to the Lehman & Yao algorithm" section of
the nbtree README that explains the similarities and differences.
2019-08-23 20:24:49 -07:00
Peter Geoghegan 9c02cf5661 Remove block number field from nbtree stack.
The initial value of the nbtree stack downlink block number field
recorded during an initial descent of the tree wasn't actually used.
Both _bt_getstackbuf() callers overwrote the value with their own value.

Remove the block number field from the stack struct, and add a child
block number argument to _bt_getstackbuf() in its place.  This makes the
overall design of _bt_getstackbuf() clearer.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzmx+UbXt2YNOUCZ-a04VdXU=S=OHuAuD7Z8uQq-PXTYUg@mail.gmail.com
2019-08-14 11:32:35 -07:00
Peter Eisentraut fd6ec93bf8 Add error codes to some corruption log messages
In some cases we have elog(ERROR) while corruption is certain and we
can give a clear error code ERRCODE_DATA_CORRUPTED or
ERRCODE_INDEX_CORRUPTED.

Author: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://www.postgresql.org/message-id/flat/25F6C686-6442-4A6B-BAF8-A6F7B84B16DE@yandex-team.ru
2019-08-01 11:15:26 +02:00
Amit Kapila 9679345f3c Fix typos.
Reported-by: Alexander Lakhin
Author: Alexander Lakhin
Reviewed-by: Amit Kapila and Tom Lane
Discussion: https://postgr.es/m/7208de98-add8-8537-91c0-f8b089e2928c@gmail.com
2019-05-26 18:28:18 +05:30
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Peter Geoghegan 489e431ba5 Remove obsolete nbtree insertion comment.
Remove a Berkeley-era comment above _bt_insertonpg() that admonishes the
reader to grok Lehman and Yao's paper before making any changes.  This
made a certain amount of sense back when _bt_insertonpg() was
responsible for most of the things that are now spread across
_bt_insertonpg(), _bt_findinsertloc(), _bt_insert_parent(), and
_bt_split(), but it doesn't work like that anymore.

I believe that this comment alludes to the need to "couple" or "crab"
buffer locks as we ascend the tree as page splits cascade upwards.  The
nbtree README already explains this in detail, which seems sufficient.
Besides, the changes to page splits made by commit 40dae7ec53 altered
the exact details of how buffer locks are retained during splits; Lehman
and Yao's original algorithm seems to release the lock on the left child
page/buffer slightly earlier than _bt_insertonpg()/_bt_insert_parent()
can.
2019-05-15 16:53:11 -07:00
Peter Geoghegan ae7291acbc Standardize ItemIdData terminology.
The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous.  Only
ItemPointerData/ItemPointer variables should be called item pointers.

To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers".  The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.

Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
2019-05-13 15:53:39 -07:00
Peter Geoghegan 9b42e71376 Don't leave behind junk nbtree pages during split.
Commit 8fa30f906b reduced the elevel of a number of "can't happen"
_bt_split() errors from PANIC to ERROR.  At the same time, the new right
page buffer for the split could continue to be acquired well before the
critical section.  This was possible because it was relatively
straightforward to make sure that _bt_split() could not throw an error,
with a few specific exceptions.  The exceptional cases were safe because
they involved specific, well understood errors, making it possible to
consistently zero the right page before actually raising an error using
elog().  There was no danger of leaving around a junk page, provided
_bt_split() stuck to this coding rule.

Commit 8224de4f, which introduced INCLUDE indexes, added code to make
_bt_split() truncate away non-key attributes.  This happened at a point
that broke the rule around zeroing the right page in _bt_split().  If
truncation failed (perhaps due to palloc() failure), that would result
in an errant right page buffer with junk contents.  This could confuse
VACUUM when it attempted to delete the page, and should be avoided on
general principle.

To fix, reorganize _bt_split() so that truncation occurs before the new
right page buffer is even acquired.  A junk page/buffer will not be left
behind if _bt_nonkey_truncate()/_bt_truncate() raise an error.

Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Backpatch: 11-, where INCLUDE indexes were introduced.
2019-05-13 10:27:59 -07:00
Peter Geoghegan 7b37f4b02e Correct more obsolete nbtree page split comments.
Commit 3f342839 corrected obsolete comments about buffer locks at the
main _bt_insert_parent() call site, but missed similar obsolete comments
above _bt_insert_parent() itself.  Both sets of comments were rendered
obsolete by commit 40dae7ec53, which made the nbtree page split
algorithm more robust.  Fix the comments that were missed the first time
around now.

In passing, refine a related _bt_insert_parent() comment about
re-finding the parent page to insert new downlink.
2019-05-03 13:34:45 -07:00
Peter Geoghegan 9ee7414ed0 Remove obsolete _bt_insert_parent() comment.
Remove a comment that refers to a coding practice that was fully removed
by commit a8b8f4db, which introduced MarkBufferDirty().  It looks like
the comment was even obsolete before then, since it concerns
write-ordering dependencies with synchronous buffer writes.
2019-04-29 14:14:38 -07:00
Peter Geoghegan 9b10926263 Prevent O(N^2) unique index insertion edge case.
Commit dd299df8 made nbtree treat heap TID as a tiebreaker column,
establishing the principle that there is only one correct location (page
and page offset number) for every index tuple, no matter what.
Insertions of tuples into non-unique indexes proceed as if heap TID
(scan key's scantid) is just another user-attribute value, but
insertions into unique indexes are more delicate.  The TID value in
scantid must initially be omitted to ensure that the unique index
insertion visits every leaf page that duplicates could be on.  The
scantid is set once again after unique checking finishes successfully,
which can force _bt_findinsertloc() to step right one or more times, to
locate the leaf page that the new tuple must be inserted on.

Stepping right within _bt_findinsertloc() was assumed to occur no more
frequently than stepping right within _bt_check_unique(), but there was
one important case where that assumption was incorrect: inserting a
"duplicate" with NULL values.  Since _bt_check_unique() didn't do any
real work in this case, it wasn't appropriate for _bt_findinsertloc() to
behave as if it was finishing off a conventional unique insertion, where
any existing physical duplicate must be dead or recently dead.
_bt_findinsertloc() might have to grovel through a substantial portion
of all of the leaf pages in the index to insert a single tuple, even
when there were no dead tuples.

To fix, treat insertions of tuples with NULLs into a unique index as if
they were insertions into a non-unique index: never unset scantid before
calling _bt_search() to descend the tree, and bypass _bt_check_unique()
entirely.  _bt_check_unique() is no longer responsible for incoming
tuples with NULL values.

Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com
2019-04-23 10:33:57 -07:00
Peter Geoghegan 74eb2176bf Invalidate binary search bounds consistently.
_bt_check_unique() failed to invalidate binary search bounds in the
event of a live conflict following commit e5adcb78.  This resulted in
problems after waiting for the conflicting xact to commit or abort.  The
subsequent call to _bt_check_unique() would restore the initial binary
search bounds, rather than starting a new search.  Fix by explicitly
invalidating bounds when it becomes clear that there is a live conflict
that insertion will have to wait to resolve.

Ashutosh Sharma, with a few additional tweaks by me.

Author: Ashutosh Sharma
Reported-By: Ashutosh Sharma
Diagnosed-By: Ashutosh Sharma
Discussion: https://postgr.es/m/CAE9k0PnQp-qr-UYKMSCzdC2FBzdE4wKP41hZrZvvP26dKLonLg@mail.gmail.com
2019-04-04 09:38:08 -07:00
Andres Freund 71bdc99d0d tableam: Add helper for indexes to check if a corresponding table tuples exist.
This is, likely exclusively, useful to verify that conflicts detected
in a unique index are with live tuples, rather than dead ones.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-25 16:52:55 -07:00
Peter Geoghegan fab2502433 Consider secondary factors during nbtree splits.
Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pg_upgrade'd v3 indexes make use of these optimizations.
Benchmarking has shown that even v3 indexes benefit, despite the fact
that suffix truncation will only truncate non-key attributes in INCLUDE
indexes.  Grouping relatively similar tuples together is beneficial in
and of itself, since it reduces the number of leaf pages that must be
accessed by subsequent index scans.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
2019-03-20 10:12:19 -07:00
Peter Geoghegan dd299df818 Make heap TID a tiebreaker nbtree index column.
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2019-03-20 10:04:01 -07:00
Peter Geoghegan e5adcb789d Refactor nbtree insertion scankeys.
Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.  This is based on a suggestion by Andrey
Lepikhov.

Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner.  This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.

The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc().  This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column.  Unique key binary
searches need to restore lower and upper bounds.  They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
2019-03-20 09:30:57 -07:00
Peter Geoghegan 3f34283973 Correct obsolete nbtree page split comment.
Commit 40dae7ec53, which made the nbtree page split algorithm more
robust, made _bt_insert_parent() only unlock the right child of the
parent page before inserting a new downlink into the parent.  Update a
comment from the Berkeley days claiming that both left and right child
pages are unlocked before the new downlink actually gets inserted.

The claim that it is okay to release both locks early based on Lehman
and Yao's say-so never made much sense.  Lehman and Yao must sometimes
"couple" buffer locks across a pair of internal pages when relocating a
downlink, unlike the corresponding code within _bt_getstack().
2019-03-12 16:40:05 -07:00
Peter Geoghegan 72c7c4e386 Correct obsolete nbtree page split WAL comment.
Commit 2c03216d83, which revamped the WAL record format, failed to
update a comment referencing the old API.  Update the comment.
2019-03-04 12:32:40 -08:00
Peter Geoghegan 2ab23445bc Remove unneeded argument from _bt_getstackbuf().
_bt_getstackbuf() is called at exactly two points following commit
efada2b8e9 (one call site is concerned with page splits, while the
other is concerned with page deletion).  The parent buffer returned by
_bt_getstackbuf() is write-locked in both cases.  Remove the 'access'
argument and make _bt_getstackbuf() assume that callers require a
write-lock.
2019-02-25 17:47:43 -08:00
Andres Freund e7cc78ad43 Remove superfluous tqual.h includes.
Most of these had been obsoleted by 568d4138c / the SnapshotNow
removal.

This is is preparation for moving most of tqual.[ch] into either
snapmgr.h or heapam.h, which in turn is in preparation for pluggable
table AMs.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-01-21 12:15:02 -08:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Peter Geoghegan 61a4480a68 Remove obsolete nbtree duplicate entries comment.
Remove a comment from the Berkeley days claiming that nbtree must
disambiguate duplicate keys within _bt_moveright().  There is no special
care taken around duplicates within _bt_moveright(), at least since
commit 9e85183bfc removed inscrutable _bt_moveright() code to handle
pages full of duplicates.
2018-12-18 21:40:38 -08:00
Peter Geoghegan 60f3cc9553 Correct obsolete nbtree recovery comments.
Commit 40dae7ec53, which made the handling of interrupted nbtree page
splits more robust, removed an nbtree-specific end-of-recovery cleanup
step.  This meant that it was no longer possible to complete an
interrupted page split during recovery.  However, a reference to
recovery as a reason for using a NULL stack while inserting into a
parent page was missed.  Remove the reference.

Remove a similar obsolete reference to recovery that was introduced much
more recently, as part of the btree fastpath optimization enhancement
that made it into Postgres 11 (commit 2b272734, and follow-up commits).

Backpatch: 11-, where the fastpath optimization was introduced.
2018-12-18 16:59:50 -08:00
Alexander Korotkov d2086b08b0 Reduce path length for locking leaf B-tree pages during insertion
In our B-tree implementation appropriate leaf page for new tuple
insertion is acquired using _bt_search() function.  This function always
returns leaf page locked in shared mode.  In order to obtain exclusive
lock, caller have to relock the page.

This commit makes _bt_search() function lock leaf page immediately in
exclusive mode when needed.  That removes unnecessary relock and, in
turn reduces lock contention for B-tree leaf pages.  Our experiments
on multi-core systems showed acceleration up to 4.5 times in corner
case.

Discussion: https://postgr.es/m/CAPpHfduAMDFMNYTCN7VMBsFg_hsf0GqiqXnt%2BbSeaJworwFoig%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Yoshikazu Imai, Simon Riggs, Peter Geoghegan
2018-07-28 00:31:40 +03:00
Peter Geoghegan e915fed291 Correct obsolete unique index insertion comment.
Commit bc292937ae failed to update a comment about unique index
checking.  _bt_insertonpg() is no longer responsible for finding an
insertion location while preventing conflicting insertions.
2018-07-08 10:50:13 -07:00
Teodor Sigaev 08186dc05b Move _bt_upgrademetapage() into critical section.
Any changes on page should be done in critical section, so move
_bt_upgrademetapage into critical section. Improve comment. Found by Amit
Kapila during post-commit review of 857f9c36.

Author: Amit Kapila
2018-05-30 19:45:39 +03:00
Tom Lane bdf46af748 Post-feature-freeze pgindent run.
Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
2018-04-26 14:47:16 -04:00
Teodor Sigaev f97f0c921a Adjust _bt_insertonpg() comments
Remove an obsolete reference to the 'afteritem' argument, which was
removed by commit bc292937.  Add a comment that clarifies how
_bt_insertonpg() indirectly handles the insertion of high key items.

Author: Peter Geoghegan
2018-04-19 11:08:45 +03:00
Teodor Sigaev 075aade436 Adjust INCLUDE index truncation comments and code.
Add several assertions that ensure that we're dealing with a pivot tuple
without non-key attributes where that's expected.  Also, remove the
assertion within _bt_isequal(), restoring the v10 function signature.  A
similar check will be performed for the page highkey within
_bt_moveright() in most cases.  Also avoid dropping all objects within
regression tests, to increase pg_dump test coverage for INCLUDE indexes.

Rather than using infrastructure that's generally intended to be used
with reference counted heap tuple descriptors during truncation, use the
same function that was introduced to store flat TupleDescs in shared
memory (we use a temp palloc'd buffer).  This isn't strictly necessary,
but seems more future-proof than the old approach.  It also lets us
avoid including rel.h within indextuple.c, which was arguably a
modularity violation.  Also, we now call index_deform_tuple() with the
truncated TupleDesc, not the source TupleDesc, since that's more robust,
and saves a few cycles.

In passing, fix a memory leak by pfree'ing truncated pivot tuple memory
during CREATE INDEX.  Also pfree during a page split, just to be
consistent.

Refactor _bt_check_natts() to be more readable.

Author: Peter Geoghegan with some editorization by me
Reviewed by: Alexander Korotkov, Teodor Sigaev
Discussion: https://www.postgresql.org/message-id/CAH2-Wz%3DkCWuXeMrBCopC-tFs3FbiVxQNjjgNKdG2sHxZ5k2y3w%40mail.gmail.com
2018-04-19 08:45:58 +03:00
Heikki Linnakangas 55101549d5 Fix a few typos in comments and variable names.
Author: Michael Paquier
Discussion: https://www.postgresql.org/message-id/20180411075223.GB19732%40paquier.xyz
2018-04-17 11:54:57 -04:00
Andrew Dunstan 8716b264ed minor comment fixes in nbtinsert.c 2018-04-10 18:36:40 -04:00
Andrew Dunstan 074251db67 Adjustments to the btree fastpath optimization.
This optimization was introduced in commit 2b272734. The changes include
some additional comments and documentation, and also these more
substantive changes:
. ensure the optimization is only applied on the leaf node of a tree
whose root is on level 2 or more. It's of little value on small trees.
. Delay calling RelationSetTargetBlock() until after the critical
section of _bt_insertonpg
. ensure the optimization is also applied to unlogged tables.

Pavan Deolasee and Peter Geoghegan with some very light editing from me.

Discussion: https://postgr.es/m/CABOikdO8jhRarNC60nZLktZYhxt+TK8z_V97+Ny499YQdyAfug@mail.gmail.com
2018-04-10 18:21:03 -04:00
Heikki Linnakangas 29d7ebf51e Fix comment on B-tree insertion fastpath condition.
The comment earlier in the function correctly states "and the insertion
key is strictly greater than the first key in this page". That is what
we check here, not "greater than or equal".
2018-04-10 16:57:19 +03:00
Teodor Sigaev 8224de4f42 Indexes with INCLUDE columns and their support in B-tree
This patch introduces INCLUDE clause to index definition.  This clause
specifies a list of columns which will be included as a non-key part in
the index.  The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans.  Also, such columns don't need to have
appropriate operator classes.  Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.

Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine.  For now, only B-tree indexes support INCLUDE clause.

In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys).  Therefore, B-tree indexes
now might have variable number of attributes.  This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid.  Free 13th bit of t_info is used for indicating
that.  This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.

Bump catalog version

Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
			 David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
2018-04-07 23:00:39 +03:00
Teodor Sigaev 857f9c36cd Skip full index scan during cleanup of B-tree indexes when possible
Vacuum of index consists from two stages: multiple (zero of more) ambulkdelete
calls and one amvacuumcleanup call. When workload on particular table
is append-only, then autovacuum isn't intended to touch this table. However,
user may run vacuum manually in order to fill visibility map and get benefits
of index-only scans. Then ambulkdelete wouldn't be called for indexes
of such table (because no heap tuples were deleted), only amvacuumcleanup would
be called In this case, amvacuumcleanup would perform full index scan for
two objectives: put recyclable pages into free space map and update index
statistics.

This patch allows btvacuumclanup to skip full index scan when two conditions
are satisfied: no pages are going to be put into free space map and index
statistics isn't stalled. In order to check first condition, we store
oldest btpo_xact in the meta-page. When it's precedes RecentGlobalXmin, then
there are some recyclable pages. In order to check second condition we store
number of heap tuples observed during previous full index scan by cleanup.
If fraction of newly inserted tuples is less than
vacuum_cleanup_index_scale_factor, then statistics isn't considered to be
stalled. vacuum_cleanup_index_scale_factor can be defined as both reloption and GUC (default).

This patch bumps B-tree meta-page version. Upgrade of meta-page is performed
"on the fly": during VACUUM meta-page is rewritten with new version. No special
handling in pg_upgrade is required.

Author: Masahiko Sawada, Alexander Korotkov
Review by: Peter Geoghegan, Kyotaro Horiguchi, Alexander Korotkov, Yura Sokolov
Discussion: https://www.postgresql.org/message-id/flat/CAD21AoAX+d2oD_nrd9O2YkpzHaFr=uQeGr9s1rKC3O4ENc568g@mail.gmail.com
2018-04-04 19:29:00 +03:00
Tom Lane 0b11a674fb Fix a boatload of typos in C comments.
Justin Pryzby

Discussion: https://postgr.es/m/20180331105640.GK28454@telsasoft.com
2018-04-01 15:01:28 -04:00
Andrew Dunstan 2b27273435 Optimize btree insertions for common case of increasing values
Remember the last page of an index insert if it's the rightmost leaf
page. If the next entry belongs on and can fit in the remembered page,
insert the new entry there as long as we can get a lock on the page.
Otherwise, fall back on the more expensive method of searching for
the right place to insert the entry.

This provides a performance improvement for the common case where an
index entry is for monotonically increasing or nearly monotonically
increasing value such as an identity field or a current timestamp.

Pavan Deolasee
Reviewed by Claudio Freire, Simon Riggs and Peter Geoghegan

Discussion: https://postgr.es/m/CABOikdM9DrupjyKZZFM5k8-0RCDs1wk6JzEkg7UgSW6QzOwMZw@mail.gmail.com
2018-03-26 22:39:24 +10:30
Tom Lane d79e7e92bf Remove redundant IndexTupleDSize macro.
Use IndexTupleSize everywhere, instead.  Also, remove IndexTupleSize's
internal typecast, as that's not really needed and might mask coding
errors.  Change some pointer variable datatypes in the call sites
to compensate for that and make it clearer what we're assuming.

Ildar Musin, Robert Haas, Stephen Frost

Discussion: https://postgr.es/m/0274288e-9e88-13b6-c61c-7b36928bf221@postgrespro.ru
2018-02-28 19:25:54 -05:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Peter Eisentraut 2eb4a831e5 Change TRUE/FALSE to true/false
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources.  The upper case spellings are only used
in some files/modules.  So standardize on the standard spellings.

The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.

In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
2017-11-08 11:37:28 -05:00
Tom Lane 4c11d2c559 Flag index metapages as standard-format in xlog.c calls.
btree, hash, and bloom indexes all set up their metapages in standard
format (that is, with pd_lower and pd_upper correctly delimiting the
unused area); but they mostly didn't inform the xlog routines of this.
When calling log_newpage[_buffer], this is bad because it loses the
opportunity to compress unused data out of the WAL record.  When
calling XLogRegisterBuffer, it's not such a performance problem because
all of these call sites also use REGBUF_WILL_INIT, preventing an FPI
image from being written.  But it's still a good idea to provide the
flag when relevant, because that aids WAL consistency checking.

This completes the project of getting all the in-core index AMs to
handle their metapage WAL operations similarly.

Amit Kapila, reviewed by Michael Paquier

Discussion: https://postgr.es/m/0d273805-0e9e-ec1a-cb84-d4da400b8f85@lab.ntt.co.jp
2017-11-03 16:31:32 -04:00
Heikki Linnakangas 0c504a80cf Remove dedicated B-tree root-split record types.
Since commit 40dae7ec53, which changed the way b-tree page splitting
works, there has been no difference in the handling of root, and non-root
split WAL records. We don't need to distinguish them anymore

If you're worried about the loss of debugging information, note that
usually a root split record will normally be followed by a WAL record to
create the new root page. The root page will also have the BTP_ROOT flag
set on the page itself, and there is a pointer to it from the metapage.

Author: Aleksander Alekseev
Discussion: https://www.postgresql.org/message-id/20170406122116.GA11081@e733.localdomain
2017-08-16 12:24:40 +03:00
Tom Lane 382ceffdf7 Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.

By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis.  However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent.  That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.

This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:35:54 -04:00
Tom Lane c7b8998ebb Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.

Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code.  The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there.  BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs.  So the
net result is that in about half the cases, such comments are placed
one tab stop left of before.  This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.

Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:19:25 -04:00
Robert Haas 8da9a22636 Split index xlog headers from other private index headers.
The xlog-specific headers need to be included in both frontend code -
specifically, pg_waldump - and the backend, but the remainder of the
private headers for each index are only needed by the backend.  By
splitting the xlog stuff out into separate headers, pg_waldump pulls
in fewer backend headers, which is a good thing.

Patch by me, reviewed by Michael Paquier and Andres Freund, per a
complaint from Dilip Kumar.

Discussion: http://postgr.es/m/CA+TgmoZ=F=GkxV0YEv-A8tb+AEGy_Qa7GSiJ8deBKFATnzfEug@mail.gmail.com
2017-02-14 15:37:59 -05:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Robert Haas 4bc424b968 pgindent run for 9.6 2016-06-09 18:02:36 -04:00
Kevin Grittner a343e223a5 Revert no-op changes to BufferGetPage()
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature.  Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming).  The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.

This change should have little or no effect on generated executable
code.

Motivated by the back-patching pain of Tom Lane and Robert Haas
2016-04-20 08:31:19 -05:00
Kevin Grittner 848ef42bb8 Add the "snapshot too old" feature
This feature is controlled by a new old_snapshot_threshold GUC.  A
value of -1 disables the feature, and that is the default.  The
value of 0 is just intended for testing.  Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect.  The xmin associated with a transaction ID does
still protect dead tuples.  A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.

This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
2016-04-08 14:36:30 -05:00
Kevin Grittner 8b65cf4c5e Modify BufferGetPage() to prepare for "snapshot too old" feature
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in.  It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point.  This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third.  The follow-on patch will change the places where the test
needs to be made.
2016-04-08 14:30:10 -05:00
Teodor Sigaev 8b99edefca Revert CREATE INDEX ... INCLUDING ...
It's not ready yet, revert two commits
690c543550 - unstable test output
386e3d7609 - patch itself
2016-04-08 21:52:13 +03:00
Teodor Sigaev 386e3d7609 CREATE INDEX ... INCLUDING (column[, ...])
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.

Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
2016-04-08 19:45:59 +03:00
Kevin Grittner fcff8a5751 Detect SSI conflicts before reporting constraint violations
While prior to this patch the user-visible effect on the database
of any set of successfully committed serializable transactions was
always consistent with some one-at-a-time order of execution of
those transactions, the presence of declarative constraints could
allow errors to occur which were not possible in any such ordering,
and developers had no good workarounds to prevent user-facing
errors where they were not necessary or desired.  This patch adds
a check for serialization failure ahead of duplicate key checking
so that if a developer explicitly (redundantly) checks for the
pre-existing value they will get the desired serialization failure
where the problem is caused by a concurrent serializable
transaction; otherwise they will get a duplicate key error.

While it would be better if the reads performed by the constraints
could count as part of the work of the transaction for
serialization failure checking, and we will hopefully get there
some day, this patch allows a clean and reliable way for developers
to work around the issue.  In many cases existing code will already
be doing the right thing for this to "just work".

Author: Thomas Munro, with minor editing of docs by me
Reviewed-by: Marko Tiikkaja, Kevin Grittner
2016-04-07 11:12:35 -05:00
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Bruce Momjian 807b9e0dff pgindent run for 9.5 2015-05-23 21:35:49 -04:00
Andres Freund 168d5805e4 Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint.  DO NOTHING avoids the
constraint violation, without touching the pre-existing row.  DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed.  The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.

This feature is often referred to as upsert.

This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert.  If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made.  If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.

To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.

Bumps catversion as stored rules change.

Author: Peter Geoghegan, with significant contributions from Heikki
    Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
    Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:43:10 +02:00
Kevin Grittner 2ed5b87f96 Reduce pinning and buffer content locking for btree scans.
Even though the main benefit of the Lehman and Yao algorithm for
btrees is that no locks need be held between page reads in an
index search, we were holding a buffer pin on each leaf page after
it was read until we were ready to read the next one.  The reason
was so that we could treat this as a weak lock to create an
"interlock" with vacuum's deletion of heap line pointers, even
though our README file pointed out that this was not necessary for
a scan using an MVCC snapshot.

The main goal of this patch is to reduce the blocking of vacuum
processes by in-progress btree index scans (including a cursor
which is idle), but the code rearrangement also allows for one
less buffer content lock to be taken when a forward scan steps from
one page to the next, which results in a small but consistent
performance improvement in many workloads.

This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits.  These unchanged cases are when a scan uses a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged.  If later patches allow  all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page.

Reviewed by Heikki Linnakangas and Kyotaro Horiguchi
2015-03-25 14:24:43 -05:00
Stephen Frost 804b6b6db4 Fix column-privilege leak in error-message paths
While building error messages to return to the user,
BuildIndexValueDescription, ExecBuildSlotValueDescription and
ri_ReportViolation would happily include the entire key or entire row in
the result returned to the user, even if the user didn't have access to
view all of the columns being included.

Instead, include only those columns which the user is providing or which
the user has select rights on.  If the user does not have any rights
to view the table or any of the columns involved then no detail is
provided and a NULL value is returned from BuildIndexValueDescription
and ExecBuildSlotValueDescription.  Note that, for key cases, the user
must have access to all of the columns for the key to be shown; a
partial key will not be returned.

Further, in master only, do not return any data for cases where row
security is enabled on the relation and row security should be applied
for the user.  This required a bit of refactoring and moving of things
around related to RLS- note the addition of utils/misc/rls.c.

Back-patch all the way, as column-level privileges are now in all
supported versions.

This has been assigned CVE-2014-8161, but since the issue and the patch
have already been publicized on pgsql-hackers, there's no point in trying
to hide this commit.
2015-01-28 12:31:30 -05:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Heikki Linnakangas 2c03216d83 Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.

There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.

This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.

For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.

The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.

Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 18:46:41 +02:00
Heikki Linnakangas 2076db2aea Move the backup-block logic from XLogInsert to a new file, xloginsert.c.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.

Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.

Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-11-06 13:55:36 +02:00
Heikki Linnakangas aae7af3df8 Remove dead InRecovery check.
With the new B-tree incomplete split handling in 9.4, _bt_insert_parent is
never called in recovery.
2014-09-11 22:43:56 +03:00
Heikki Linnakangas c91a9b5a28 Fix backup-block numbering in redo of b-tree split.
I got the backup block numbers off-by-one in the commit that changed the
way incomplete-splits are handled. I blame the comments, which said
"backup block 1" and "backup block 2", even though the backup blocks
are numbered starting from 0, in the macros and functions used in replay.
Fix the comments and the code.

Per Jeff Janes' bug report about corruption caused by torn page writes.
The incorrect code is new in git master, but backpatch the comment change
down to 9.0, where the numbering in the redo-side macros  was changed.
2014-05-19 13:28:04 +03:00
Bruce Momjian 0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not effect backpatching.
2014-05-06 12:12:18 -04:00
Heikki Linnakangas a4ad9afec2 Update obsolete comments.
We no longer have a TLI field in the page header.
2014-04-23 14:41:51 +03:00
Heikki Linnakangas 4a5d55ec2b Fix bug in the new B-tree incomplete-split code.
Forgot to update LSN of left sibling's page, when creating a new root.
I fixed this for regular insertions and page splits earlier, but missed
new root creation.
2014-04-22 22:40:44 +03:00
Heikki Linnakangas 877b088785 Avoid allocations in critical sections.
If a palloc in a critical section fails, it becomes a PANIC.
2014-04-04 13:35:44 +03:00
Heikki Linnakangas 954523cdfe Fix bug in the new B-tree incomplete-split code.
Inserting a downlink to an internal page clears the incomplete-split flag
of the child's left sibling, so the left sibling's LSN also needs to be
updated.
2014-04-01 19:19:47 +03:00
Alvaro Herrera f88d4cfc9d Setup error context callback for transaction lock waits
With this in place, a session blocking behind another one because of
tuple locks will get a context line mentioning the relation name, tuple
TID, and operation being done on tuple.  For example:

LOG:  process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms
DETAIL:  Process holding the lock: 11366. Wait queue: 11367.
CONTEXT:  while updating tuple (0,2) in relation "foo"
STATEMENT:  UPDATE foo SET value = 3;

Most usefully, the new line is displayed by log entries due to
log_lock_waits, although of course it will be printed by any other log
message as well.

Author: Christian Kruse, some tweaks by Álvaro Herrera
Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
2014-03-19 15:10:36 -03:00
Heikki Linnakangas 40dae7ec53 Make the handling of interrupted B-tree page splits more robust.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.

Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.

I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.

Reviewed by Peter Geoghegan.
2014-03-18 20:50:44 +02:00
Heikki Linnakangas 00976f202c btbuild no longer calls _bt_doinsert(), update comment.
Peter Geoghegan
2014-02-26 18:49:04 +02:00
Tom Lane ac8bc3b6e4 Remove unnecessary relcache flushes after changing btree metapages.
These flushes were added in my commit d2896a9ed, which added the btree
logic that keeps a cached copy of the index metapage data in index relcache
entries.  The idea was to ensure that other backends would promptly update
their cached copies after a change.  However, this is not really necessary,
since _bt_getroot() has adequate defenses against believing a stale root
page link, and _bt_getrootheight() doesn't have to be 100% right.
Moreover, if it were necessary, a relcache flush would be an unreliable way
to do it, since the sinval mechanism believes that relcache flush requests
represent transactional updates, and therefore discards them on transaction
rollback.  Therefore, we might as well drop these flush requests and save
the time to rebuild the whole relcache entry after a metapage change.

If we ever try to support in-place truncation of btree indexes, it might
be necessary to revisit this issue so that _bt_getroot() can't get caught
by trying to follow a metapage link to a page that no longer exists.
A possible solution to that is to make use of an smgr, rather than
relcache, inval request to force other backends to discard their cached
metapages.  But for the moment this is not worth pursuing.
2014-02-05 13:43:46 -05:00
Tom Lane ac4ef637ad Allow use of "z" flag in our printf calls, and use it where appropriate.
Since C99, it's been standard for printf and friends to accept a "z" size
modifier, meaning "whatever size size_t has".  Up to now we've generally
dealt with printing size_t values by explicitly casting them to unsigned
long and using the "l" modifier; but this is really the wrong thing on
platforms where pointers are wider than longs (such as Win64).  So let's
start using "z" instead.  To ensure we can do that on all platforms, teach
src/port/snprintf.c to understand "z", and add a configure test to force
use of that implementation when the platform's version doesn't handle "z".

Having done that, modify a bunch of places that were using the
unsigned-long hack to use "z" instead.  This patch doesn't pretend to have
gotten everyplace that could benefit, but it catches many of them.  I made
an effort in particular to ensure that all uses of the same error message
text were updated together, so as not to increase the number of
translatable strings.

It's possible that this change will result in format-string warnings from
pre-C99 compilers.  We might have to reconsider if there are any popular
compilers that will warn about this; but let's start by seeing what the
buildfarm thinks.

Andres Freund, with a little additional work by me
2014-01-23 17:18:33 -05:00
Bruce Momjian 7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Jeff Davis b8fd1a09f3 Add buffer_std flag to MarkBufferDirtyHint().
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.

In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.

Back-patch to 9.3.
2013-06-17 08:02:12 -07:00
Simon Riggs 96ef3b8ff1 Allow I/O reliability checks using 16-bit checksums
Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again. Hint bit setting will
require full page write when block is dirtied, which causes various
infrastructure changes. Extensive comments, docs and README.

WARNING message thrown if checksum fails on non-all zeroes page;
ERROR thrown but can be disabled with ignore_checksum_failure = on.

Feature enabled by an initdb option, since transition from option off
to option on is long and complex and has not yet been implemented.
Default is not to use checksums.

Checksum used is WAL CRC-32 truncated to 16-bits.

Simon Riggs, Jeff Davis, Greg Smith
Wide input and assistance from many community members. Thank you.
2013-03-22 13:54:07 +00:00
Simon Riggs bb7cc2623f Remove PageSetTLI and rename pd_tli to pd_checksum
Remove use of PageSetTLI() from all page manipulation functions
and adjust README to indicate change in the way we make changes
to pages. Repurpose those bytes into the pd_checksum field and
explain how that works in comments about page header.

Refactoring ahead of actual feature patch which would make use
of the checksum field, arriving later.

Jeff Davis, with comments and doc changes by Simon Riggs
Direction suggested by Robert Haas; many others providing
review comments.
2013-03-18 13:46:42 +00:00
Tom Lane 991f3e5ab3 Provide database object names as separate fields in error messages.
This patch addresses the problem that applications currently have to
extract object names from possibly-localized textual error messages,
if they want to know for example which index caused a UNIQUE_VIOLATION
failure.  It adds new error message fields to the wire protocol, which
can carry the name of a table, table column, data type, or constraint
associated with the error.  (Since the protocol spec has always instructed
clients to ignore unrecognized field types, this should not create any
compatibility problem.)

Support for providing these new fields has been added to just a limited set
of error reports (mainly, those in the "integrity constraint violation"
SQLSTATE class), but we will doubtless add them to more calls in future.

Pavel Stehule, reviewed and extensively revised by Peter Geoghegan, with
additional hacking by Tom Lane.
2013-01-29 17:08:26 -05:00
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Bruce Momjian e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Bruce Momjian 6416a82a62 Remove unnecessary #include references, per pgrminclude script. 2011-09-01 10:04:27 -04:00
Tom Lane d64713df7e Pass collations to functions in FunctionCallInfoData, not FmgrInfo.
Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings.  Fix by passing it in FunctionCallInfoData
instead.  In particular this allows a clean fix for bug #5970 (record_cmp
not working).  This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...
2011-04-12 19:19:24 -04:00
Bruce Momjian bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Heikki Linnakangas dafaa3efb7 Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.

To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.

A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.

Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.

We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.

Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.

Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-08 00:09:08 +02:00
Bruce Momjian 5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Robert Haas 5f7b58fad8 Generalize concept of temporary relations to "relation persistence".
This commit replaces pg_class.relistemp with pg_class.relpersistence;
and also modifies the RangeVar node type to carry relpersistence rather
than istemp.  It also removes removes rd_istemp from RelationData and
instead performs the correct computation based on relpersistence.

For clarity, we add three new macros: RelationNeedsWAL(),
RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we
can clarify the purpose of each check that previous depended on
rd_istemp.

This is intended as infrastructure for the upcoming unlogged tables
patch, as well as for future possible work on global temporary tables.
2010-12-13 12:34:26 -05:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Tom Lane 8fa30f906b Reduce PANIC to ERROR in some occasionally-reported btree failure cases.
This patch changes _bt_split() and _bt_pagedel() to throw a plain ERROR,
rather than PANIC, for several cases that are reported from the field
from time to time:
* right sibling's left-link doesn't match;
* PageAddItem failure during _bt_split();
* parent page's next child isn't right sibling during _bt_pagedel().
In addition the error messages for these cases have been made a bit
more verbose, with additional values included.

The original motivation for PANIC here was to capture core dumps for
subsequent analysis.  But with so many users whose platforms don't capture
core dumps by default, or who are unprepared to analyze them anyway, it's hard
to justify a forced database restart when we can fairly easily detect the
problems before we've reached the critical sections where PANIC would be
necessary.  It is not currently known whether the reports of these messages
indicate well-hidden bugs in Postgres, or are a result of storage-level
malfeasance; the latter possibility suggests that we ought to try to be more
robust even if there is a bug here that's ultimately found.

Backpatch to 8.2.  The code before that is sufficiently different that
it doesn't seem worth the trouble to back-port further.
2010-08-29 19:33:14 +00:00
Simon Riggs a760893dbd Derive latestRemovedXid for btree deletes by reading heap pages. The
WAL record for btree delete contains a list of tids, even when backup
blocks are present. We follow the tids to their heap tuples, taking
care to follow LP_REDIRECT tuples. We ignore LP_DEAD tuples on the
understanding that they will always have xmin/xmax earlier than any
LP_NORMAL tuples referred to by killed index tuples. Iff all tuples
are LP_DEAD we return InvalidTransactionId. The heap relfilenode is
added to the WAL record, requiring API changes to pass down the heap
Relation. XLOG_PAGE_MAGIC updated.
2010-03-28 09:27:02 +00:00
Bruce Momjian 65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Bruce Momjian 0239800893 Update copyright for the year 2010. 2010-01-02 16:58:17 +00:00
Simon Riggs efc16ea520 Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.

New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.

This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.

Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.

Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 01:32:45 +00:00
Tom Lane e66d714386 Make sure that GIN fast-insert and regular code paths enforce the same
tuple size limit.  Improve the error message for index-tuple-too-large
so that it includes the actual size, the limit, and the index name.
Sync with the btree occurrences of the same error.

Back-patch to 8.4 because it appears that the out-of-sync problem
is occurring in the field.

Teodor and Tom
2009-10-02 21:14:04 +00:00
Tom Lane 527f0ae3fa Department of second thoughts: let's show the exact key during unique index
build failures, too.  Refactor a bit more since that error message isn't
spelled the same.
2009-08-01 20:59:17 +00:00
Tom Lane b680ae4bdb Improve unique-constraint-violation error messages to include the exact
values being complained of.

In passing, also remove the arbitrary length limitation in the similar
error detail message for foreign key violations.

Itagaki Takahiro
2009-08-01 19:59:41 +00:00
Tom Lane 25d9bf2e3e Support deferrable uniqueness constraints.
The current implementation fires an AFTER ROW trigger for each tuple that
looks like it might be non-unique according to the index contents at the
time of insertion.  This works well as long as there aren't many conflicts,
but won't scale to massive unique-key reassignments.  Improving that case
is a TODO item.

Dean Rasheed
2009-07-29 20:56:21 +00:00