Commit Graph

3523 Commits

Author SHA1 Message Date
Alvaro Herrera 8b21b416ed Avoid spurious deadlocks when upgrading a tuple lock
This puts back reverted commit de87a084c0, with some bug fixes.

When two (or more) transactions are waiting for transaction T1 to release a
tuple-level lock, and transaction T1 upgrades its lock to a higher level, a
spurious deadlock can be reported among the waiting transactions when T1
finishes.  The simplest example case seems to be:

T1: select id from job where name = 'a' for key share;
Y: select id from job where name = 'a' for update; -- starts waiting for T1
Z: select id from job where name = 'a' for key share;
T1: update job set name = 'b' where id = 1;
Z: update job set name = 'c' where id = 1; -- starts waiting for T1
T1: rollback;

At this point, transaction Y is rolled back on account of a deadlock: Y
holds the heavyweight tuple lock and is waiting for the Xmax to be released,
while Z holds part of the multixact and tries to acquire the heavyweight
lock (per protocol) and goes to sleep; once T1 releases its part of the
multixact, Z is awakened only to be put back to sleep on the heavyweight
lock that Y is holding while sleeping.  Kaboom.

This can be avoided by having Z skip the heavyweight lock acquisition.  As
far as I can see, the biggest downside is that if there are multiple Z
transactions, the order in which they resume after T1 finishes is not
guaranteed.

Backpatch to 9.6.  The patch applies cleanly on 9.5, but the new tests don't
work there (because isolationtester is not smart enough), so I'm not going
to risk it.

Author: Oleksii Kliukin
Discussion: https://postgr.es/m/B9C9D7CD-EB94-4635-91B6-E558ACEC0EC3@hintbits.com
Discussion: https://postgr.es/m/2815.1560521451@sss.pgh.pa.us
2019-06-18 18:23:16 -04:00
Michael Paquier 3412030205 Fix more typos and inconsistencies in the tree
Author: Alexander Lakhin
Discussion: https://postgr.es/m/0a5419ea-1452-a4e6-72ff-545b1a5a8076@gmail.com
2019-06-17 16:13:16 +09:00
Alvaro Herrera 9d20b0ec8f Revert "Avoid spurious deadlocks when upgrading a tuple lock"
This reverts commits 3da73d6839 and de87a084c0.

This code has some tricky corner cases that I'm not sure are correct and
not properly tested anyway, so I'm reverting the whole thing for next
week's releases (reintroducing the deadlock bug that we set to fix).
I'll try again afterwards.

Discussion: https://postgr.es/m/E1hbXKQ-0003g1-0C@gemulon.postgresql.org
2019-06-16 22:24:21 -04:00
Alvaro Herrera 3da73d6839 Silence compiler warning
Introduced in de87a084c0.
2019-06-14 11:33:40 -04:00
Michael Paquier f43608bda2 Fix typos and inconsistencies in code comments
Author: Alexander Lakhin
Discussion: https://postgr.es/m/dec6aae8-2d63-639f-4d50-20e229fb83e3@gmail.com
2019-06-14 09:34:34 +09:00
Alvaro Herrera de87a084c0 Avoid spurious deadlocks when upgrading a tuple lock
When two (or more) transactions are waiting for transaction T1 to release a
tuple-level lock, and transaction T1 upgrades its lock to a higher level, a
spurious deadlock can be reported among the waiting transactions when T1
finishes.  The simplest example case seems to be:

T1: select id from job where name = 'a' for key share;
Y: select id from job where name = 'a' for update; -- starts waiting for X
Z: select id from job where name = 'a' for key share;
T1: update job set name = 'b' where id = 1;
Z: update job set name = 'c' where id = 1; -- starts waiting for X
T1: rollback;

At this point, transaction Y is rolled back on account of a deadlock: Y
holds the heavyweight tuple lock and is waiting for the Xmax to be released,
while Z holds part of the multixact and tries to acquire the heavyweight
lock (per protocol) and goes to sleep; once X releases its part of the
multixact, Z is awakened only to be put back to sleep on the heavyweight
lock that Y is holding while sleeping.  Kaboom.

This can be avoided by having Z skip the heavyweight lock acquisition.  As
far as I can see, the biggest downside is that if there are multiple Z
transactions, the order in which they resume after X finishes is not
guaranteed.

Backpatch to 9.6.  The patch applies cleanly on 9.5, but the new tests don't
work there (because isolationtester is not smart enough), so I'm not going
to risk it.

Author: Oleksii Kliukin
Discussion: https://postgr.es/m/B9C9D7CD-EB94-4635-91B6-E558ACEC0EC3@hintbits.com
2019-06-13 17:28:24 -04:00
Andres Freund fff2a7d7bd Don't access catalogs to validate GUCs when not connected to a DB.
Vignesh found this bug in the check function for
default_table_access_method's check hook, but that was just copied
from older GUCs. Investigation by Michael and me then found the bug in
further places.

When not connected to a database (e.g. in a walsender connection), we
cannot perform (most) GUC checks that need database access. Even when
only shared tables are needed, unless they're
nailed (c.f. RelationCacheInitializePhase2()), they cannot be accessed
without pg_class etc. being present.

Fix by extending the existing IsTransactionState() checks to also
check for MyDatabaseOid.

Reported-By: Vignesh C, Michael Paquier, Andres Freund
Author: Vignesh C, Andres Freund
Discussion: https://postgr.es/m/CALDaNm1KXK9gbZfY-p_peRFm_XrBh1OwQO1Kk6Gig0c0fVZ2uw%40mail.gmail.com
Backpatch: 9.4-
2019-06-10 23:34:50 -07:00
Noah Misch 31d250e049 Update stale comments, and fix comment typos. 2019-06-08 10:12:26 -07:00
Amit Kapila 92c4abc736 Fix assorted inconsistencies.
There were a number of issues in the recent commits which include typos,
code and comments mismatch, leftover function declarations.  Fix them.

Reported-by: Alexander Lakhin
Author: Alexander Lakhin, Amit Kapila and Amit Langote
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/ef0c0232-0c1d-3a35-63d4-0ebd06e31387@gmail.com
2019-06-08 08:16:38 +05:30
Alvaro Herrera e8bdea58f9 Fix message style
Mark one message not for translation, and prefer "cannot" over "may
not", per commentary from Robert Haas.

Discussion: https://postgr.es/m/20190430145813.GA29872@alvherre.pgsql
2019-06-06 12:57:57 -04:00
Michael Paquier 1fb6f62a84 Fix typos in various places
Author: Andrea Gelmini
Reviewed-by: Michael Paquier, Justin Pryzby
Discussion: https://postgr.es/m/20190528181718.GA39034@glet
2019-06-03 13:44:03 +09:00
Alvaro Herrera d22f885f89 Fix double-phrase typo in message
New in 147e3722f7.
2019-05-31 10:08:37 -04:00
Amit Kapila 9679345f3c Fix typos.
Reported-by: Alexander Lakhin
Author: Alexander Lakhin
Reviewed-by: Amit Kapila and Tom Lane
Discussion: https://postgr.es/m/7208de98-add8-8537-91c0-f8b089e2928c@gmail.com
2019-05-26 18:28:18 +05:30
Andres Freund 73b8c3bd28 tableam: Rename wrapper functions to match callback names.
Some of the wrapper functions didn't match the callback names. Many of
them due to staying "consistent" with historic naming of the wrapped
functionality. We decided that for most cases it's more important to
be for tableam to be consistent going forward, than with the past.

The one exception is beginscan/endscan/...  because it'd have looked
odd to have systable_beginscan/endscan/... with a different naming
scheme, and changing the systable_* APIs would have caused way too
much churn (including breaking a lot of external users).

Author: Ashwin Agrawal, with some small additions by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CALfoeiugyrXZfX7n0ORCa4L-m834dzmaE8eFdbNR6PMpetU4Ww@mail.gmail.com
2019-05-23 16:32:36 -07:00
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Robert Haas 1171d7d585 tableam: Move heap-specific logic from needs_toast_table below tableam.
This allows table AMs to completely suppress TOAST table creation, or
to modify the conditions under which they are created.

Patch by me.  Reviewed by Andres Freund.

Discussion: http://postgr.es/m/CA+Tgmoa4O2n=yphqD2pERUnYmUO84bH1SqMsA-nSxBGsZ7gWfA@mail.gmail.com
2019-05-21 11:57:13 -04:00
Fujii Masao b8e2170e40 Fix comment for issue_xlog_fsync().
"segno" is the argument for the function, not "log" and "seg".

Author: Antonin Houska
Discussion: https://postgr.es/m/11863.1558361020@spoje.net
2019-05-21 00:44:00 +09:00
Andres Freund c3b23ae457 Don't to predicate lock for analyze scans, refactor scan option passing.
Before this commit, when ANALYZE was run on a table and serializable
was used (either by virtue of an explicit BEGIN TRANSACTION ISOLATION
LEVEL SERIALIZABLE, or default_transaction_isolation being set to
serializable) a null pointer dereference lead to a crash.

The analyze scan doesn't need a snapshot (nor predicate locking), but
before this commit a scan only contained information about being a
bitmap or sample scan.

Refactor the option passing to the scan_begin callback to use a
bitmask instead. Alternatively we could have added a new boolean
parameter, but that seems harder to read. Even before this issue
various people (Heikki, Tom, Robert) suggested doing so.

These changes don't change the scan APIs outside of tableam. The flags
argument could be exposed, it's not necessary to fix this
problem. Also the wrapper table_beginscan* functions encapsulate most
of that complexity.

After these changes fixing the bug is trivial, just don't acquire
predicate lock for analyze style scans. That was already done for
bitmap heap scans.  Add an assert that a snapshot is passed when
acquiring the predicate lock, so this kind of bug doesn't require
running with serializable.

Also add a comment about sample scans currently requiring predicate
locking the entire relation, that previously wasn't remarked upon.

Reported-By: Joe Wildish
Author: Andres Freund
Discussion:
    https://postgr.es/m/4EA80A20-E9BF-49F1-9F01-5B66CAB21453@elusive.cx
    https://postgr.es/m/20190411164947.nkii4gaeilt4bui7@alap3.anarazel.de
    https://postgr.es/m/20190518203102.g7peu2fianukjuxm@alap3.anarazel.de
2019-05-19 15:10:28 -07:00
Tom Lane d307954a7d "A void function may not return a value".
Per buildfarm.
2019-05-18 00:40:39 -04:00
Andres Freund 147e3722f7 tableam: Avoid relying on relation size to determine validity of tids.
Instead add a tableam callback to do so. To avoid adding per
validation overhead, pass a scan to tuple_tid_valid. In heap's case
we'd otherwise incurred a RelationGetNumberOfBlocks() call for each
tid - which'd have added noticable overhead to nodeTidscan.c.

Author: Andres Freund
Reviewed-By: Ashwin Agrawal
Discussion: https://postgr.es/m/20190515185447.gno2jtqxyktylyvs@alap3.anarazel.de
2019-05-17 18:56:55 -07:00
Andres Freund 7f44ede594 tableam: Don't assume that every AM uses md.c style storage.
Previously various parts of the code routed size requests through
RelationGetNumberOfBlocks[InFork]. That works if md.c is used by the
AM, but not otherwise.

Add a tableam callback to return the size of the table. As not every
AM will use postgres' BLCKSZ, have it return bytes, and have
RelationGetNumberOfBlocksInFork() round the byte size up into blocks.

To allow code outside of the AM to determine the actual relation size
map InvalidForkNumber the total size of a relation, as not every AM
might just need the postgres defined forks.

A few users of RelationGetNumberOfBlocks() ought to be converted away
from that. One case, the use of it to determine whether a tid is
valid, will be fixed in a follow up commit. Others will have to wait
for v13.

Author: Andres Freund
Discussion: https://postgr.es/m/20190423225201.3bbv6tbqzkb5w7cw@alap3.anarazel.de
2019-05-17 18:56:47 -07:00
Peter Geoghegan 3f58cc6dd8 Remove extra nbtree half-dead internal page check.
It's not safe for nbtree VACUUM to attempt to delete a target page whose
right sibling is already half-dead, since that would fail the
cross-check when VACUUM attempts to re-find a downlink to the right
sibling in the parent page.  Logic to prevent this from happening was
added by commit 8da3183780, which addressed a bug in the overhaul of
page deletion that went into PostgreSQL 9.4 (commit efada2b8e9).
VACUUM was made to check the right sibling page, and back off when it
happened to be half-dead already.

However, it is only truly necessary to do the right sibling check on the
leaf level, since that transitively determines if the deletion target's
parent's right sibling page is itself undergoing deletion.  Remove the
internal page level check, and add a comment explaining why the leaf
level check alone suffices.

The extra check is also unnecessary due to the fact that internal pages
that are marked half-dead are generally considered corrupt.  Commit
efada2b8e9 established the principle that there should never be
half-dead internal pages (internal pages pending deletion are possible,
but that status is never directly represented in the internal page).
VACUUM will complain about corruption when it encounters half-dead
internal pages, so VACUUM is bound to raise an error one way or another
when an nbtree index has a half-dead internal page (contrib/amcheck will
also report that the page is corrupt).

It's possible that a pg_upgrade'd 9.3 database will still have half-dead
internal pages, so it may seem like there is an argument for leaving the
check in place to reliably get a cleaner error message that advises the
user to REINDEX.  However, leaf pages are also deleted in the first
phase of deletion prior to PostgreSQL 9.4, so I believe we won't even
attempt to re-find the parent page anyway (we won't have the fully
deleted leaf page as the right sibling of our target page, so we won't
even try to find a downlink for it).

Discussion: https://postgr.es/m/CAH2-Wzm_ntmqJjWLRyKzimFmFvk+BnVAvUpaA4s1h9Ja58woaQ@mail.gmail.com
2019-05-16 15:11:58 -07:00
Peter Geoghegan 489e431ba5 Remove obsolete nbtree insertion comment.
Remove a Berkeley-era comment above _bt_insertonpg() that admonishes the
reader to grok Lehman and Yao's paper before making any changes.  This
made a certain amount of sense back when _bt_insertonpg() was
responsible for most of the things that are now spread across
_bt_insertonpg(), _bt_findinsertloc(), _bt_insert_parent(), and
_bt_split(), but it doesn't work like that anymore.

I believe that this comment alludes to the need to "couple" or "crab"
buffer locks as we ascend the tree as page splits cascade upwards.  The
nbtree README already explains this in detail, which seems sufficient.
Besides, the changes to page splits made by commit 40dae7ec53 altered
the exact details of how buffer locks are retained during splits; Lehman
and Yao's original algorithm seems to release the lock on the left child
page/buffer slightly earlier than _bt_insertonpg()/_bt_insert_parent()
can.
2019-05-15 16:53:11 -07:00
Peter Geoghegan 7505da2f45 Reverse order of newitem nbtree candidate splits.
Commit fab25024, which taught nbtree to choose candidate split points
more carefully, had _bt_findsplitloc() record all possible split points
in an initial pass over a page that is about to be split.  The order
that candidate split points were processed and stored in was assumed to
match the offset number order of split points on an imaginary version of
the page that contains the same items as the original, but also fits
newitem (the item that provoked the split precisely because it didn't
fit).

However, the order of split points in the final array was not quite what
was expected: the split point that makes newitem the firstright item
came after the split point that makes newitem the lastleft item -- not
before.  As a result, _bt_findsplitloc() could get confused about the
leftmost and rightmost tuples among all possible split points recorded
for the page.  This seems to have no appreciable impact on the quality
of the final split point chosen by _bt_findsplitloc(), but it's still
wrong.

To fix, switch the order in which newitem candidate splits are recorded
in.  This also makes it possible to describe candidate split points in
terms of which pair of adjoining tuples enclose the split point within
_bt_findsplitloc(), making it clearer why it's generally safe for
_bt_split() to expect lastleft and firstright tuples.
2019-05-15 12:22:07 -07:00
Andres Freund aa4b8c61d2 Handle table_complete_speculative's succeeded argument as documented.
For some reason both callsite and the implementation for heapam had
the meaning inverted (i.e. succeeded == true was passed in case of
conflict). That's confusing.

I (Andres) briefly pondered whether it'd be better to rename
table_complete_speculative's argument to 'bool specConflict' or such,
but decided not to. The 'complete' in the function name for me makes
`succeeded` sound a bit better.

Reported-By: Ashwin Agrawal, Melanie Plageman, Heikki Linnakangas
Discussion:
   https://postgr.es/m/CALfoeitk7-TACwYv3hCw45FNPjkA86RfXg4iQ5kAOPhR+F1Y4w@mail.gmail.com
   https://postgr.es/m/97673451-339f-b21e-a781-998d06b1067c@iki.fi
2019-05-14 12:19:32 -07:00
Heikki Linnakangas 22251686f0 Detect internal GiST page splits correctly during index build.
As we descend the GiST tree during insertion, we modify any downlinks on
the way down to include the new tuple we're about to insert (if they don't
cover it already). Modifying an existing downlink might cause an internal
page to split, if the new downlink tuple is larger than the old one. If
that happens, we need to back up to the parent and re-choose a page to
insert to. We used to detect that situation, thanks to the NSN-LSN
interlock normally used to detect concurrent page splits, but that got
broken by commit 9155580fd5. With that commit, we now use a dummy constant
LSN value for every page during index build, so the LSN-NSN interlock no
longer works. I thought that was OK because there can't be any other
backends modifying the index during index build, but missed that the
insertion itself can modify the page we're inserting to. The consequence
was that we would sometimes insert the new tuple to an incorrect page, one
whose downlink doesn't cover the new tuple.

To fix, add a flag to the stack that keeps track of the state while
descending tree, to indicate that a page was split, and that we need to
retry the descend from the parent.

Thomas Munro first reported that the contrib/intarray regression test was
failing occasionally on the buildfarm after commit 9155580fd5. The failure
was intermittent, because the gistchoose() function is not deterministic,
and would only occasionally create the right circumstances for this bug to
cause the failure.

Patch by Anastasia Lubennikova, with some changes by me to make it work
correctly also when the internal page split also causes the "grandparent"
to be split.

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJRzLo7tZExWfSbwM3XuK7aAK7FhdBV0FLkbUG%2BW0v0zg%40mail.gmail.com
2019-05-14 13:18:44 +03:00
Heikki Linnakangas e95d550bbb Fix comment on when HOT update is possible.
The conditions listed in this comment have changed several times, and at
some point the thing that the "if so" referred to was negated.

The text was OK up to 9.6. It was differently wrong in v10, v11 and
master, so fix in all those versions.
2019-05-14 13:06:28 +03:00
Peter Geoghegan ae7291acbc Standardize ItemIdData terminology.
The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous.  Only
ItemPointerData/ItemPointer variables should be called item pointers.

To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers".  The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.

Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
2019-05-13 15:53:39 -07:00
Peter Geoghegan 9b42e71376 Don't leave behind junk nbtree pages during split.
Commit 8fa30f906b reduced the elevel of a number of "can't happen"
_bt_split() errors from PANIC to ERROR.  At the same time, the new right
page buffer for the split could continue to be acquired well before the
critical section.  This was possible because it was relatively
straightforward to make sure that _bt_split() could not throw an error,
with a few specific exceptions.  The exceptional cases were safe because
they involved specific, well understood errors, making it possible to
consistently zero the right page before actually raising an error using
elog().  There was no danger of leaving around a junk page, provided
_bt_split() stuck to this coding rule.

Commit 8224de4f, which introduced INCLUDE indexes, added code to make
_bt_split() truncate away non-key attributes.  This happened at a point
that broke the rule around zeroing the right page in _bt_split().  If
truncation failed (perhaps due to palloc() failure), that would result
in an errant right page buffer with junk contents.  This could confuse
VACUUM when it attempted to delete the page, and should be avoided on
general principle.

To fix, reorganize _bt_split() so that truncation occurs before the new
right page buffer is even acquired.  A junk page/buffer will not be left
behind if _bt_nonkey_truncate()/_bt_truncate() raise an error.

Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Backpatch: 11-, where INCLUDE indexes were introduced.
2019-05-13 10:27:59 -07:00
Tom Lane 2d7d946cd3 Clean up the behavior and API of catalog.c's is-catalog-relation tests.
The right way for IsCatalogRelation/Class to behave is to return true
for OIDs less than FirstBootstrapObjectId (not FirstNormalObjectId),
without any of the ad-hoc fooling around with schema membership.

The previous code was wrong because (1) it claimed that
information_schema tables were not catalog relations but their toast
tables were, which is silly; and (2) if you dropped and recreated
information_schema, which is a supported operation, the behavior
changed.  That's even sillier.  With this definition, "catalog
relations" are exactly the ones traceable to the postgres.bki data,
which seems like what we want.

With this simplification, we don't actually need access to the pg_class
tuple to identify a catalog relation; we only need its OID.  Hence,
replace IsCatalogClass with "IsCatalogRelationOid(oid)".  But keep
IsCatalogRelation as a convenience function.

This allows fixing some arguably-wrong semantics in contrib/sepgsql and
ReindexRelationConcurrently, which were using an IsSystemNamespace test
where what they really should be using is IsCatalogRelationOid.  The
previous coding failed to protect toast tables of system catalogs, and
also was not on board with the general principle that user-created tables
do not become catalogs just by virtue of being renamed into pg_catalog.
We can also get rid of a messy hack in ReindexMultipleTables.

While we're at it, also rename IsSystemNamespace to IsCatalogNamespace,
because the previous name invited confusion with the more expansive
semantics used by IsSystemRelation/Class.

Also improve the comments in catalog.c.

There are a few remaining places in replication-related code that are
special-casing OIDs below FirstNormalObjectId.  I'm inclined to think
those are wrong too, and if there should be any special case it should
just extend to FirstBootstrapObjectId.  But first we need to debate
whether a FOR ALL TABLES publication should include information_schema.

Discussion: https://postgr.es/m/21697.1557092753@sss.pgh.pa.us
Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us
2019-05-08 23:27:38 -04:00
Peter Geoghegan d95e36dc38 Remove obsolete nbtree split REDO routine comment.
Commit dd299df818, which added suffix truncation to nbtree, simplified
the WAL record format used by page splits.  It became necessary to
explicitly WAL-log the new high key for the left half of a split in all
cases, which relieved the REDO routine from having to reconstruct a new
high key for the left page by copying the first item from the right
page.  Remove a comment that referred to the previous practice.
2019-05-08 12:47:20 -07:00
Heikki Linnakangas 256be1050c Remove leftover reference to old "flat file" mechanism in a comment.
The flat file mechanism was removed in PostgreSQL 9.0.
2019-05-08 09:32:34 +03:00
Peter Geoghegan d65b5ccad6 Correct obsolete nbtsort.c minimum key comment.
It is no longer possible under any circumstances for nbtree code to
reconstruct a strict lower bound key (parent page's pivot tuple key) for
a right sibling page by retrieving the first item in the right sibling
page.
2019-05-07 21:42:12 -07:00
Fujii Masao b84dbc8eb8 Add TRUNCATE parameter to VACUUM.
This commit adds new parameter to VACUUM command, TRUNCATE,
which specifies that VACUUM should attempt to truncate off
any empty pages at the end of the table and allow the disk space
for the truncated pages to be returned to the operating system.

This parameter, if specified, overrides the vacuum_truncate
reloption. If neither the reloption nor the VACUUM option is
used, the default is true, as before.

Author: Fujii Masao
Reviewed-by: Julien Rouhaud, Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoD+qtrSDL=GSma4Wd3kLYLeRC0hPna-YAdkDeV4z156vg@mail.gmail.com
2019-05-08 02:10:33 +09:00
Amit Kapila 7db0cde6b5 Revert "Avoid the creation of the free space map for small heap relations".
This feature was using a process local map to track the first few blocks
in the relation.  The map was reset each time we get the block with enough
freespace.  It was discussed that it would be better to track this map on
a per-relation basis in relcache and then invalidate the same whenever
vacuum frees up some space in the page or when FSM is created.  The new
design would be better both in terms of API design and performance.

List of commits reverted, in reverse chronological order:

06c8a5090e  Improve code comments in b0eaa4c51b.
13e8643bfc  During pg_upgrade, conditionally skip transfer of FSMs.
6f918159a9  Add more tests for FSM.
9c32e4c350  Clear the local map when not used.
29d108cdec  Update the documentation for FSM behavior..
08ecdfe7e5  Make FSM test portable.
b0eaa4c51b  Avoid creation of the free space map for small heap relations.

Discussion: https://postgr.es/m/20190416180452.3pm6uegx54iitbt5@alap3.anarazel.de
2019-05-07 09:30:24 +05:30
Peter Geoghegan 7b37f4b02e Correct more obsolete nbtree page split comments.
Commit 3f342839 corrected obsolete comments about buffer locks at the
main _bt_insert_parent() call site, but missed similar obsolete comments
above _bt_insert_parent() itself.  Both sets of comments were rendered
obsolete by commit 40dae7ec53, which made the nbtree page split
algorithm more robust.  Fix the comments that were missed the first time
around now.

In passing, refine a related _bt_insert_parent() comment about
re-finding the parent page to insert new downlink.
2019-05-03 13:34:45 -07:00
Alvaro Herrera 2bf372a4ae heap_prepare_freeze_tuple: Simplify coding
Commit d2599ecfcc introduced some contorted, confused code around:
readers would think that it's possible for HeapTupleHeaderGetXmin return
a non-frozen value for some frozen tuples, which would be disastrous.
There's no actual bug, but it seems better to make it clearer.

Per gripe from Tom Lane and Andres Freund.
Discussion: https://postgr.es/m/30116.1555430496@sss.pgh.pa.us
2019-05-02 16:14:08 -04:00
Peter Geoghegan 6dd86c269d Fix nbtsort.c's page space accounting.
Commit dd299df818, which made heap TID a tiebreaker nbtree index
column, introduced new rules on page space management to make suffix
truncation safe.  In general, suffix truncation needs to have a small
amount of extra space available on the new left page when splitting a
leaf page.  This is needed in case it turns out that truncation cannot
even "truncate away the heap TID column", resulting in a
larger-than-firstright leaf high key with an explicit heap TID
representation.

Despite all this, CREATE INDEX/nbtsort.c did not account for the
possible need for extra heap TID space on leaf pages when deciding
whether or not a new item could fit on current page.  This could lead to
"failed to add item to the index page" errors when CREATE
INDEX/nbtsort.c tried to finish off a leaf page that lacked space for a
larger-than-firstright leaf high key (it only had space for firstright
tuple, which was just short of what was needed following "truncation").

Several conditions needed to be met all at once for CREATE INDEX to
fail.  The problem was in the hard limit on what will fit on a page,
which tends to be masked by the soft fillfactor-wise limit.  The easiest
way to recreate the problem seems to be a CREATE INDEX on a low
cardinality text column, with tuples that are of non-uniform width,
using a fillfactor of 100.

To fix, bring nbtsort.c in line with nbtsplitloc.c, which already
pessimistically assumes that all leaf page splits will have high keys
that have a heap TID appended.

Reported-By: Andreas Joseph Krogh
Discussion: https://postgr.es/m/VisenaEmail.c5.3ee7fe277d514162.16a6d785bea@tc7-visena
2019-05-02 12:33:35 -07:00
Robert Haas dd69597988 Fix some problems with VACUUM (INDEX_CLEANUP FALSE).
The new nleft_dead_tuples and nleft_dead_itemids fields are confusing
and do not seem like the correct way forward.  One of them is tested
via an assertion that can fail, as it has already done on buildfarm
member topminnow.  Remove the assertion and the fields.

Change the logic for the case where a tuple is not initially pruned
by heap_page_prune but later diagnosed HEAPTUPLE_DEAD by
HeapTupleSatisfiesVacuum.  Previously, tupgone = true was set in
that case, which leads to treating the tuple as one that will be
removed.  In a normal vacuum, that's OK, because we'll remove
index entries for it and then the second heap pass will remove the
tuple itself, but when index cleanup is disabled, those things
don't happen, so we must instead treat it as a recently-dead
tuple that we have voluntarily chosen to keep.

Report and analysis by Tom Lane.  This patch loosely based on one
from Masahiko Sawada, but I changed most of it.
2019-05-02 10:07:13 -04:00
Alvaro Herrera 9f8b717a80 Message style fixes 2019-04-30 10:33:37 -04:00
Alvaro Herrera 9a83afecb7 Widen tuple counter variables from long to int64
Mistake in ab0dfc961b6a; progress reporting would have wrapped around
for indexes created with more than 2^31 tuples.

Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wz=WbNxc5ob5NJ9yqo2RMJ0q4HXDS30GVCobeCvC9A1L9A@mail.gmail.com
2019-04-30 10:27:38 -04:00
Andres Freund 5c1560606d Fix several recently introduced issues around handling new relation forks.
Most of these stem from d25f519107 "tableam: relation creation, VACUUM
FULL/CLUSTER, SET TABLESPACE.".

1) To pass data to the relation_set_new_filenode()
   RelationSetNewRelfilenode() was made to update RelationData.rd_rel
   directly. That's not OK however, as it makes the relcache entries
   temporarily inconsistent. Which among other scenarios is a problem
   if a REINDEX targets an index on pg_class - the
   CatalogTupleUpdate() in RelationSetNewRelfilenode().  Presumably
   that was introduced because other places in the code do so - while
   those aren't "good practice" they don't appear to be actively
   buggy (e.g. because system tables may not be targeted).

   I (Andres) should have caught this while reviewing and signficantly
   evolving the code in that commit, mea culpa.

   Fix that by instead passing in the new RelFileNode as separate
   argument to relation_set_new_filenode() and rely on the relcache to
   update the catalog entry. Also revert that the
   RelationMapUpdateMap() call was changed to immediate, and undo some
   other more unnecessary changes.

2) Document that the relation_set_new_filenode cannot rely on the
   whole relcache entry to be valid. It might be worthwhile to
   refactor the code to never have to rely on that, but given the way
   heap_create() is currently coded, that'd be a large change.

3) ATExecSetTableSpace() shouldn't do FlushRelationBuffers() itself. A
   table AM might not use shared buffers at all. Move to
   index_copy_data() and heapam_relation_copy_data().

4) heapam_relation_set_new_filenode() previously sometimes accessed
   rel->rd_rel->relpersistence rather than the `persistence`
   argument. Code movement mistake.

5) Previously heapam_relation_set_new_filenode() re-opened the smgr
   relation to create the init for, if necesary. Instead have
   RelationCreateStorage() return the SMgrRelation and use it to
   create the init fork.

6) Add a note about the danger of modifying the relcache directly to
   ATExecSetTableSpace() - it's currently not a bug because there's a
   check ERRORing for catalog tables.

Regression tests and assertion improvements that together trigger the
bug described in 1) will be added in a later commit, as there is a
related bug on all branches.

Reported-By: Michael Paquier
Diagnosed-By: Tom Lane and Andres Freund
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20190418011430.GA19133@paquier.xyz
2019-04-29 19:28:05 -07:00
Peter Geoghegan 9ee7414ed0 Remove obsolete _bt_insert_parent() comment.
Remove a comment that refers to a coding practice that was fully removed
by commit a8b8f4db, which introduced MarkBufferDirty().  It looks like
the comment was even obsolete before then, since it concerns
write-ordering dependencies with synchronous buffer writes.
2019-04-29 14:14:38 -07:00
Andres Freund fdc7efcc30 Allow pg_class xid & multixid horizons to not be set.
This allows table AMs that don't need these horizons. This was already
documented in the tableam relation_set_new_filenode callback, but an
assert prevented if from actually working (the test AM code contained
the change itself). Defang the asserts in the general code, and move
the stronger ones into heap AM.

Relatedly, after CLUSTER/VACUUM, we'd always assign a relfrozenxid /
relminmxid. Change the table_relation_copy_for_cluster() interface to
allow the AM to overwrite the horizons that get set on the pg_class
entry.  This'd also in the future allow AMs like heap to compute a
relfrozenxid during rewrite that's the table's actual minimum rather
than a pre-determined value.  Arguably it'd have been better to move
the whole computation / setting of those values into the callback, but
it seems likely that for other reasons it'd be better to be able to
use one value to vacuum/cluster multiple tables (e.g. a toast's
horizon shouldn't be different than the table's).

Reported-By: Heikki Linnakangas
Author: Andres Freund
Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f122@iki.fi
2019-04-23 21:42:12 -07:00
Peter Geoghegan 9b10926263 Prevent O(N^2) unique index insertion edge case.
Commit dd299df8 made nbtree treat heap TID as a tiebreaker column,
establishing the principle that there is only one correct location (page
and page offset number) for every index tuple, no matter what.
Insertions of tuples into non-unique indexes proceed as if heap TID
(scan key's scantid) is just another user-attribute value, but
insertions into unique indexes are more delicate.  The TID value in
scantid must initially be omitted to ensure that the unique index
insertion visits every leaf page that duplicates could be on.  The
scantid is set once again after unique checking finishes successfully,
which can force _bt_findinsertloc() to step right one or more times, to
locate the leaf page that the new tuple must be inserted on.

Stepping right within _bt_findinsertloc() was assumed to occur no more
frequently than stepping right within _bt_check_unique(), but there was
one important case where that assumption was incorrect: inserting a
"duplicate" with NULL values.  Since _bt_check_unique() didn't do any
real work in this case, it wasn't appropriate for _bt_findinsertloc() to
behave as if it was finishing off a conventional unique insertion, where
any existing physical duplicate must be dead or recently dead.
_bt_findinsertloc() might have to grovel through a substantial portion
of all of the leaf pages in the index to insert a single tuple, even
when there were no dead tuples.

To fix, treat insertions of tuples with NULLs into a unique index as if
they were insertions into a non-unique index: never unset scantid before
calling _bt_search() to descend the tree, and bypass _bt_check_unique()
entirely.  _bt_check_unique() is no longer responsible for incoming
tuples with NULL values.

Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com
2019-04-23 10:33:57 -07:00
Andres Freund b5f58cf213 Convert gist to compute page level xid horizon on primary.
Due to parallel development, gist added the missing conflict
information in c952eae52a, while 558a9165e0 moved that computation
to the primary for the index types that already had it.  Thus adapt
gist to also compute on the primary, using
index_compute_xid_horizon_for_tuples() instead of its own copy of the
logic.

This also adds pg_waldump support for XLOG_GIST_DELETE records, which
previously was not properly present.

Bumps WAL version.

Author: Andres Freund
Discussion: https://postgr.es/m/20190406050243.bszosdg4buvabfrt@alap3.anarazel.de
2019-04-22 14:28:30 -07:00
Michael Paquier 47ac2033d4 Simplify some ERROR paths clearing wait events and transient files
Transient files and wait events get normally cleaned up when seeing an
exception (be it in the context of a transaction for a backend or
another process like the checkpointer), hence there is little point in
complicating error code paths to do this work.  This shaves a bit of
code, and removes some extra handling with errno which needed to be
preserved during the cleanup steps done.

Reported-by: Masahiko Sawada
Author: Michael Paquier
Reviewed-by: Tom Lane, Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoDhHYVq5KkXfkaHhmjA-zJYj-e4teiRAJefvXuKJz1tKQ@mail.gmail.com
2019-04-17 09:51:45 +09:00
Alexander Korotkov 1e87198182 Fix division by zero in _bt_vacuum_needs_cleanup()
Checks inside _bt_vacuum_needs_cleanup() allow division by zero to happen when
metad->btm_last_cleanup_num_heap_tuples == 0.  This commit adjusts the
expression so that no division by zero might happen.

Reported-by: Piotr Stefaniak
Discussion: https://postgr.es/m/DB8PR03MB5931C41F7787A95313F08322F22A0%40DB8PR03MB5931.eurprd03.prod.outlook.com
Reviewed-by: Masahiko Sawada
Backpatch-through: 11
2019-04-15 20:20:43 +03:00
Thomas Munro f7feb020c3 Fix GetNewTransactionId()'s interaction with xidVacLimit.
Commit ad308058 switched to returning a FullTransactionId, but
failed to load the potentially updated value in the case where
xidVacLimit is reached and we release and reacquire the lock.
Repair, closing bug #15727.

While reviewing that commit, also fix the size computation used
by EstimateTransactionStateSize() and switch to the mul_size()
macro traditionally used in such expressions.

Author: Thomas Munro
Reported-by: Roman Zharkov
Discussion: https://postgr.es/m/15727-0be246e7d852d229%40postgresql.org
2019-04-12 16:47:50 +12:00