Commit Graph

85 Commits

Author SHA1 Message Date
Alexander Korotkov b1fe8efdf1 amcheck: Normalize index tuples containing uncompressed varlena
It might happen that the varlena value wasn't compressed by index_form_tuple()
due to current storage parameters.  If compression is currently enabled, we
need to compress such values to match index tuple coming from the heap.

Backpatch to all supported versions.

Discussion: https://postgr.es/m/flat/7bdbe559-d61a-4ae4-a6e1-48abdf3024cc%40postgrespro.ru
Author: Andrey Borodin
Reviewed-by: Alexander Lakhin, Michael Zhilin, Jian He, Alexander Korotkov
Backpatch-through: 12
2024-03-24 00:09:24 +02:00
Alexander Korotkov ab65dfb0fb amcheck: Support for different header sizes of short varlena datum
In the heap, tuples may contain short varlena datum with both 1B header and 4B
headers.  But the corresponding index tuple should always have such varlena's
with 1B headers.  So, for fingerprinting, we need to convert.

Backpatch to all supported versions.

Discussion: https://postgr.es/m/flat/7bdbe559-d61a-4ae4-a6e1-48abdf3024cc%40postgrespro.ru
Author: Michael Zhilin
Reviewed-by: Alexander Lakhin, Andrey Borodin, Jian He, Alexander Korotkov
Backpatch-through: 12
2024-03-24 00:09:24 +02:00
Jeff Davis 59825d1639 Fix buildfarm failures from 2af07e2f74.
Use GUC_ACTION_SAVE rather than GUC_ACTION_SET, necessary for working
with parallel query.

Now that the call requires more arguments, wrap the call in a new
function to avoid code duplication and offer a place for a comment.

Discussion: https://postgr.es/m/E1rhJpO-0027Wf-9L@gemulon.postgresql.org
2024-03-04 19:42:16 -08:00
Jeff Davis 2af07e2f74 Fix search_path to a safe value during maintenance operations.
While executing maintenance operations (ANALYZE, CLUSTER, REFRESH
MATERIALIZED VIEW, REINDEX, or VACUUM), set search_path to
'pg_catalog, pg_temp' to prevent inconsistent behavior.

Functions that are used for functional indexes, in index expressions,
or in materialized views and depend on a different search path must be
declared with CREATE FUNCTION ... SET search_path='...'.

This change was previously committed as 05e1737351, then reverted in
commit 2fcc7ee7af because it was too late in the cycle.

Preparation for the MAINTAIN privilege, which was previously reverted
due to search_path manipulation hazards.

Discussion: https://postgr.es/m/d4ccaf3658cb3c281ec88c851a09733cd9482f22.camel@j-davis.com
Discussion: https://postgr.es/m/E1q7j7Y-000z1H-Hr%40gemulon.postgresql.org
Discussion: https://postgr.es/m/e44327179e5c9015c8dda67351c04da552066017.camel%40j-davis.com
Reviewed-by: Greg Stark, Nathan Bossart, Noah Misch
2024-03-04 17:31:38 -08:00
Bruce Momjian 29275b1d17 Update copyright for 2024
Reported-by: Michael Paquier

Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz

Backpatch-through: 12
2024-01-03 20:49:05 -05:00
Peter Geoghegan c9c0589fda Optimize nbtree backward scan boundary cases.
Teach _bt_binsrch (and related helper routines like _bt_search and
_bt_compare) about the initial positioning requirements of backward
scans.  Routines like _bt_binsrch already know all about "nextkey"
searches, so it seems natural to teach them about "goback"/backward
searches, too.  These concepts are closely related, and are much easier
to understand when discussed together.

Now that certain implementation details are hidden from _bt_first, it's
straightforward to add a new optimization: backward scans using the <
strategy now avoid extra leaf page accesses in certain "boundary cases".
Consider the following example, which uses the tenk1 table (and its
tenk1_hundred index) from the standard regression tests:

SELECT * FROM tenk1 WHERE hundred < 12 ORDER BY hundred DESC LIMIT 1;

Before this commit, nbtree would scan two leaf pages, even though it was
only really necessary to scan one leaf page.  We'll now descend straight
to the leaf page containing a (12, -inf) high key instead.  The scan
will locate matching non-pivot tuples with "hundred" values starting
from the value 11.  The scan won't waste a page access on the right
sibling leaf page, which cannot possibly contain any matching tuples.

You can think of the optimization added by this commit as disabling an
optimization (the _bt_compare "!pivotsearch" behavior that was added to
Postgres 12 in commit dd299df8) for a small subset of cases where it was
always counterproductive.

Equivalently, you can think of the new optimization as extending the
"pivotsearch" behavior that page deletion by VACUUM has long required
(since the aforementioned Postgres 12 commit went in) to other, similar
cases.  Obviously, this isn't strictly necessary for these new cases
(unlike VACUUM, _bt_first is prepared to move the scan to the left once
on the leaf level), but the underlying principle is the same.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=XPzM8HzaLPq278Vms420mVSHfgs9wi5tjFKHcapZCEw@mail.gmail.com
2023-12-08 11:05:17 -08:00
Noah Misch 6ec9e9975e amcheck: Distinguish interrupted page deletion from corruption.
This prevents false-positive reports about "the first child of leftmost
target page is not leftmost of its level", "block %u is not leftmost"
and "left link/right link pair".  They appeared if amcheck ran before
VACUUM cleaned things, after a cluster exited recovery between the
first-stage and second-stage WAL records of a deletion.  Back-patch to
v11 (all supported versions).

Reviewed by Peter Geoghegan.

Discussion: https://postgr.es/m/20231005025232.c7.nmisch@google.com
2023-10-30 14:46:05 -07:00
Alexander Korotkov 675fed4df5 Fix indentation in contrib/amcheck/verify_nbtree.c
Reported-by: Michael Paquier
Discussion: https://postgr.es/m/ZT9YoDPEQBUMrIHg%40paquier.xyz
2023-10-30 10:35:03 +02:00
Alexander Korotkov 5ae2087202 Teach contrib/amcheck to check the unique constraint violation
Add the 'checkunique' argument to bt_index_check() and bt_index_parent_check().
When the flag is specified the procedures will check the unique constraint
violation for unique indexes.  Only one heap entry for all equal keys in
the index should be visible (including posting list entries).  Report an error
otherwise.

pg_amcheck called with the --checkunique option will do the same check for all
the indexes it checks.

Author: Anastasia Lubennikova <lubennikovaav@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://postgr.es/m/CALT9ZEHRn5xAM5boga0qnrCmPV52bScEK2QnQ1HmUZDD301JEg%40mail.gmail.com
2023-10-28 00:21:23 +03:00
Noah Misch 5f27b5f848 Dissociate btequalimage() from interval_ops, ending its deduplication.
Under interval_ops, some equal values are distinguishable.  One such
pair is '24:00:00' and '1 day'.  With that being so, btequalimage()
breaches the documented contract for the "equalimage" btree support
function.  This can cause incorrect results from index-only scans.
Users should REINDEX any btree indexes having interval-type columns.
After updating, pg_amcheck will report an error for almost all such
indexes.  This fix makes interval_ops simply omit the support function,
like numeric_ops does.  Back-pack to v13, where btequalimage() first
appeared.  In back branches, for the benefit of old catalog content,
btequalimage() code will return false for type "interval".  Going
forward, back-branch initdb will include the catalog change.

Reviewed by Peter Geoghegan.

Discussion: https://postgr.es/m/20231011013317.22.nmisch@google.com
2023-10-14 16:33:51 -07:00
Thomas Munro 9f0602539d Remove some more "snapshot too old" vestiges.
Commit f691f5b8 removed the logic, but left behind some now-useless
Snapshot arguments to various AM-internal functions, and missed a couple
of comments.

Reported-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wznj9qSNXZ1P1uWTUD_FeaTezbUazb416EPwi4Qr_jR_6A%40mail.gmail.com
2023-09-08 17:12:12 +12:00
Peter Geoghegan d088ba5a5a nbtree: Allocate new pages in separate function.
Split nbtree's _bt_getbuf function is two: code that read locks or write
locks existing pages remains in _bt_getbuf, while code that deals with
allocating new pages is moved to a new, dedicated function called
_bt_allocbuf.  This simplifies most _bt_getbuf callers, since it is no
longer necessary for them to pass a heaprel argument.  Many of the
changes to nbtree from commit 61b313e4 can be reverted.  This minimizes
the divergence between HEAD/PostgreSQL 16 and earlier release branches.

_bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to
_bt_getbuf.  There are only 3 affected call sites, all of which continue
to pass a heaprel for recovery conflict purposes.  Note that nbtree's
use of P_NEW was superficial; nbtree never actually relied on the P_NEW
code paths in bufmgr.c, so this change is strictly mechanical.

GiST already took the same approach; it has a dedicated function for
allocating new pages called gistNewBuffer().  That factor allowed commit
61b313e4 to make much more targeted changes to GiST.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com
2023-06-10 14:08:25 -07:00
Jeff Davis 2fcc7ee7af Revert "Fix search_path to a safe value during maintenance operations."
This reverts commit 05e1737351.
2023-06-10 08:11:41 -07:00
Jeff Davis 05e1737351 Fix search_path to a safe value during maintenance operations.
While executing maintenance operations (ANALYZE, CLUSTER, REFRESH
MATERIALIZED VIEW, REINDEX, or VACUUM), set search_path to
'pg_catalog, pg_temp' to prevent inconsistent behavior.

Functions that are used for functional indexes, in index expressions,
or in materialized views and depend on a different search path must be
declared with CREATE FUNCTION ... SET search_path='...'.

This change addresses a security risk introduced in commit 60684dd834,
where a role with MAINTAIN privileges on a table may be able to
escalate privileges to the table owner. That commit is not yet part of
any release, so no need to backpatch.

Discussion: https://postgr.es/m/e44327179e5c9015c8dda67351c04da552066017.camel%40j-davis.com
Reviewed-by: Greg Stark
Reviewed-by: Nathan Bossart
2023-06-09 11:20:47 -07:00
Michael Paquier 8961cb9a03 Fix typos in comments
The changes done in this commit impact comments with no direct
user-visible changes, with fixes for incorrect function, variable or
structure names.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com
2023-05-02 12:23:08 +09:00
Andres Freund 61b313e47e Pass down table relation into more index relation functions
This is done in preparation for logical decoding on standby, which needs to
include whether visibility affecting WAL records are about a (user) catalog
table. Which is only known for the table, not the indexes.

It's also nice to be able to pass the heap relation to GlobalVisTestFor() in
vacuumRedirectAndPlaceholder().

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com
2023-04-01 20:18:29 -07:00
Bruce Momjian c8e1ba736b Update copyright for 2023
Backpatch-through: 11
2023-01-02 15:00:37 -05:00
Peter Geoghegan 0faf7d933f Harmonize parameter names in contrib code.
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in contrib code.

Like other recent commits that cleaned up function parameter names, this
commit was written with help from clang-tidy.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
2022-09-22 13:59:20 -07:00
Tom Lane 0a20ff54f5 Split up guc.c for better build speed and ease of maintenance.
guc.c has grown to be one of our largest .c files, making it
a bottleneck for compilation.  It's also acquired a bunch of
knowledge that'd be better kept elsewhere, because of our not
very good habit of putting variable-specific check hooks here.
Hence, split it up along these lines:

* guc.c itself retains just the core GUC housekeeping mechanisms.
* New file guc_funcs.c contains the SET/SHOW interfaces and some
  SQL-accessible functions for GUC manipulation.
* New file guc_tables.c contains the data arrays that define the
  built-in GUC variables, along with some already-exported constant
  tables.
* GUC check/assign/show hook functions are moved to the variable's
  home module, whenever that's clearly identifiable.  A few hard-
  to-classify hooks ended up in commands/variable.c, which was
  already a home for miscellaneous GUC hook functions.

To avoid cluttering a lot more header files with #include "guc.h",
I also invented a new header file utils/guc_hooks.h and put all
the GUC hook functions' declarations there, regardless of their
originating module.  That allowed removal of #include "guc.h"
from some existing headers.  The fallout from that (hopefully
all caught here) demonstrates clearly why such inclusions are
best minimized: there are a lot of files that, for example,
were getting array.h at two or more levels of remove, despite
not having any connection at all to GUCs in themselves.

There is some very minor code beautification here, such as
renaming a couple of inconsistently-named hook functions
and improving some comments.  But mostly this just moves
code from point A to point B and deals with the ensuing
needs for #include adjustments and exporting a few functions
that previously weren't exported.

Patch by me, per a suggestion from Andres Freund; thanks also
to Michael Paquier for the idea to invent guc_funcs.c.

Discussion: https://postgr.es/m/587607.1662836699@sss.pgh.pa.us
2022-09-13 11:11:45 -04:00
Tom Lane dd1c8dd101 Silence compiler warnings from some older compilers.
Since a117cebd6, some older gcc versions issue "variable may be used
uninitialized in this function" complaints for brin_summarize_range.
Silence that using the same coding pattern as in bt_index_check_internal;
arguably, a117cebd6 had too narrow a view of which compilers might give
trouble.

Nathan Bossart and Tom Lane.  Back-patch as the previous commit was.

Discussion: https://postgr.es/m/20220601163537.GA2331988@nathanxps13
2022-06-01 17:21:45 -04:00
Noah Misch a117cebd63 Make relation-enumerating operations be security-restricted operations.
When a feature enumerates relations and runs functions associated with
all found relations, the feature's user shall not need to trust every
user having permission to create objects.  BRIN-specific functionality
in autovacuum neglected to account for this, as did pg_amcheck and
CLUSTER.  An attacker having permission to create non-temp objects in at
least one schema could execute arbitrary SQL functions under the
identity of the bootstrap superuser.  CREATE INDEX (not a
relation-enumerating operation) and REINDEX protected themselves too
late.  This change extends to the non-enumerating amcheck interface.
Back-patch to v10 (all supported versions).

Sergey Shinderuk, reviewed (in earlier versions) by Alexander Lakhin.
Reported by Alexander Lakhin.

Security: CVE-2022-1552
2022-05-09 08:35:08 -07:00
Michael Paquier d16773cdc8 Add macros in hash and btree AMs to get the special area of their pages
This makes the code more consistent with SpGiST, GiST and GIN, that
already use this style, and the idea is to make easier the introduction
of more sanity checks for each of these AM-specific macros.  BRIN uses a
different set of macros to get a page's type and flags, so it has no
need for something similar.

Author: Matthias van de Meent
Discussion: https://postgr.es/m/CAEze2WjE3+tGO9Fs9+iZMU+z6mMZKo54W1Zt98WKqbEUHbHOBg@mail.gmail.com
2022-04-01 13:24:50 +09:00
Bruce Momjian 27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
Michael Paquier 5d08137076 Fix some typos with {a,an}
One of the changes impacts the documentation, so backpatch.

Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+Pu6+c+r3mY24VT7u+H+E_s6vMr5OdRiZ8NT3EOa-E5Lmw@mail.gmail.com
Backpatch-through: 14
2021-12-09 15:20:36 +09:00
Tom Lane 3804539e48 Replace random(), pg_erand48(), etc with a better PRNG API and algorithm.
Standardize on xoroshiro128** as our basic PRNG algorithm, eliminating
a bunch of platform dependencies as well as fundamentally-obsolete PRNG
code.  In addition, this API replacement will ease replacing the
algorithm again in future, should that become necessary.

xoroshiro128** is a few percent slower than the drand48 family,
but it can produce full-width 64-bit random values not only 48-bit,
and it should be much more trustworthy.  It's likely to be noticeably
faster than the platform's random(), depending on which platform you
are thinking about; and we can have non-global state vectors easily,
unlike with random().  It is not cryptographically strong, but neither
are the functions it replaces.

Fabien Coelho, reviewed by Dean Rasheed, Aleksander Alekseev, and myself

Discussion: https://postgr.es/m/alpine.DEB.2.22.394.2105241211230.165418@pseudo
2021-11-28 21:33:07 -05:00
Peter Geoghegan 292698f158 amcheck: Skip unlogged relations in Hot Standby.
Have verify_heapam.c treat unlogged relations as if they were simply
empty when in Hot Standby mode.  This brings it in line with
verify_nbtree.c, which has handled unlogged relations in the same way
since bugfix commit 6754fe65a4.  This was an oversight in commit
866e24d47d, which extended contrib/amcheck to check heap relations.

In passing, lower the verbosity used when reporting that a relation has
been skipped like this, from NOTICE to DEBUG1.  This is appropriate
because the skipping behavior is only an implementation detail, needed
to work around the fact that unlogged tables don't have smgr-level
storage for their main fork when in Hot Standby mode.

Affected unlogged relations should be considered "trivially verified",
not skipped over.  They are verified in the same sense that a totally
empty relation can be verified.  This behavior seems least surprising
overall, since unlogged relations on a replica will initially be empty
if and when the replica is promoted and Hot Standby ends.

Author: Mark Dilger <mark.dilger@enterprisedb.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzk_pukOFY7JmdiFLsrz+Pd3V8OwgC1TH2Vd5BH5ZgK4bA@mail.gmail.com
Backpatch: 14-, where heapam verification was introduced.
2021-10-11 17:21:48 -07:00
Tom Lane f10f0ae420 Replace RelationOpenSmgr() with RelationGetSmgr().
The idea behind this patch is to design out bugs like the one fixed
by commit 9d523119f.  Previously, once one did RelationOpenSmgr(rel),
it was considered okay to access rel->rd_smgr directly for some
not-very-clear interval.  But since that pointer will be cleared by
relcache flushes, we had bugs arising from overreliance on a previous
RelationOpenSmgr call still being effective.

Now, very little code except that in rel.h and relcache.c should ever
touch the rd_smgr field directly.  The normal coding rule is to use
RelationGetSmgr(rel) and not expect the result to be valid for longer
than one smgr function call.  There are a couple of places where using
the function every single time seemed like overkill, but they are now
annotated with large warning comments.

Amul Sul, after an idea of mine.

Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
2021-07-12 17:01:36 -04:00
Tom Lane def5b065ff Initial pgindent and pgperltidy run for v14.
Also "make reformat-dat-files".

The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
2021-05-12 13:14:10 -04:00
Peter Geoghegan bb3ecc8c96 amcheck: MAXALIGN() nbtree special area offset.
This isn't strictly necessary, but in theory it might matter if in the
future the width of the nbtree special area changes -- its total size
might not be an even number of MAXALIGN() quantums, even with padding.
PageInit() MAXALIGN()s all special area offsets, but amcheck uses the
offset to perform initial basic validation of line pointers, so we don't
rely on the offset from the page header.

The real reason to do this is to set a good example for new code that
adds amcheck coverage for other index AMs.

Reported-By: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CALj2ACUMqTR9nErh99FbOBmzCXE9=gXNqhBiwYOhejJJS1LXqQ@mail.gmail.com
2021-04-23 15:37:03 -07:00
Peter Eisentraut 07e5e66742 Improve quoting in some error messages 2021-04-14 09:11:29 +02:00
Peter Geoghegan 65445469d6 amcheck: Reduce debug message verbosity.
Empty sibling pages can occasionally be much more common than any other
event that we report on at elevel DEBUG1.  Increase the elevel for
relevant cases to DEBUG2 to avoid overwhelming the user with relatively
insignificant details.
2021-03-16 13:11:17 -07:00
Peter Geoghegan e5d8a99903 Use full 64-bit XIDs in deleted nbtree pages.
Otherwise we risk "leaking" deleted pages by making them non-recyclable
indefinitely.  Commit 6655a729 did the same thing for deleted pages in
GiST indexes.  That work was used as a starting point here.

Stop storing an XID indicating the oldest bpto.xact across all deleted
though unrecycled pages in nbtree metapages.  There is no longer any
reason to care about that condition/the oldest XID.  It only ever made
sense when wraparound was something _bt_vacuum_needs_cleanup() had to
consider.

The btm_oldest_btpo_xact metapage field has been repurposed and renamed.
It is now btm_last_cleanup_num_delpages, which is used to remember how
many non-recycled deleted pages remain from the last VACUUM (in practice
its value is usually the precise number of pages that were _newly
deleted_ during the specific VACUUM operation that last set the field).

The general idea behind storing btm_last_cleanup_num_delpages is to use
it to give _some_ consideration to non-recycled deleted pages inside
_bt_vacuum_needs_cleanup() -- though never too much.  We only really
need to avoid leaving a truly excessive number of deleted pages in an
unrecycled state forever.  We only do this to cover certain narrow cases
where no other factor makes VACUUM do a full scan, and yet the index
continues to grow (and so actually misses out on recycling existing
deleted pages).

These metapage changes result in a clear user-visible benefit: We no
longer trigger full index scans during VACUUM operations solely due to
the presence of only 1 or 2 known deleted (though unrecycled) blocks
from a very large index.  All that matters now is keeping the costs and
benefits in balance over time.

Fix an issue that has been around since commit 857f9c36, which added the
"skip full scan of index" mechanism (i.e. the _bt_vacuum_needs_cleanup()
logic).  The accuracy of btm_last_cleanup_num_heap_tuples accidentally
hinged upon _when_ the source value gets stored.  We now always store
btm_last_cleanup_num_heap_tuples in btvacuumcleanup().  This fixes the
issue because IndexVacuumInfo.num_heap_tuples (the source field) is
expected to accurately indicate the state of the table _after_ the
VACUUM completes inside btvacuumcleanup().

A backpatchable fix cannot easily be extracted from this commit.  A
targeted fix for the issue will follow in a later commit, though that
won't happen today.

I (pgeoghegan) have chosen to remove any mention of deleted pages in the
documentation of the vacuum_cleanup_index_scale_factor GUC/param, since
the presence of deleted (though unrecycled) pages is no longer of much
concern to users.  The vacuum_cleanup_index_scale_factor description in
the docs now seems rather unclear in any case, and it should probably be
rewritten in the near future.  Perhaps some passing mention of page
deletion will be added back at the same time.

Bump XLOG_PAGE_MAGIC due to nbtree WAL records using full XIDs now.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com
2021-02-24 18:41:34 -08:00
Peter Eisentraut 6f6f284c7e Simplify printing of LSNs
Add a macro LSN_FORMAT_ARGS for use in printf-style printing of LSNs.
Convert all applicable code to use it.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/CAExHW5ub5NaTELZ3hJUCE6amuvqAtsSxc7O+uK7y4t9Rrk23cw@mail.gmail.com
2021-02-23 10:27:02 +01:00
Peter Eisentraut 0e392fcc0d Use errmsg_internal for debug messages
An inconsistent set of debug-level messages was not using
errmsg_internal(), thus uselessly exposing the messages to translation
work.  Fix those.
2021-02-17 11:33:25 +01:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Peter Geoghegan aac80bfcdd Fix amcheck child check pg_upgrade bug.
Commit d114cc53 overlooked the fact that pg_upgrade'd B-Tree indexes
have leaf page high keys whose offset numbers do not match the one from
the copy of the tuple one level up (the copy stored with a downlink for
leaf page's right sibling page).  This led to false positive reports of
corruption from bt_index_parent_check() when it was called to verify a
pg_upgrade'd index.

To fix, skip comparing the offset number on pg_upgrade'd B-Tree indexes.

Author: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Andrew Bille <andrewbille@gmail.com>
Diagnosed-By: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Bug: #16619
Discussion: https://postgr.es/m/16619-aaba10f83fdc1c3c@postgresql.org
Backpatch: 13-, where child check was enhanced.
2020-09-16 10:42:30 -07:00
Andres Freund dc7420c2c9 snapshot scalability: Don't compute global horizons while building snapshots.
To make GetSnapshotData() more scalable, it cannot not look at at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.

Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The trick this commit introduces is to delay computation of accurate horizons
until there use and using horizon boundaries to determine whether accurate
horizons need to be computed.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
   are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further test can benefit from an earlier
computation of accurate horizons.

To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.

This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-12 16:03:49 -07:00
Peter Geoghegan 39132b784a Teach amcheck to verify sibling links in all cases.
Teach contrib/amcheck's bt_index_check() function to check agreement
between siblings links.  The left sibling's right link should point to a
right sibling page whose left link points back to the same original left
sibling.  This extends a check that bt_index_parent_check() always
performed to bt_index_check().

This is the first time amcheck has been taught to perform buffer lock
coupling, which we have explicitly avoided up until now.  The sibling
link check tends to catch a lot of real world index corruption with
little overhead, so it seems worth accepting the complexity.  Note that
the new lock coupling logic would not work correctly on replica servers
without the changes made by commits 0a7d771f and 9a9db08a (there could
be false positives without those changes).

Author: Andrey Borodin, Peter Geoghegan
Discussion: https://postgr.es/m/0EB0CFA8-CBD8-4296-8049-A2C0F28FAE8C@yandex-team.ru
2020-08-08 11:12:01 -07:00
Peter Geoghegan 3a3be80641 Remove obsolete amcheck comment.
Oversight in commit d114cc5387.
2020-08-06 16:23:52 -07:00
Peter Geoghegan c254d8d7b2 amcheck: Sanitize metapage's allequalimage field.
This will be helpful if it ever proves necessary to revoke an opclass's
support for deduplication.

Backpatch: 13-, where nbtree deduplication was introduced.
2020-08-06 15:25:49 -07:00
Alexander Korotkov f47b5e1395 Remove btree page items after page unlink
Currently, page unlink leaves remaining items "as is", but replay of
corresponding WAL-record re-initializes page leaving it with no items.
For the sake of consistency, this commit makes primary delete all the items
during page unlink as well.

Thanks to this change, we now don't mask contents of deleted btree page for
WAL consistency checking.

Discussion: https://postgr.es/m/CAPpHfdt_OTyQpXaPJcWzV2N-LNeNJseNB-K_A66qG%3DL518VTFw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
2020-08-05 02:16:13 +03:00
Alexander Korotkov 34dae902ca Fix amcheck for page checks concurrent to replay of btree page deletion
amcheck expects at least hikey to always exist on leaf page even if it is
deleted page.  But replica reinitializes page during replay of page deletion,
causing deleted page to have no items.  Thus, replay of page deletion can
cause an error in concurrent amcheck run.

This commit relaxes amcheck expectation making it tolerate deleted page with
no items.

Reported-by: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAPpHfdt_OTyQpXaPJcWzV2N-LNeNJseNB-K_A66qG%3DL518VTFw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
Backpatch-through: 11
2020-05-14 12:44:44 +03:00
Peter Geoghegan 18c117cc56 Remove obsolete amcheck comment.
Oversight in commit d114cc53.
2020-05-05 14:36:54 -07:00
Peter Geoghegan bc3087b626 Harmonize nbtree page split point code.
An nbtree split point can be thought of as a point between two adjoining
tuples from an imaginary version of the page being split that includes
the incoming/new item (in addition to the items that really are on the
page).  These adjoining tuples are called the lastleft and firstright
tuples.

The variables that represent split points contained a field called
firstright, which is an offset number of the first data item from the
original page that goes on the new right page.  The corresponding tuple
from origpage was usually the same thing as the actual firstright tuple,
but not always: the firstright tuple is sometimes the new/incoming item
instead.  This situation seems unnecessarily confusing.

Make things clearer by renaming the origpage offset returned by
_bt_findsplitloc() to "firstrightoff".  We now have a firstright tuple
and a firstrightoff offset number which are comparable to the
newitem/lastleft tuples and the newitemoff/lastleftoff offset numbers
respectively.  Also make sure that we are consistent about how we
describe nbtree page split point state.

Push the responsibility for dealing with pg_upgrade'd !heapkeyspace
indexes down to lower level code, relieving _bt_split() from dealing
with it directly.  This means that we always have a palloc'd left page
high key on the leaf level, no matter what.  This enables simplifying
some of the code (and code comments) within _bt_split().

Finally, restructure the page split code to make it clearer why suffix
truncation (which only takes place during leaf page splits) is
completely different to the first data item truncation that takes place
during internal page splits.  Tuples are marked as having fewer
attributes stored in both cases, and the firstright tuple is truncated
in both cases, so it's easy to imagine somebody missing the distinction.
2020-04-13 16:39:55 -07:00
Peter Geoghegan 20fbb711ef Add contrib/amcheck debug message.
Add a DEBUG1 message indicating that verification of the index structure
is underway.  Also reduce the severity level of the existing "tree
level" debug message to DEBUG1.  It should never have been made DEBUG2.
Any B-Tree index with more than a couple of levels will generally also
have so many pages that the per-page DEBUG2 messages will become
completely unmanageable.

In passing, add a new "Tip" to the docs that advises users that run into
corruption that the debug messages might provide useful additional
context.
2020-04-10 17:44:08 -07:00
Alexander Korotkov d114cc5387 Improve checking of child pages in contrib/amcheck.
This commit eliminates lossiness in check for missing parent downlinks in
B-tree.  Instead of collecting lossy bitmap, we check for missing downlinks
while visiting child pages referenced by downlinks of target level.  We
traverse from previous child page to the subsequent child page by right links.
Intermediate pages are candidates to have lost parent downlinks.

Also this commit introduces matching of child high key to the pivot key of
it's parent.

Discussion: https://postgr.es/m/CAPpHfduoF-c4RhOyOm%3D4-Y367%2B8txq9Q6iM_ty0OYc8si1Abww%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
2020-03-11 12:00:31 +03:00
Peter Geoghegan 0d861bbb70 Add deduplication to nbtree.
Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
posting list tuples are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.  Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.

A new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible.  This decision will be
reviewed at the end of the Postgres 13 beta period.

There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values.  The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely.  Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12).  However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from.  This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
2020-02-26 13:05:30 -08:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Peter Geoghegan fcf3b6917b Rename nbtree tuple macros.
Rename two function-style macros, removing the word "inner".  This makes
things more consistent.
2019-12-16 17:49:45 -08:00
Andres Freund aae50236e4 Pass ItemPointer not HeapTuple to IndexBuildCallback.
Not all AMs use HeapTuples internally, making it inconvenient to pass
a HeapTuple. As the index callbacks really only need the TID, not the
full tuple, modify callback to only take ItemPointer.

Author: Ashwin Agrawal
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CALfoeis6=8ehuR=VNtHvj3z16cYfCwPdTcpaxU+sfSUJ5QgR3g@mail.gmail.com
2019-11-08 11:49:29 -08:00