Commit Graph

1676 Commits

Author SHA1 Message Date
Peter Geoghegan 1de58df4fe Add page-level freezing to VACUUM.
Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages.  Individual XIDs and MXIDs fields from tuple
headers now trigger freezing of whole pages, rather than independently
triggering freezing of each individual tuple header field.

Managing the cost of freezing over time now significantly influences
when and how VACUUM freezes.  The overall amount of WAL written is the
single most important freezing related cost, in general.  Freezing each
page's tuples together in batch allows VACUUM to take full advantage of
the freeze plan WAL deduplication optimization added by commit 9e540599.

Also teach VACUUM to trigger page-level freezing whenever it detects
that heap pruning generated an FPI.  We'll have already written a large
amount of WAL just to do that much, so it's very likely a good idea to
get freezing out of the way for the page early.  This only happens in
cases where it will directly lead to marking the page all-frozen in the
visibility map.

In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact.  It doesn't quite work that way in certain rare
cases involving MultiXacts, though.  It is convenient to define "freeze
the page" in a way that gives FreezeMultiXactId the leeway to put off
the work of processing an individual tuple's xmax whenever it happens to
be a MultiXactId that would require an expensive second pass to process
aggressively (allocating a new multi is especially worth avoiding here).
FreezeMultiXactId is eager when processing is cheap (as it usually is),
and lazy in the event of an individual multi that happens to require
expensive second pass processing.  This avoids regressions related to
processing of multis that page-level freezing might otherwise cause.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
2022-12-28 08:50:47 -08:00
Peter Geoghegan 4ce3afb82e Refactor how VACUUM passes around its XID cutoffs.
Use a dedicated struct for the XID/MXID cutoffs used by VACUUM, such as
FreezeLimit and OldestXmin.  This state is initialized in vacuum.c, and
then passed around by code from vacuumlazy.c to heapam.c freezing
related routines.  The new convention is that everybody works off of the
same cutoff state, which is passed around via pointers to const.

Also simplify some of the logic for dealing with frozen xmin in
heap_prepare_freeze_tuple: add dedicated "xmin_already_frozen" state to
clearly distinguish xmin XIDs that we're going to freeze from those that
were already frozen from before.  That way the routine's xmin handling
code is symmetrical with the existing xmax handling code.  This is
preparation for an upcoming commit that will add page level freezing.

Also refactor the control flow within FreezeMultiXactId(), while adding
stricter sanity checks.  We now test OldestXmin directly, instead of
using FreezeLimit as an inexact proxy for OldestXmin.  This is further
preparation for the page level freezing work, which will make the
function's caller cede control of page level freezing to the function
where appropriate (where heap_prepare_freeze_tuple sees a tuple that
happens to contain a MultiXactId in its xmax).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CAH2-WznS9TxXmz2_=SY+SyJyDFbiOftKofM9=aDo68BbXNBUMA@mail.gmail.com
2022-12-22 09:37:59 -08:00
Peter Eisentraut 75f49221c2 Static assertions cleanup
Because we added StaticAssertStmt() first before StaticAssertDecl(),
some uses as well as the instructions in c.h are now a bit backwards
from the "native" way static assertions are meant to be used in C.
This updates the guidance and moves some static assertions to better
places.

Specifically, since the addition of StaticAssertDecl(), we can put
static assertions at the file level.  This moves a number of static
assertions out of function bodies, where they might have been stuck
out of necessity, to perhaps better places at the file level or in
header files.

Also, when the static assertion appears in a position where a
declaration is allowed, then using StaticAssertDecl() is more native
than StaticAssertStmt().

Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/941a04e7-dd6f-c0e4-8cdf-a33b3338cbda%40enterprisedb.com
2022-12-15 10:10:32 +01:00
Tom Lane 20432f8731 Rethink handling of [Prevent|Is]InTransactionBlock in pipeline mode.
Commits f92944137 et al. made IsInTransactionBlock() set the
XACT_FLAGS_NEEDIMMEDIATECOMMIT flag before returning "false",
on the grounds that that kept its API promises equivalent to those of
PreventInTransactionBlock().  This turns out to be a bad idea though,
because it allows an ANALYZE in a pipelined series of commands to
cause an immediate commit, which is unexpected.

Furthermore, if we return "false" then we have another issue,
which is that ANALYZE will decide it's allowed to do internal
commit-and-start-transaction sequences, thus possibly unexpectedly
committing the effects of previous commands in the pipeline.

To fix the latter situation, invent another transaction state flag
XACT_FLAGS_PIPELINING, which explicitly records the fact that we
have executed some extended-protocol command and not yet seen a
commit for it.  Then, require that flag to not be set before allowing
InTransactionBlock() to return "false".

Having done that, we can remove its setting of NEEDIMMEDIATECOMMIT
without fear of causing problems.  This means that the API guarantees
of IsInTransactionBlock now diverge from PreventInTransactionBlock,
which is mildly annoying, but it seems OK given the very limited usage
of IsInTransactionBlock.  (In any case, a caller preferring the old
behavior could always set NEEDIMMEDIATECOMMIT for itself.)

For consistency also require XACT_FLAGS_PIPELINING to not be set
in PreventInTransactionBlock.  This too is meant to prevent commands
such as CREATE DATABASE from silently committing previous commands
in a pipeline.

Per report from Peter Eisentraut.  As before, back-patch to all
supported branches (which sadly no longer includes v10).

Discussion: https://postgr.es/m/65a899dd-aebc-f667-1d0a-abb89ff3abf8@enterprisedb.com
2022-12-13 14:23:58 -05:00
Alvaro Herrera fb958b5da8
Generalize ri_RootToPartitionMap to use for non-partition children
ri_RootToPartitionMap is currently only initialized for tuple routing
target partitions, though a future commit will need the ability to use
it even for the non-partition child tables, so make adjustments to the
decouple it from the partitioning code.

Also, make it lazily initialized via ExecGetRootToChildMap(), making
that function its preferred access path.  Existing third-party code
accessing it directly should no longer do so; consequently, it's been
renamed to ri_RootToChildMap, which also makes it consistent with
ri_ChildToRootMap.

ExecGetRootToChildMap() houses the logic of setting the map appropriately
depending on whether a given child relation is partition or not.

To support this, also add a separate entry point for TupleConversionMap
creation that receives an AttrMap.  No new code here, just split an
existing function in two.

Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqEYUhDXSK5BTvG_xk=eaAEJCD4GS3C6uH7ybBvv+Z_Tmg@mail.gmail.com
2022-12-02 10:35:55 +01:00
Alvaro Herrera ad86d159b6
Add 'missing_ok' argument to build_attrmap_by_name
When it's given as true, return a 0 in the position of the missing
column rather than raising an error.

This is currently unused, but it allows us to reimplement column
permission checking in a subsequent commit.  It seems worth breaking
into a separate commit because it affects unrelated code.

Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqFfiai=qBxPDTjaio_ZcaqUKh+FC=prESrB8ogZgFNNNQ@mail.gmail.com
2022-11-29 09:39:36 +01:00
Thomas Munro cd4329d939 Remove promote_trigger_file.
Previously, an idle startup (recovery) process would wake up every 5
seconds to have a chance to poll for promote_trigger_file, even if that
GUC was not configured.  That promotion triggering mechanism was
effectively superseded by pg_ctl promote and pg_promote() a long time
ago.  There probably aren't many users left and it's very easy to change
to the modern mechanisms, so we agreed to remove the feature.

This is part of a campaign to reduce wakeups on idle systems.

Author: Simon Riggs <simon.riggs@enterprisedb.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Ian Lawrence Barwick <barwick@gmail.com>
Discussion: https://postgr.es/m/CANbhV-FsjnzVOQGBpQ589%3DnWuL1Ex0Ykn74Nh1hEjp2usZSR5g%40mail.gmail.com
2022-11-29 12:08:38 +13:00
David Rowley d09dbeb9bd Speedup hash index builds by skipping needless binary searches
When building hash indexes using the spool method, tuples are added to the
index page in hashkey order.  Because of this, we can safely skip
performing the binary search on the existing tuples on the page to find
the location to insert the tuple based on its hashkey value.  For this
case, we can just always put the tuple at the end of the item array as the
tuples will always arrive in hashkey order.

Testing has shown that this can improve hash index build speeds by 5-15%
with a unique set of integer values.

Author: Simon Riggs
Reviewed-by: David Rowley
Tested-by: David Zhang, Tomas Vondra
Discussion: https://postgr.es/m/CANbhV-GBc5JoG0AneUGPZZW3o4OK5LjBGeKe_icpC3R1McrZWQ@mail.gmail.com
2022-11-24 17:21:44 +13:00
Peter Geoghegan 1489b1ce72 Standardize rmgrdesc recovery conflict XID output.
Standardize on the name snapshotConflictHorizon for all XID fields from
WAL records that generate recovery conflicts when in hot standby mode.
This supersedes the previous latestRemovedXid naming convention.

The new naming convention places emphasis on how the values are actually
used by REDO routines.  How the values are generated during original
execution (details of which vary by record type) is deemphasized.  Users
of tools like pg_waldump can now grep for snapshotConflictHorizon to see
all potential sources of recovery conflicts in a standardized way,
without necessarily having to consider which specific record types might
be involved.

Also bring a couple of WAL record types that didn't follow any kind of
naming convention into line.  These are heapam's VISIBLE record type and
SP-GiST's VACUUM_REDIRECT record type.  Now every WAL record whose REDO
routine calls ResolveRecoveryConflictWithSnapshot() passes through the
snapshotConflictHorizon field from its WAL record.  This is follow-up
work to the refactoring from commit 9e540599 that made FREEZE_PAGE WAL
records use a standard snapshotConflictHorizon style XID cutoff.

No bump in XLOG_PAGE_MAGIC, since the underlying format of affected WAL
records doesn't change.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wzm2CQUmViUq7Opgk=McVREHSOorYaAjR1ZpLYkRN7_dPw@mail.gmail.com
2022-11-17 14:55:08 -08:00
Peter Eisentraut 8e1db29cdb Variable renaming in preparation for refactoring
Rename page -> block and dp -> page where appropriate.  The old naming
mixed up block and page in confusing ways.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com
2022-11-16 16:40:34 +01:00
Peter Eisentraut 4eb3b11200 Turn HeapKeyTest macro into inline function
It is easier to read as a function.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com
2022-11-16 13:26:48 +01:00
Jeff Davis 1eda3ce802 Mark argument of RegisterCustomRmgr() as const. 2022-11-15 16:01:35 -08:00
Peter Geoghegan 9e5405993c Deduplicate freeze plans in freeze WAL records.
Make heapam WAL records that describe freezing performed by VACUUM more
space efficient by storing each distinct "freeze plan" once, alongside
an array of associated page offset numbers (one per freeze plan).  The
freeze plans required for most heap pages tend to naturally have a great
deal of redundancy, so this technique is very effective in practice.  It
often leads to freeze WAL records that are less than 20% of the size of
equivalent WAL records generated using the previous approach.

The freeze plan concept was introduced by commit 3b97e6823b, which fixed
bugs in VACUUM's handling of MultiXacts.  We retain the concept of
freeze plans, but go back to using page offset number arrays.  There is
no loss of generality here because deduplication is an additive process
that gets applied mechanically when FREEZE_PAGE WAL records are built.

More than anything else, freeze plan deduplication is an optimization
that reduces the marginal cost of freezing additional tuples on pages
that will need to have at least one or two tuples frozen in any case.
Ongoing work that adds page-level freezing to VACUUM will take full
advantage of the improved cost profile through batching.

Also refactor some of the details surrounding recovery conflicts needed
to REDO freeze records in passing: make original execution responsible
for generating a standard latestRemovedXid cutoff, rather than working
backwards to get the same cutoff in the REDO routine.  Bugfix commit
66fbcb0d2e did it the other way around, which is equivalent but obscures
what's going on.

Also rename the cutoff field from the WAL record/struct (rename the
field cutoff_xid to latestRemovedXid to match similar WAL records).
Processing of conflicts by REDO routines is already completely uniform,
so tools like pg_waldump should present the information driving the
process uniformly.  There are two remaining WAL record types that still
don't quite follow this convention (heapam's VISIBLE record type and
SP-GiST's VACUUM_REDIRECT record type).  They can be brought into line
by later work that totally standardizes how the cutoffs are presented.

Bump XLOG_PAGE_MAGIC.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CAH2-Wz=XytErMnb8FAyFd+OQEbiipB0Q2FmFdXrggPL4VBnRYQ@mail.gmail.com
2022-11-15 07:48:41 -08:00
Alvaro Herrera ff0d8f27f4
Remove redundant declaration for XidInMVCCSnapshot
This was added for no good reason by c91560defc, after b7eda3e0e3
had just moved the prototype from utils/tqual.h to utils/snapmgr.h.

Author: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/MEYP282MB16693A409F3282A9DB287BADB63E9@MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
2022-11-09 18:30:09 +01:00
Alvaro Herrera 9c0de04242
Reduce xlog.h inclusion footprint
This file needs xlogreader.h only for the XLogReaderState typedef; but
we can dodge that by forward-declaring it.  Many files use xlog.h for
reasons other than reading WAL, and it's not good to force all those
files to include xlogreader.h, so take it out.

Surprisingly, there is no fallout in core code from making this change.
Perhaps external code will have to start including xlogreader.h.
2022-10-12 09:47:11 +02:00
Thomas Munro b6d8a60aba Restore pg_pread and friends.
Commits cf112c12 and a0dc8271 were a little too hasty in getting rid of
the pg_ prefixes where we use pread(), pwrite() and vectored variants.

We dropped support for ancient Unixes where we needed to use lseek() to
implement replacements for those, but it turns out that Windows also
changes the current position even when you pass in an offset to
ReadFile() and WriteFile() if the file handle is synchronous, despite
its documentation saying otherwise.

Switching to asynchronous file handles would fix that, but have other
complications.  For now let's just put back the pg_ prefix and add some
comments to highlight the non-standard side-effect, which we can now
describe as Windows-only.

Reported-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20220923202439.GA1156054%40nathanxps13
2022-09-29 13:12:11 +13:00
Robert Haas a448e49bcb Revert 56-bit relfilenode change and follow-up commits.
There are still some alignment-related failures in the buildfarm,
which might or might not be able to be fixed quickly, but I've also
just realized that it increased the size of many WAL records by 4 bytes
because a block reference contains a RelFileLocator. The effect of that
hasn't been studied or discussed, so revert for now.
2022-09-28 09:55:28 -04:00
Peter Eisentraut c8b2ef05f4 Convert *GetDatum() and DatumGet*() macros to inline functions
The previous macro implementations just cast the argument to a target
type but did not check whether the input type was appropriate.  The
function implementation can do better type checking of the input type.

For the *GetDatumFast() macros, converting to an inline function
doesn't work in the !USE_FLOAT8_BYVAL case, but we can use
AssertVariableIsOfTypeMacro() to get a similar level of type checking.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
2022-09-27 20:50:21 +02:00
Robert Haas 05d4cbf9b6 Increase width of RelFileNumbers from 32 bits to 56 bits.
RelFileNumbers are now assigned using a separate counter, instead of
being assigned from the OID counter. This counter never wraps around:
if all 2^56 possible RelFileNumbers are used, an internal error
occurs. As the cluster is limited to 2^64 total bytes of WAL, this
limitation should not cause a problem in practice.

If the counter were 64 bits wide rather than 56 bits wide, we would
need to increase the width of the BufferTag, which might adversely
impact buffer lookup performance. Also, this lets us use bigint for
pg_class.relfilenode and other places where these values are exposed
at the SQL level without worrying about overflow.

This should remove the need to keep "tombstone" files around until
the next checkpoint when relations are removed. We do that to keep
RelFileNumbers from being recycled, but now that won't happen
anyway. However, this patch doesn't actually change anything in
this area; it just makes it possible for a future patch to do so.

Dilip Kumar, based on an idea from Andres Freund, who also reviewed
some earlier versions of the patch. Further review and some
wordsmithing by me. Also reviewed at various points by Ashutosh
Sharma, Vignesh C, Amul Sul, Álvaro Herrera, and Tom Lane.

Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
2022-09-27 13:25:21 -04:00
Michael Paquier 78fdb1e50f Mark ParallelMessagePending as sig_atomic_t
ParallelMessagePending was previously marked as a boolean which should
be fine on modern platforms, but the C standard recommends the use of
sig_atomic_t for variables manipulated in signal handlers.

Author: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB58667C15A95A234720F4F876F5529@TYAPR01MB5866.jpnprd01.prod.outlook.com
2022-09-27 09:29:56 +09:00
Michael Paquier e1e6f8f3df Remove dependency to StringInfo in xlogbackup.{c.h}
This was used as the returned result type of the generated contents for
the backup_label and backup history files.  This is replaced by a simple
string, reducing the cleanup burden of all the callers of
build_backup_content().

Reviewed-by: Bharath Rupireddy
Discussion: https://postgr.es/m/YzERvNPaZivHEKZJ@paquier.xyz
2022-09-27 09:15:07 +09:00
Michael Paquier 7d708093b7 Refactor creation of backup_label and backup history files
This change simplifies some of the logic related to the generation and
creation of the backup_label and backup history files, which has become
unnecessarily complicated since the removal of the exclusive backup mode
in commit 39969e2.  The code was previously generating the contents of
these files as a string (start phase for the backup_label and stop phase
for the backup history file), one problem being that the contents of the
backup_label string were scanned to grab some of its internal contents
at the stop phase.

This commit changes the logic so as we store the data required to build
these files in an intermediate structure named BackupState.  The
backup_label file and backup history file strings are generated when
they are ready to be sent back to the client.  Both files are now
generated with the same code path.  While on it, this commit renames
some variables for clarity.

Two new files named xlogbackup.{c,h} are introduced in this commit, to
remove from xlog.c some of the logic around base backups.  Note that
more could be moved to this new set of files.

Author: Bharath Rupireddy, Michael Paquier
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/CALj2ACXWwTDgJqCjdaPyfR7djwm6SrybGcrZyrvojzcsmt4FFw@mail.gmail.com
2022-09-26 11:15:47 +09:00
Peter Geoghegan bfcf1b3480 Harmonize parameter names in storage and AM code.
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in storage, catalog,
access method, executor, and logical replication code, as well as in
miscellaneous utility/library code.

Like other recent commits that cleaned up function parameter names, this
commit was written with help from clang-tidy.  Later commits will do the
same for other parts of the codebase.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
2022-09-19 19:18:36 -07:00
Peter Geoghegan 4bac9600f0 Harmonize heapam and tableam parameter names.
Make sure that function declarations use names that exactly match the
corresponding names from function definitions.  Having parameter names
that are reliably consistent in this way will make it easier to reason
about groups of related C functions from the same translation unit as a
module.  It will also make certain refactoring tasks easier.

Like other recent commits that cleaned up function parameter names, this
commit was written with help from clang-tidy.  Later commits will do the
same for other parts of the codebase.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
2022-09-19 16:46:23 -07:00
Tom Lane 0a20ff54f5 Split up guc.c for better build speed and ease of maintenance.
guc.c has grown to be one of our largest .c files, making it
a bottleneck for compilation.  It's also acquired a bunch of
knowledge that'd be better kept elsewhere, because of our not
very good habit of putting variable-specific check hooks here.
Hence, split it up along these lines:

* guc.c itself retains just the core GUC housekeeping mechanisms.
* New file guc_funcs.c contains the SET/SHOW interfaces and some
  SQL-accessible functions for GUC manipulation.
* New file guc_tables.c contains the data arrays that define the
  built-in GUC variables, along with some already-exported constant
  tables.
* GUC check/assign/show hook functions are moved to the variable's
  home module, whenever that's clearly identifiable.  A few hard-
  to-classify hooks ended up in commands/variable.c, which was
  already a home for miscellaneous GUC hook functions.

To avoid cluttering a lot more header files with #include "guc.h",
I also invented a new header file utils/guc_hooks.h and put all
the GUC hook functions' declarations there, regardless of their
originating module.  That allowed removal of #include "guc.h"
from some existing headers.  The fallout from that (hopefully
all caught here) demonstrates clearly why such inclusions are
best minimized: there are a lot of files that, for example,
were getting array.h at two or more levels of remove, despite
not having any connection at all to GUCs in themselves.

There is some very minor code beautification here, such as
renaming a couple of inconsistently-named hook functions
and improving some comments.  But mostly this just moves
code from point A to point B and deals with the ensuing
needs for #include adjustments and exporting a few functions
that previously weren't exported.

Patch by me, per a suggestion from Andres Freund; thanks also
to Michael Paquier for the idea to invent guc_funcs.c.

Discussion: https://postgr.es/m/587607.1662836699@sss.pgh.pa.us
2022-09-13 11:11:45 -04:00
Peter Eisentraut e8d78581bb Revert "Convert *GetDatum() and DatumGet*() macros to inline functions"
This reverts commit 595836e99b.

It has problems when USE_FLOAT8_BYVAL is off.
2022-09-12 19:57:07 +02:00
Peter Eisentraut 595836e99b Convert *GetDatum() and DatumGet*() macros to inline functions
The previous macro implementations just cast the argument to a target
type but did not check whether the input type was appropriate.  The
function implementation can do better type checking of the input type.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
2022-09-12 17:36:26 +02:00
Thomas Munro adb466150b Fix recovery_prefetch with low maintenance_io_concurrency.
We should process completed IOs *before* trying to start more, so that
it is always possible to decode one more record when the decoded record
queue is empty, even if maintenance_io_concurrency is set so low that a
single earlier WAL record might have saturated the IO queue.

That bug was hidden because the effect of maintenance_io_concurrency was
arbitrarily clamped to be at least 2.  Fix the ordering, and also remove
that clamp.  We need a special case for 0, which is now treated the same
as recovery_prefetch=off, but otherwise the number is used directly.
This allows for testing with 1, which would have made the problem
obvious in simple test scenarios.

Also add an explicit error message for missing contrecords.  It was a
bit strange that we didn't report an error already, and became a latent
bug with prefetching, since the internal state that tracks aborted
contrecords would not survive retrying, as revealed by
026_overwrite_contrecord.pl with this adjustment.  Reporting an error
prevents that.

Back-patch to 15.

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20220831140128.GS31833%40telsasoft.com
2022-09-08 21:44:55 +12:00
Thomas Munro 932b016300 Fix cache invalidation bug in recovery_prefetch.
XLogPageRead() can retry internally after a pread() system call has
succeeded, in the case of short reads, and page validation failures
while in standby mode (see commit 0668719801).  Due to an oversight in
commit 3f1ce973, these cases could leave stale data in the internal
cache of xlogreader.c without marking it invalid.  The main defense
against stale cached data on failure to read a page was in the error
handling path of the calling function ReadPageInternal(), but that
wasn't quite enough for errors handled internally by XLogPageRead()'s
retry loop if we then exited with XLREAD_WOULDBLOCK.

1.  ReadPageInternal() now marks the cache invalid before calling the
    page_read callback, by setting state->readLen to 0.  It'll be set to
    a non-zero value only after a successful read.  It'll stay valid as
    long as the caller requests data in the cached range.

2.  XLogPageRead() no long performs internal retries while reading
    ahead.  While such retries should work, the general philosophy is
    that we should give up prefetching if anything unusual happens so we
    can handle it when recovery catches up, to reduce the complexity of
    the system.  Let's do that here too.

3.  While here, a new function XLogReaderResetError() improves the
    separation between xlogrecovery.c and xlogreader.c, where the former
    previously clobbered the latter's internal error buffer directly.
    The new function makes this more explicit, and also clears a related
    flag, without which a standby would needlessly retry in the outer
    function.

Thanks to Noah Misch for tracking down the conditions required for a
rare build farm failure in src/bin/pg_ctl/t/003_promote.pl, and
providing a reproducer.

Back-patch to 15.

Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20220807003627.GA4168930%40rfd.leadboat.com
2022-09-03 13:28:43 +12:00
Amit Kapila c98b6acdb2 Update the comment in rmgrlist.h to match it to the code.
Author: Hayato Kuroda
Reviwed-by: Amit Kapila
Discussion: https://postgr.es/m/TYAPR01MB58665F20F412EDF27B0759CFF5769@TYAPR01MB5866.jpnprd01.prod.outlook.com
2022-08-30 09:16:41 +05:30
Peter Geoghegan b2fe783aec Add missing parenthesis to max item size macro.
Oversight in commit 92f37505, per buildfarm.
2022-08-05 13:06:19 -07:00
Thomas Munro d2e150831a Remove configure probe for fdatasync.
fdatasync() is in SUSv2, and all targeted Unix systems have it.  We have
a replacement function for Windows.

We retain the probe for the function declaration, which allows us to
supply the mysteriously missing declaration for macOS, and also for
Windows.  No need to keep a HAVE_FDATASYNC macro around.

Also rename src/port/fdatasync.c to win32fdatasync.c since it's only for
Windows.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGJZJVO%3DiX%2Beb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ%40mail.gmail.com
2022-08-05 16:37:38 +12:00
Peter Geoghegan 92f375056c Fix nbtree maximum item size macro.
Commit dd299df818, which made heap TID a tiebreaker nbtree index
column, introduced new rules on page space management to make suffix
truncation safe for v4+ indexes.  New pivot tuples (generated by suffix
truncation during leaf page splits) sometimes require dedicated extra
space at the end of a new leaf page high key/pivot to store a heap TID
using a special representation (a representation only used in pivot
tuples).

The definition of "1/3 of a page" was reduced by a single MAXALIGN()
quantum for v4 indexes to make sure that the final enlarged pivot tuple
always fit, even with a split point whose firstright tuple happened to
already be at the "1/3 of a page" limit (limit for non-pivot tuples).
Internal pages (which only contain pivot tuples) stuck with the original
"1/3 of a page" definition.  This scheme made it impossible for any page
split to fail to free enough space for its newitem, which is never okay.

The macro that determines whether non-pivot tuples exceed their "1/3 of
a leaf page" restriction was structured as if space was needed for all
three tuples during a leaf page split (the new pivot plus two very large
adjoining non-pivots that are separated by the split).  This was subtly
wrong, in that it accidentally relied on implementation details that
could (at least in theory) change in the future.

To fix, make the macro subtract a single MAXALIGN() quantum, once.  The
macro evaluates to exactly the same value as before in practice.  But it
no longer depends on the current layout of nbtree's special area struct.

No backpatch, since this isn't a live bug.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Robert Haas <robertmhaas@gmail.com>
Diagnosed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA+Tgmoa7UBxivM7f6Ocx_qbq4=ky3uXc+WZNOBcVX+kvJvWOEA@mail.gmail.com
2022-08-04 20:55:02 -07:00
Thomas Munro cf112c1220 Remove dead pread and pwrite replacement code.
pread() and pwrite() are in SUSv2, and all targeted Unix systems have
them.

Previously, we defined pg_pread and pg_pwrite to emulate these function
with lseek() on old Unixen.  The names with a pg_ prefix were a reminder
of a portability hazard: they might change the current file position.
That hazard is gone, so we can drop the prefixes.

Since the remaining replacement code is Windows-only, move it into
src/port/win32p{read,write}.c, and move the declarations into
src/include/port/win32_port.h.

No need for vestigial HAVE_PREAD, HAVE_PWRITE macros as they were only
used for declarations in port.h which have now moved into win32_port.h.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Greg Stark <stark@mit.edu>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com
2022-08-05 09:49:21 +12:00
Michael Paquier 245e14e28b Fix inconsistent comments for some function declarations in headers
Some of the headers list a couple of function prototypes located in a
different file than what is referred to.  This fixes a couple of
places where this inconsistency exists.

Author: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4__RdcSNXPa7L62Ozvo_Q4LvT60o3Bnp8yrQ_m9y5CKvg@mail.gmail.com
2022-08-04 17:36:21 +09:00
Michael Paquier ffd1b6bb6f Add overflow protection for block-related data in WAL records
XLogRecordBlockHeader, the header holding the information for the data
related to a block, tracks the length of the data appended to the WAL
record with data_length (uint16).  This limitation in size was not
enforced by the public routine in charge of registering the data
assembled later to form the WAL record inserted, XLogRegisterBufData().
Incorrectly used, it could lead to the generation of records with some
of its data overflowed.  This commit adds some safeguards to prevent
that for the block data, complaining immediately if attempting to add to
a record block information with a size larger than UINT16_MAX, which is
the limit implied by the internal logic.

Note that this also adjusts XLogRegisterData() and XLogRegisterBufData()
so as the length of the WAL record data given by the caller is unsigned,
matching with what gets stored in XLogRecData->len.

Extracted from a larger patch by the same author.  The original patch
includes more protections when assembling a record in full that will be
looked at separately later.

Author: Matthias van de Meent
Reviewed-by: Andres Freund, Heikki Linnakangas, Michael Paquier, David
Zhang
Discussion: https://postgr.es/m/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com
2022-07-27 13:35:40 +09:00
Tom Lane f92944137c Force immediate commit after CREATE DATABASE etc in extended protocol.
We have a few commands that "can't run in a transaction block",
meaning that if they complete their processing but then we fail
to COMMIT, we'll be left with inconsistent on-disk state.
However, the existing defenses for this are only watertight for
simple query protocol.  In extended protocol, we didn't commit
until receiving a Sync message.  Since the client is allowed to
issue another command instead of Sync, we're in trouble if that
command fails or is an explicit ROLLBACK.  In any case, sitting
in an inconsistent state while waiting for a client message
that might not come seems pretty risky.

This case wasn't reachable via libpq before we introduced pipeline
mode, but it's always been an intended aspect of extended query
protocol, and likely there are other clients that could reach it
before.

To fix, set a flag in PreventInTransactionBlock that tells
exec_execute_message to force an immediate commit.  This seems
to be the approach that does least damage to existing working
cases while still preventing the undesirable outcomes.

While here, add some documentation to protocol.sgml that explicitly
says how to use pipelining.  That's latent in the existing docs if
you know what to look for, but it's better to spell it out; and it
provides a place to document this new behavior.

Per bug #17434 from Yugo Nagata.  It's been wrong for ages,
so back-patch to all supported branches.

Discussion: https://postgr.es/m/17434-d9f7a064ce2a88a3@postgresql.org
2022-07-26 13:07:03 -04:00
Thomas Munro a1b56090eb Remove O_FSYNC and associated macros.
O_FSYNC was a pre-POSIX way of spelling O_SYNC, supported since commit
9d645fd84c for non-conforming operating systems of the time.  It's not
needed on any modern system.  We can just use standard O_SYNC directly
if it exists (= all targeted systems except Windows), and get rid of our
OPEN_SYNC_FLAG macro.

Similarly for standard O_DSYNC, we can just use that directly if it
exists (= all targeted systems except DragonFlyBSD), and get rid of our
OPEN_DATASYNC_FLAG macro.

We still avoid choosing open_datasync as a default value for
wal_sync_method if O_DSYNC has the same value as O_SYNC (= only
OpenBSD), so there is no change in default behavior.

Discussion: https://postgr.es/m/CA%2BhUKGJE7y92NY7FG2ftUbZUaqohBU65_Ys_7xF5mUHo4wirTQ%40mail.gmail.com
2022-07-22 12:41:17 +12:00
Peter Eisentraut 14a8bd9827 Convert macros to static inline functions (itup.h)
Reviewed-by: Amul Sul <sulamul@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/5b558da8-99fb-0a99-83dd-f72f05388517%40enterprisedb.com
2022-07-19 07:24:55 +02:00
Peter Eisentraut f58d7073b7 Convert macros to static inline functions (tupmacs.h)
Reviewed-by: Amul Sul <sulamul@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/5b558da8-99fb-0a99-83dd-f72f05388517%40enterprisedb.com
2022-07-18 08:01:27 +02:00
Peter Eisentraut 507ba16b28 Convert macros to static inline functions (xlog_internal.h)
Reviewed-by: Amul Sul <sulamul@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/5b558da8-99fb-0a99-83dd-f72f05388517%40enterprisedb.com
2022-07-15 14:56:00 +02:00
David Rowley 0229106afa Overload index_form_tuple to allow the memory context to be supplied
40af10b57 changed things so we make use of a generation memory context for
storing tuples to be sorted by tuplesort.c. That change does not play
nicely with the changes made in 9f03ca915 (back in 2014). That commit
changed things so that index_form_tuple() is called while switched into
the tuplestore's tuplecontext. In order to fetch the tuple from the index,
index_form_tuple() must do various memory allocations which are unrelated
to the storage of the final returned tuple. Although all of these
allocations are pfree'd, the fact that we now use a generation context
means that the memory for these pfree'd allocations won't be used again by
any other allocation due to generation.c's lack of freelists.  This could
result in sorts used for building indexes exceeding maintenance_work_mem
by a very large amount.

Here we fix it so we no longer allocate anything apart from the tuple
itself into the generation context by adding a new version of
index_form_tuple named index_form_tuple_context, which can be called to
specify the MemoryContext to allocate the tuple into.

Discussion: https://postgr.es/m/CAApHDvrHQkiFRHiGiAS-LMOvJN-eK-s762=tVzBz8ZqUea-a_A@mail.gmail.com
Backpatch-through: 15, where 40af10b57 was added.
2022-07-07 08:14:00 +12:00
Robert Haas b0a55e4329 Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.

Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".

Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.

On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.

Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.

Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.

Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 11:39:09 -04:00
Heikki Linnakangas adf6d5dfb2 Fix visibility check when XID is committed in CLOG but not in procarray.
TransactionIdIsInProgress had a fast path to return 'false' if the
single-item CLOG cache said that the transaction was known to be
committed. However, that was wrong, because a transaction is first
marked as committed in the CLOG but doesn't become visible to others
until it has removed its XID from the proc array. That could lead to an
error:

    ERROR:  t_xmin is uncommitted in tuple to be updated

or for an UPDATE to go ahead without blocking, before the previous
UPDATE on the same row was made visible.

The window is usually very short, but synchronous replication makes it
much wider, because the wait for synchronous replica happens in that
window.

Another thing that makes it hard to hit is that it's hard to get such
a commit-in-progress transaction into the single item CLOG cache.
Normally, if you call TransactionIdIsInProgress on such a transaction,
it determines that the XID is in progress without checking the CLOG
and without populating the cache. One way to prime the cache is to
explicitly call pg_xact_status() on the XID. Another way is to use a
lot of subtransactions, so that the subxid cache in the proc array is
overflown, making TransactionIdIsInProgress rely on pg_subtrans and
CLOG checks.

This has been broken ever since it was introduced in 2008, but the race
condition is very hard to hit, especially without synchronous
replication. There were a couple of reports of the error starting from
summer 2021, but no one was able to find the root cause then.

TransactionIdIsKnownCompleted() is now unused. In 'master', remove it,
but I left it in place in backbranches in case it's used by extensions.

Also change pg_xact_status() to check TransactionIdIsInProgress().
Previously, it only checked the CLOG, and returned "committed" before
the transaction was actually made visible to other queries. Note that
this also means that you cannot use pg_xact_status() to reproduce the
bug anymore, even if the code wasn't fixed.

Report and analysis by Konstantin Knizhnik. Patch by Simon Riggs, with
the pg_xact_status() change added by me.

Author: Simon Riggs
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/flat/4da7913d-398c-e2ad-d777-f752cf7f0bbb%40garret.ru
2022-06-27 08:21:08 +03:00
Tomas Vondra e3fcca0d0d Revert changes in HOT handling of BRIN indexes
This reverts commits 5753d4ee32 and fe60b67250 that modified HOT to
ignore BRIN indexes. The commit message for 5753d4ee32 claims that:

    When determining whether an index update may be skipped by using
    HOT, we can ignore attributes indexed only by BRIN indexes. There
    are no index pointers to individual tuples in BRIN, and the page
    range summary will be updated anyway as it relies on visibility
    info.

This is partially incorrect - it's true BRIN indexes don't point to
individual tuples, so HOT chains are not an issue, but the visibitlity
info is not sufficient to keep the index up to date. This can easily
result in corrupted indexes, as demonstrated in the hackers thread.

This does not mean relaxing the HOT restrictions for BRIN is a lost
cause, but it needs to handle the two aspects (allowing HOT chains and
updating the page range summaries) as separate. But that requires a
major changes, and it's too late for that in the current dev cycle.

Reported-by: Tomas Vondra
Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com
2022-06-16 15:02:49 +02:00
Tom Lane e19272ef60 Remove unused-and-misspelled function extern declaration.
Commit c65507763 added "extern XLogRecPtr CalculateMaxmumSafeLSN(void)",
which bears no trace of connection to anything else in that patch
or anywhere else.  Remove it again.

Sergei Kornilov (also spotted by Bharath Rupireddy)

Discussion: https://postgr.es/m/706501646056870@vla3-6a5326aeb4ee.qloud-c.yandex.net
Discussion: https://postgr.es/m/CALj2ACVoQ7NEf43Xz0rfxsGOKYTN5r4VZp2DO2_5p+CMzsRPFw@mail.gmail.com
2022-05-21 13:26:08 -04:00
Andres Freund 905c020bef Add missing 'extern' to function prototypes.
Postgres style is to spell out extern. Noticed while scripting adding
PGDLLIMPORT markers to functions.

Discussion: https://postgr.es/m/20220512164513.vaheofqp2q24l65r@alap3.anarazel.de
2022-05-12 12:39:33 -07:00
Tom Lane 23e7b38bfe Pre-beta mechanical code beautification.
Run pgindent, pgperltidy, and reformat-dat-files.
I manually fixed a couple of comments that pgindent uglified.
2022-05-12 15:17:30 -04:00
Jeff Davis ed57cac84d pg_walinspect: fix case where flush LSN is in the middle of a record.
Instability in the test for pg_walinspect revealed that
pg_get_wal_records_info_till_end_of_wal(x) would try to decode all the
records with a start LSN earlier than the flush LSN, even though that
might include a partial record at the end of the range. In that case,
read_local_xlog_page_no_wait() would return NULL when it tried to read
past the flush LSN, which would be interpreted as an error by the
caller. That caused a test failure only on a BF animal that had been
restarted recently, but could be expected to happen in the wild quite
easily depending on the alignment of various parameters.

Fix by using private data in read_local_xlog_page_no_wait() to signal
end-of-wal to the caller, so that it can be properly distinguished
from a real error.

Discussion: https://postgr.es/m/Ymd/e5eeZMNAkrXo%40paquier.xyz
Discussion: https://postgr.es/m/111657.1650910309@sss.pgh.pa.us

Authors: Thomas Munro, Bharath Rupireddy.
2022-04-30 09:05:32 -07:00
Tom Lane 2cb1272445 Rethink method for assigning OIDs to the template0 and postgres DBs.
Commit aa0105141 assigned fixed OIDs to template0 and postgres
in a very ad-hoc way.  Notably, instead of teaching Catalog.pm
about these OIDs, the unused_oids script was just hacked to
not show them as unused.  That's problematic since, for example,
duplicate_oids wouldn't report any future conflict.  Hence,
invent a macro DECLARE_OID_DEFINING_MACRO() that can be used to
define an OID that is known to Catalog.pm and will participate
in duplicate-detection as well as renumbering by renumber_oids.pl.
(We don't anticipate renumbering these particular OIDs, but we
might as well build out all the Catalog.pm infrastructure while
we're here.)

Another issue is that aa0105141 neglected to touch IsPinnedObject,
with the result that it now claimed template0 and postgres are
pinned.  The right thing to do there seems to be to teach it that
no database is pinned, since in fact DROP DATABASE doesn't check
for pinned-ness (and at least for these cases, that is an
intentional choice).  It's not clear whether this wrong answer
had any visible effect, but perhaps it could have resulted in
erroneous management of dependency entries.

In passing, rename the TemplateDbOid macro to Template1DbOid
to reduce confusion (likely we should have done that way back
when we invented template0, but we didn't), and rename the
OID macros for template0 and postgres to have a similar style.

There are no changes to postgres.bki here, so no need for a
catversion bump.

Discussion: https://postgr.es/m/2935358.1650479692@sss.pgh.pa.us
2022-04-21 16:23:15 -04:00