Document that TransactionIdDidAbort() won't indicate that transactions
that were in-progress during a crash have aborted. Tie this to existing
discussion of the TransactionIdDidCommit() and TransactionIdDidCommit()
protocol that code in heapam_visibility.c (and a few other places) must
observe.
Follow-up to bugfix commit eb5ad4ff.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wzn4bEEqgmaUQL3aJ73yM9gAeK-wE4ngi7kjRjLztb+P0w@mail.gmail.com
There are a number of places where a shell command is constructed with
percent-placeholders (like %x). It's cumbersome to have to open-code
this several times. This factors out this logic into a separate
function. This also allows us to ensure consistency for and document
some subtle behaviors, such as what to do with unrecognized
placeholders.
The unified handling is now that incorrect and unknown placeholders
are an error, where previously in most cases they were skipped or
ignored. This affects the following settings:
- archive_cleanup_command
- archive_command
- recovery_end_command
- restore_command
- ssl_passphrase_command
The following settings are part of this refactoring but already had
stricter error handling and should be unchanged in their behavior:
- basebackup_to_shell.command
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/5238bbed-0b01-83a6-d4b2-7eb0562a054e%40enterprisedb.com
Currently, for large transactions, the publisher sends the data in
multiple streams (changes divided into chunks depending upon
logical_decoding_work_mem), and then on the subscriber-side, the apply
worker writes the changes into temporary files and once it receives the
commit, it reads from those files and applies the entire transaction. To
improve the performance of such transactions, we can instead allow them to
be applied via parallel workers.
In this approach, we assign a new parallel apply worker (if available) as
soon as the xact's first stream is received and the leader apply worker
will send changes to this new worker via shared memory. The parallel apply
worker will directly apply the change instead of writing it to temporary
files. However, if the leader apply worker times out while attempting to
send a message to the parallel apply worker, it will switch to
"partial serialize" mode - in this mode, the leader serializes all
remaining changes to a file and notifies the parallel apply workers to
read and apply them at the end of the transaction. We use a non-blocking
way to send the messages from the leader apply worker to the parallel
apply to avoid deadlocks. We keep this parallel apply assigned till the
transaction commit is received and also wait for the worker to finish at
commit. This preserves commit ordering and avoid writing to and reading
from files in most cases. We still need to spill if there is no worker
available.
This patch also extends the SUBSCRIPTION 'streaming' parameter so that the
user can control whether to apply the streaming transaction in a parallel
apply worker or spill the change to disk. The user can set the streaming
parameter to 'on/off', or 'parallel'. The parameter value 'parallel' means
the streaming will be applied via a parallel apply worker, if available.
The parameter value 'on' means the streaming transaction will be spilled
to disk. The default value is 'off' (same as current behaviour).
In addition, the patch extends the logical replication STREAM_ABORT
message so that abort_lsn and abort_time can also be sent which can be
used to update the replication origin in parallel apply worker when the
streaming transaction is aborted. Because this message extension is needed
to support parallel streaming, parallel streaming is not supported for
publications on servers < PG16.
Author: Hou Zhijie, Wang wei, Amit Kapila with design inputs from Sawada Masahiko
Reviewed-by: Sawada Masahiko, Peter Smith, Dilip Kumar, Shi yu, Kuroda Hayato, Shveta Mallik
Discussion: https://postgr.es/m/CAA4eK1+wyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw@mail.gmail.com
Waken related worker processes immediately at commit of a transaction
that has performed ALTER SUBSCRIPTION (including the RENAME and
OWNER variants). This reduces the response time for such operations.
In the real world that might not be worth much, but it shaves several
seconds off the runtime for the subscription test suite.
In the case of PREPARE, we just throw away this notification state;
it doesn't seem worth the work to preserve it. The workers will
still react after the eventual COMMIT PREPARED, but not as quickly.
Nathan Bossart
Discussion: https://postgr.es/m/20221122004119.GA132961@nathanxps13
We cannot rely on TransactionIdDidAbort here, since in general it may
report transactions that were in-progress at the time of an earlier hard
crash as not aborted, effectively behaving as if they were still in
progress even after crash recovery completes. Go back to defensively
verifying that xmax didn't commit instead.
Oversight in commit 79d4bf4e.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230104035636.hy5djyr2as4gbc4q@awork3.anarazel.de
Commit 4f627f89 switched SLRU truncation for multixacts back to being a
task performed during VACUUM, but missed some comments that continued to
reference truncation happening as part of checkpointing. Update those
comments now.
Also update comments that became obsolete when commit c3ffa731 changed
the way that vacuum_multixact_freeze_min_age is applied by VACUUM as it
computes its MultiXactCutoff cutoff (which is used by VACUUM to decide
what to freeze). Explain the same issues by referencing how OldestMxact
is the latest valid value that relminmxid can ever be advanced to at the
end of a VACUUM (following the work in commit 0b018fab).
pg_xact lookups are relatively expensive. Move the xmin/xmax commit
status checks from the point that freeze plans are prepared to the point
that they're actually executed. Otherwise we'll repeat many commit
status checks whenever multiple successive VACUUM operations scan the
same pages and decide against freezing each time, which is a waste of
cycles.
Oversight in commit 1de58df4, which added page-level freezing.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkZpe4K6qMfEt8H4qYJCKc2R7TPvKsBva7jc9w7iGXQSw@mail.gmail.com
Improve comments added by commit 1de58df4 which describe the
lazy_scan_prune "freeze the page" path. These newly revised comments
are based on suggestions from Jeff Davis.
In passing, remove nearby visibility_cutoff_xid comments left over from
commit 6daeeb1f.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/ebc857107fe3edd422ef8a65191ca4a8da568b9b.camel@j-davis.com
While on it, newlines are removed from the end of two elog() strings.
The others are simple grammar mistakes. One comment in pg_upgrade
referred incorrectly to sequences since a7e5457.
Author: Justin Pryzby
Discussion: https://postgr.es/m/20221230231257.GI1153@telsasoft.com
Backpatch-through: 11
The term "truncation" has been ambiguous since commit 10a8d13823 added
line pointer array truncation during heap pruning. Clear things up by
specifying that we're talking about rel truncation here, to match nearby
comments that apply to tuples with storage.
Don't allow VACUUM to WAL-log the value FrozenTransactionId as the
snapshotConflictHorizon of freezing or visibility map related WAL
records.
The only special XID value that's an allowable snapshotConflictHorizon
is InvalidTransactionId, which is interpreted as "record definitely
doesn't require a recovery conflict".
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com
When brin_minmax_multi_union merges summaries, we may end up with just a
single range after merge_overlapping_ranges. The summaries may contain
just one range each, and they may overlap (or be exactly the same).
With a single range there's no distance to calculate, but we happen to
call build_distances anyway - which is fine, we don't calculate the
distance in this case, except that with asserts this failed due to a
check there are at least two ranges.
The assert is unnecessarily strict, so relax it a bit and bail out if
there's just a single range. The relaxed assert would be enough, but
this way we don't allocate unnecessary memory for distance.
Backpatch to 14, where minmax-multi opclasses were introduced.
Reported-by: Jaime Casanova
Backpatch-through: 14
Discussion: https://postgr.es/m/YzVA55qS0hgz8P3r@ahch-to
Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages. Individual XIDs and MXIDs fields from tuple
headers now trigger freezing of whole pages, rather than independently
triggering freezing of each individual tuple header field.
Managing the cost of freezing over time now significantly influences
when and how VACUUM freezes. The overall amount of WAL written is the
single most important freezing related cost, in general. Freezing each
page's tuples together in batch allows VACUUM to take full advantage of
the freeze plan WAL deduplication optimization added by commit 9e540599.
Also teach VACUUM to trigger page-level freezing whenever it detects
that heap pruning generated an FPI. We'll have already written a large
amount of WAL just to do that much, so it's very likely a good idea to
get freezing out of the way for the page early. This only happens in
cases where it will directly lead to marking the page all-frozen in the
visibility map.
In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact. It doesn't quite work that way in certain rare
cases involving MultiXacts, though. It is convenient to define "freeze
the page" in a way that gives FreezeMultiXactId the leeway to put off
the work of processing an individual tuple's xmax whenever it happens to
be a MultiXactId that would require an expensive second pass to process
aggressively (allocating a new multi is especially worth avoiding here).
FreezeMultiXactId is eager when processing is cheap (as it usually is),
and lazy in the event of an individual multi that happens to require
expensive second pass processing. This avoids regressions related to
processing of multis that page-level freezing might otherwise cause.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
When VACUUM determines that an existing MultiXact should use a freeze
plan that sets xmax to InvalidTransactionId, the original Multi may or
may not be before OldestMxact. Remove an incorrect assertion that
expected it to always be from before OldestMxact.
Oversight in commit 4ce3af.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/TYAPR01MB5866B24104FD80B5D7E65C3EF5ED9@TYAPR01MB5866.jpnprd01.prod.outlook.com
The former name was discussed as being confusing, so use "split", as per
a suggestion from Magnus Hagander.
While on it, one of the output arguments is renamed from "segno" to
"segment_number", as per a suggestion from Kyotaro Horiguchi.
The documentation is updated to reflect all these changes.
Bump catalog version.
Author: Bharath Rupireddy, Michael Paquier
Discussion: https://postgr.es/m/CABUevEytQVaOOhGdoh0D7hGwe3fuKcRF6NthsSW7ww04EmtFgQ@mail.gmail.com
Perform a failsafe check every time VACUUM's first heap scan scans a
further FAILSAFE_EVERY_PAGES pages, rather than using an approach based
on the number of physical blocks that our current blkno is from the
blkno at the time of the previous failsafe check. That way VACUUM will
perform a failsafe check every time it has scanned a uniform number of
pages, without it mattering when or how VACUUM skipped pages using the
visibility map.
Sami Imseih, with changes to FAILSAFE_EVERY_PAGES comments added by me.
Author: Sami Imseih <simseih@amazon.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/401CE010-4049-4B94-9961-0B610A5D254D%40amazon.com
Use a dedicated struct for the XID/MXID cutoffs used by VACUUM, such as
FreezeLimit and OldestXmin. This state is initialized in vacuum.c, and
then passed around by code from vacuumlazy.c to heapam.c freezing
related routines. The new convention is that everybody works off of the
same cutoff state, which is passed around via pointers to const.
Also simplify some of the logic for dealing with frozen xmin in
heap_prepare_freeze_tuple: add dedicated "xmin_already_frozen" state to
clearly distinguish xmin XIDs that we're going to freeze from those that
were already frozen from before. That way the routine's xmin handling
code is symmetrical with the existing xmax handling code. This is
preparation for an upcoming commit that will add page level freezing.
Also refactor the control flow within FreezeMultiXactId(), while adding
stricter sanity checks. We now test OldestXmin directly, instead of
using FreezeLimit as an inexact proxy for OldestXmin. This is further
preparation for the page level freezing work, which will make the
function's caller cede control of page level freezing to the function
where appropriate (where heap_prepare_freeze_tuple sees a tuple that
happens to contain a MultiXactId in its xmax).
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CAH2-WznS9TxXmz2_=SY+SyJyDFbiOftKofM9=aDo68BbXNBUMA@mail.gmail.com
This shaves some code by replacing the combinations of
CreateTemplateTupleDesc()/TupleDescInitEntry() hardcoding a mapping of
the attributes listed in pg_proc.dat by get_call_result_type() to build
the TupleDesc needed for the rows generated.
get_call_result_type() is more expensive than the former style, but this
removes some duplication with the lists of OUT parameters (pg_proc.dat
and the attributes hardcoded in these code paths). This is applied to
functions that are not considered as critical (aka that could be called
repeatedly for monitoring purposes).
Author: Bharath Rupireddy
Reviewed-by: Robert Haas, Álvaro Herrera, Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACV23HW5HP5hFjd89FNS-z5X8r2jNXdMXcpN2BgTtKd87w@mail.gmail.com
This function takes in input a WAL segment name and returns a tuple made
of the segment sequence number (dependent on the WAL segment size of the
cluster) and its timeline, as of a thin SQL wrapper around the existing
XLogFromFileName().
This function has multiple usages, like being able to compile a LSN from
a file name and an offset, or finding the timeline of a segment without
having to do to some maths based on the first eight characters of the
segment.
Bump catalog version.
Author: Bharath Rupireddy
Reviewed-by: Nathan Bossart, Kyotaro Horiguchi, Maxim Orlov, Michael
Paquier
Discussion: https://postgr.es/m/CALj2ACWV=FCddsxcGbVOA=cvPyMr75YCFbSQT6g4KDj=gcJK4g@mail.gmail.com
Because we added StaticAssertStmt() first before StaticAssertDecl(),
some uses as well as the instructions in c.h are now a bit backwards
from the "native" way static assertions are meant to be used in C.
This updates the guidance and moves some static assertions to better
places.
Specifically, since the addition of StaticAssertDecl(), we can put
static assertions at the file level. This moves a number of static
assertions out of function bodies, where they might have been stuck
out of necessity, to perhaps better places at the file level or in
header files.
Also, when the static assertion appears in a position where a
declaration is allowed, then using StaticAssertDecl() is more native
than StaticAssertStmt().
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/941a04e7-dd6f-c0e4-8cdf-a33b3338cbda%40enterprisedb.com
Commits f92944137 et al. made IsInTransactionBlock() set the
XACT_FLAGS_NEEDIMMEDIATECOMMIT flag before returning "false",
on the grounds that that kept its API promises equivalent to those of
PreventInTransactionBlock(). This turns out to be a bad idea though,
because it allows an ANALYZE in a pipelined series of commands to
cause an immediate commit, which is unexpected.
Furthermore, if we return "false" then we have another issue,
which is that ANALYZE will decide it's allowed to do internal
commit-and-start-transaction sequences, thus possibly unexpectedly
committing the effects of previous commands in the pipeline.
To fix the latter situation, invent another transaction state flag
XACT_FLAGS_PIPELINING, which explicitly records the fact that we
have executed some extended-protocol command and not yet seen a
commit for it. Then, require that flag to not be set before allowing
InTransactionBlock() to return "false".
Having done that, we can remove its setting of NEEDIMMEDIATECOMMIT
without fear of causing problems. This means that the API guarantees
of IsInTransactionBlock now diverge from PreventInTransactionBlock,
which is mildly annoying, but it seems OK given the very limited usage
of IsInTransactionBlock. (In any case, a caller preferring the old
behavior could always set NEEDIMMEDIATECOMMIT for itself.)
For consistency also require XACT_FLAGS_PIPELINING to not be set
in PreventInTransactionBlock. This too is meant to prevent commands
such as CREATE DATABASE from silently committing previous commands
in a pipeline.
Per report from Peter Eisentraut. As before, back-patch to all
supported branches (which sadly no longer includes v10).
Discussion: https://postgr.es/m/65a899dd-aebc-f667-1d0a-abb89ff3abf8@enterprisedb.com
Pay down some ancient technical debt (dating to commit 022fd9966):
fix a couple of places in datetime parsing that were throwing
ereport's immediately instead of returning a DTERR code that could be
interpreted by DateTimeParseError. The reason for that was that there
was no mechanism for passing any auxiliary data (such as a zone name)
to DateTimeParseError, and these errors seemed to really need it.
Up to now it didn't matter that much just where the error got thrown,
but now we'd like to have a hard policy that datetime parse errors
get thrown from just the one place.
Hence, invent a "DateTimeErrorExtra" struct that can be used to
carry any extra values needed for specific DTERR codes. Perhaps
in the future somebody will be motivated to use this to improve
the specificity of other DateTimeParseError messages, but for now
just deal with the timezone-error cases.
This is on the way to making the datetime input functions report
parse errors softly; but it's really an independent change, so
commit separately.
Discussion: https://postgr.es/m/3bbbb0df-7382-bf87-9737-340ba096e034@postgrespro.ru
The same code pattern is repeated 17 times for int64 counters (0 for
missing entry) and 5 times for timestamps (NULL for missing entry) on
table entries. This code is switched to use a macro for the basic code
instead, shaving a few hundred lines of originally-duplicated code. The
function names remain the same, but some fields of PgStat_StatTabEntry
have to be renamed to cope with the new style.
Author: Bertrand Drouvot
Reviewed-by: Nathan Bossart
Discussion: https:/postgr.es/m/20221204173207.GA2669116@nathanxps13
Passing a NULL snapshot (InvalidSnapshot) is going to work but only as long
as the index can't find any matching rows. This can be confusing for
the extension authors, so add an explicit check for this argument. The check
is implemented with Assert() in order to avoid overhead in release builds.
Reported-by: Sven Klemm
Discussion: https://postgr.es/m/CAJ7c6TPxitD4vbKyP-mpmC1XwyHdPPqvjLzm%2BVpB88h8LGgneQ%40mail.gmail.com
Author: Aleksander Alekseev
Reviewed-by: Pavel Borisov
The error messages reported during any failures while reading or
validating the header of a WAL currently includes only the offset of the
page but not the compiled LSN referring to the page, requiring an extra
step to compile it if looking at the surroundings with pg_waldump or
similar. Adding this information costs a bit in translation, but also
eases debugging.
Author: Bharath Rupireddy
Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, Maxim Orlov, Michael
Paquier
Discussion: https://postgr.es/m/CALj2ACWV=FCddsxcGbVOA=cvPyMr75YCFbSQT6g4KDj=gcJK4g@mail.gmail.com
ri_RootToPartitionMap is currently only initialized for tuple routing
target partitions, though a future commit will need the ability to use
it even for the non-partition child tables, so make adjustments to the
decouple it from the partitioning code.
Also, make it lazily initialized via ExecGetRootToChildMap(), making
that function its preferred access path. Existing third-party code
accessing it directly should no longer do so; consequently, it's been
renamed to ri_RootToChildMap, which also makes it consistent with
ri_ChildToRootMap.
ExecGetRootToChildMap() houses the logic of setting the map appropriately
depending on whether a given child relation is partition or not.
To support this, also add a separate entry point for TupleConversionMap
creation that receives an AttrMap. No new code here, just split an
existing function in two.
Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqEYUhDXSK5BTvG_xk=eaAEJCD4GS3C6uH7ybBvv+Z_Tmg@mail.gmail.com
Previously, we'd compress only when the active range of array entries
reached Max(4 * PROCARRAY_MAXPROCS, 2 * pArray->numKnownAssignedXids).
If max_connections is large, the first term could result in not
compressing for a long time, resulting in much wastage of cycles in
hot-standby backends scanning the array to take snapshots. Get rid
of that term, and just bound it to 2 * pArray->numKnownAssignedXids.
That however creates the opposite risk, that we might spend too much
effort compressing. Hence, consider compressing only once every 128
commit records. (This frequency was chosen by benchmarking. While
we only tried one benchmark scenario, the results seem stable over
a fairly wide range of frequencies.)
Also, force compression when processing RecoveryInfo WAL records
(which should be infrequent); the old code could perform compression
then, but would do so only after the same array-range check as for
the transaction-commit path.
Also, opportunistically run compression if the startup process is about
to wait for WAL, though not oftener than once a second. This should
prevent cases where we waste lots of time by leaving the array
not-compressed for long intervals due to low WAL traffic.
Lastly, add a simple check to keep us from uselessly compressing
when the array storage is already compact.
Back-patch, as the performance problem is worse in pre-v14 branches
than in HEAD.
Simon Riggs and Michail Nikolaev, with help from Tom Lane and
Andres Freund.
Discussion: https://postgr.es/m/CALdSSPgahNUD_=pB_j=1zSnDBaiOtqVfzo8Ejt5J_k7qZiU1Tw@mail.gmail.com
When it's given as true, return a 0 in the position of the missing
column rather than raising an error.
This is currently unused, but it allows us to reimplement column
permission checking in a subsequent commit. It seems worth breaking
into a separate commit because it affects unrelated code.
Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqFfiai=qBxPDTjaio_ZcaqUKh+FC=prESrB8ogZgFNNNQ@mail.gmail.com
Previously, an idle startup (recovery) process would wake up every 5
seconds to have a chance to poll for promote_trigger_file, even if that
GUC was not configured. That promotion triggering mechanism was
effectively superseded by pg_ctl promote and pg_promote() a long time
ago. There probably aren't many users left and it's very easy to change
to the modern mechanisms, so we agreed to remove the feature.
This is part of a campaign to reduce wakeups on idle systems.
Author: Simon Riggs <simon.riggs@enterprisedb.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Ian Lawrence Barwick <barwick@gmail.com>
Discussion: https://postgr.es/m/CANbhV-FsjnzVOQGBpQ589%3DnWuL1Ex0Ykn74Nh1hEjp2usZSR5g%40mail.gmail.com
When building hash indexes using the spool method, tuples are added to the
index page in hashkey order. Because of this, we can safely skip
performing the binary search on the existing tuples on the page to find
the location to insert the tuple based on its hashkey value. For this
case, we can just always put the tuple at the end of the item array as the
tuples will always arrive in hashkey order.
Testing has shown that this can improve hash index build speeds by 5-15%
with a unique set of integer values.
Author: Simon Riggs
Reviewed-by: David Rowley
Tested-by: David Zhang, Tomas Vondra
Discussion: https://postgr.es/m/CANbhV-GBc5JoG0AneUGPZZW3o4OK5LjBGeKe_icpC3R1McrZWQ@mail.gmail.com
Pass VACUUM parameters (VacuumParams state) to vacuum_set_xid_limits()
directly, rather than passing most individual VacuumParams fields as
separate arguments.
Also make vacuum_set_xid_limits() output parameter symbol names match
those used by its vacuumlazy.c caller.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=TE7gW5DgSahDkf0UEZigFGAoHNNN6EvSrdzC=Kn+hrA@mail.gmail.com
We shouldn't ever need to rely on whether HEAP_XMAX_INVALID is set in
t_infomask when considering whether or not an xmax should be deemed
already frozen, since that status flag is just a hint. The only
acceptable representation for an "xmax_already_frozen" raw xmax field is
the transaction ID value zero (also known as InvalidTransactionId).
Adjust code that superficially appeared to rely on HEAP_XMAX_INVALID to
make the rule about xmax_already_frozen clear. Also avoid needlessly
rereading the tuple's raw xmax.
Oversight in bugfix commit d2599ecf. There is no evidence that this
ever led to incorrect behavior, so no backpatch. The worst consequence
of this bug was that VACUUM could hypothetically fail to notice and
report on certain kinds of corruption, which seems fairly benign.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzkh3DMCDRPfhZxj9xCq9v3WmzvmbiCpf1dNKUBPadhCbQ@mail.gmail.com
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still is on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
Standardize on the name snapshotConflictHorizon for all XID fields from
WAL records that generate recovery conflicts when in hot standby mode.
This supersedes the previous latestRemovedXid naming convention.
The new naming convention places emphasis on how the values are actually
used by REDO routines. How the values are generated during original
execution (details of which vary by record type) is deemphasized. Users
of tools like pg_waldump can now grep for snapshotConflictHorizon to see
all potential sources of recovery conflicts in a standardized way,
without necessarily having to consider which specific record types might
be involved.
Also bring a couple of WAL record types that didn't follow any kind of
naming convention into line. These are heapam's VISIBLE record type and
SP-GiST's VACUUM_REDIRECT record type. Now every WAL record whose REDO
routine calls ResolveRecoveryConflictWithSnapshot() passes through the
snapshotConflictHorizon field from its WAL record. This is follow-up
work to the refactoring from commit 9e540599 that made FREEZE_PAGE WAL
records use a standard snapshotConflictHorizon style XID cutoff.
No bump in XLOG_PAGE_MAGIC, since the underlying format of affected WAL
records doesn't change.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wzm2CQUmViUq7Opgk=McVREHSOorYaAjR1ZpLYkRN7_dPw@mail.gmail.com
Make heapam WAL records that describe freezing performed by VACUUM more
space efficient by storing each distinct "freeze plan" once, alongside
an array of associated page offset numbers (one per freeze plan). The
freeze plans required for most heap pages tend to naturally have a great
deal of redundancy, so this technique is very effective in practice. It
often leads to freeze WAL records that are less than 20% of the size of
equivalent WAL records generated using the previous approach.
The freeze plan concept was introduced by commit 3b97e6823b, which fixed
bugs in VACUUM's handling of MultiXacts. We retain the concept of
freeze plans, but go back to using page offset number arrays. There is
no loss of generality here because deduplication is an additive process
that gets applied mechanically when FREEZE_PAGE WAL records are built.
More than anything else, freeze plan deduplication is an optimization
that reduces the marginal cost of freezing additional tuples on pages
that will need to have at least one or two tuples frozen in any case.
Ongoing work that adds page-level freezing to VACUUM will take full
advantage of the improved cost profile through batching.
Also refactor some of the details surrounding recovery conflicts needed
to REDO freeze records in passing: make original execution responsible
for generating a standard latestRemovedXid cutoff, rather than working
backwards to get the same cutoff in the REDO routine. Bugfix commit
66fbcb0d2e did it the other way around, which is equivalent but obscures
what's going on.
Also rename the cutoff field from the WAL record/struct (rename the
field cutoff_xid to latestRemovedXid to match similar WAL records).
Processing of conflicts by REDO routines is already completely uniform,
so tools like pg_waldump should present the information driving the
process uniformly. There are two remaining WAL record types that still
don't quite follow this convention (heapam's VISIBLE record type and
SP-GiST's VACUUM_REDIRECT record type). They can be brought into line
by later work that totally standardizes how the cutoffs are presented.
Bump XLOG_PAGE_MAGIC.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CAH2-Wz=XytErMnb8FAyFd+OQEbiipB0Q2FmFdXrggPL4VBnRYQ@mail.gmail.com
During XLOG_HASH_SPLIT_ALLOCATE_PAGE replay, we were checking for a
cleanup lock on the new bucket page after acquiring an exclusive lock on
it and raising a PANIC error on failure. However, it is quite possible
that checkpointer can acquire the pin on the same page before acquiring a
lock on it, and then the replay will lead to an error. So instead, directly
acquire the cleanup lock on the new bucket page during
XLOG_HASH_SPLIT_ALLOCATE_PAGE replay operation.
Reported-by: Andres Freund
Author: Robert Haas
Reviewed-By: Amit Kapila, Andres Freund, Vignesh C
Backpatch-through: 11
Discussion: https://postgr.es/m/20220810022617.fvjkjiauaykwrbse@awork3.anarazel.de
Instead of dozens of mostly-duplicate pg_foo_aclcheck() functions,
write one common function object_aclcheck() that can handle almost all
of them. We already have all the information we need, such as which
system catalog corresponds to which catalog table and which column is
the ACL column.
There are a few pg_foo_aclcheck() that don't work via the generic
function and have special APIs, so those stay as is.
I also changed most pg_foo_aclmask() functions to static functions,
since they are not used outside of aclchk.c.
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://www.postgresql.org/message-id/flat/95c30f96-4060-2f48-98b5-a4392d3b6066@enterprisedb.com
Instead of dozens of mostly-duplicate pg_foo_ownercheck() functions,
write one common function object_ownercheck() that can handle almost
all of them. We already have all the information we need, such as
which system catalog corresponds to which catalog table and which
column is the owner column.
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://www.postgresql.org/message-id/flat/95c30f96-4060-2f48-98b5-a4392d3b6066@enterprisedb.com
The original report was concerned with a possible inconsistency
between the heap and the visibility map, which I was unable to
confirm. The concern has been retracted.
However, there did seem to be a torn page hazard when using
checksums. By not setting the heap page LSN during redo, the
protections of minRecoveryPoint were bypassed. Fixed, along with a
misleading comment.
It may have been impossible to hit this problem in practice, because
it would require a page tear between the checksum and the flags, so I
am marking this as a theoretical risk. But, as discussed, it did
violate expectations about the page LSN, so it may have other
consequences.
Backpatch to all supported versions.
Reported-by: Konstantin Knizhnik
Reviewed-by: Konstantin Knizhnik
Discussion: https://postgr.es/m/fed17dac-8cb8-4f5b-d462-1bb4908c029e@garret.ru
Backpatch-through: 11
Previously, trying to set storage parameters on a partitioned table
always led to "unrecognized parameter foo", because the code expected
there might be some valid parameters; but there aren't any. The docs
make clear that it's intended that there never will be any, so let's
replace this useless search with a more to-the-point message.
Simon Riggs and Karina Litskevich
Discussion: https://postgr.es/m/CANbhV-H=eZ9kTR9mUgKGK0Qv9uXP=U+dQg3rinQHfTdFMhBA2A@mail.gmail.com
Add a little to the header comments for these functions to make it
clearer what guarantees about commit behavior are provided to callers.
(See commit f92944137 for context.)
Although this is only a comment change, it's really documentation
aimed at authors of extensions, so it seems appropriate to back-patch.
Yugo Nagata and Tom Lane, per further discussion of bug #17434.
Discussion: https://postgr.es/m/17434-d9f7a064ce2a88a3@postgresql.org
This routine is designed to write zeros to a file using vectored I/O,
for a size given by its caller, being useful when it comes to
initializing a file with a final size already known.
XLogFileInitInternal() in xlog.c is changed to use this new routine when
initializing WAL segments with zeros (wal_init_zero enabled). Note that
the aligned buffers used for the vectored I/O writes have a size of
XLOG_BLCKSZ, and not BLCKSZ anymore, as pg_pwrite_zeros() relies on
PGAlignedBlock while xlog.c originally used PGAlignedXLogBlock.
This routine will be used in a follow-up patch to do the pre-padding of
WAL segments for pg_receivewal and pg_basebackup when these are not
compressed.
Author: Bharath Rupireddy
Reviewed-by: Nathan Bossart, Andres Freund, Thomas Munro, Michael
Paquier
Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com
Previously, the description of XLOG_RUNNING_XACTS showed only
top-transaction XIDs and whether subtransactions overflowed. This commit
improves it to show individual subtransaction XIDs. This also improves the
description of overflowed subtransactions.
This additional information can be helpful for testing and debugging
purposes.
Author: Masahiko Sawada
Reviewd by: Fujii Masao, Kyotaro Horiguchi, Ashutosh Bapat, Bharath Rupireddy
Discussion: https://postgr.es/m/CAD21AoAqvaE+XEeXHHPdAGQPcCoGXxuoeutq_nWhUSQvTt5+tA@mail.gmail.com
We have various requirements when using a dlist_head to keep track of the
number of items in the list. This, traditionally, has been done by
maintaining a counter variable in the calling code. Here we tidy this up
by adding "dclist", which is very similar to dlist but also keeps track of
the number of items stored in the list.
Callers may use the new dclist_count() function when they need to know how
many items are stored. Obtaining the count is an O(1) operation.
For simplicity reasons, dclist and dlist both use dlist_node as their node
type and dlist_iter/dlist_mutable_iter as their iterator type. dclists
have all of the same functionality as dlists except there is no function
named dclist_delete(). To remove an item from a list dclist_delete_from()
must be used. This requires knowing which dclist the given item is stored
in.
Additionally, here we also convert some dlists where additional code
exists to keep track of the number of items stored and to make these use
dclists instead.
Author: David Rowley
Reviewed-by: Bharath Rupireddy, Aleksander Alekseev
Discussion: https://postgr.es/m/CAApHDvrtVxr+FXEX0VbViCFKDGxA3tWDgw9oFewNXCJMmwLjLg@mail.gmail.com
This is similar to 7d25958, and this commit takes care of all the
remaining inconsistencies between the initial value used in the C
variable associated to a GUC and its default value stored in the GUC
tables (as of pg_settings.boot_val).
Some of the initial values of the GUCs updated rely on a compile-time
default. These are refactored so as the GUC table and its C declaration
use the same values. This makes everything consistent with other
places, backend_flush_after, bgwriter_flush_after, port,
checkpoint_flush_after doing so already, for example.
Extracted from a larger patch by Peter Smith. The spots updated in the
modules are from me.
Author: Peter Smith, Michael Paquier
Reviewed-by: Nathan Bossart, Tom Lane, Justin Pryzby
Discussion: https://postgr.es/m/CAHut+PtHE0XSfjjRQ6D4v7+dqzCw=d+1a64ujra4EX8aoc_Z+w@mail.gmail.com
Here we add a new 'copy' parameter to tuplesort_getdatum so that we can
instruct the function not to datumCopy() byref Datums before returning.
Similar to 91e9e89dc, this can provide significant performance
improvements in nodeSort when sorting by a single byref column and the
sort's targetlist contains only that column.
This allows us to re-enable Datum sorts for byref types which was disabled
in 3a5817695 due to a reported memory leak.
Additionally, here we slightly optimize DISTINCT aggregates so that we no
longer perform any datumCopy() when we find the current value not to be
distinct from the previous value. Previously the code would always take a
copy of the most recent Datum and pfree the previous value, even when the
values were the same. Testing shows a small but noticeable performance
increase when aggregate transitions are skipped due to the current
transition value being the same as the prior one.
Author: David Rowley
Discussion: https://postgr.es/m/CAApHDvqS6wC5U==k9Hd26E4EQXH3QR67-T4=Q1rQ36NGvjfVSg@mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvqHonfe9G1cVaKeHbDx70R_zCrM3qP2AGXpGrieSKGnhA@mail.gmail.com
Commit df3737a651 added an incorrect assertion about the preconditions
for invoking the backup cleanup callback: it misfires at session end in
case a backup completes successfully. Fix it, using coding from Michaël
Paquier. Also add some tests for the various cases.
Reported by Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20221021.161038.1277961198945653224.horikyota.ntt@gmail.com
Since pg_backup_start() and pg_backup_stop() exist, the tablespace map
data and the backup state data (backup_label string until 7d70809) have
been allocated in the TopMemoryContext. This approach would cause
memory leaks in the session calling these functions if failures happen
before pg_backup_stop() ends, leaking more memory on repeated failures.
Both things need little memory so that would not be really noticeable
for most users, except perhaps connection poolers with long-lived
connections able to trigger backup failures with these functions.
This commit improves the logic in this area by not allocating anymore
the backup-related data that needs to travel across the SQL-callable
backup functions in TopMemoryContext, by using instead a dedicated
memory context child of TopMemoryContext. The memory context is created
in pg_backup_start() and deleted when finishing pg_backup_stop(). In
the event of an in-flight failure, this memory context gets reset in the
follow-up pg_backup_start() call, so as we are sure that only one run
worth of data is leaked at any time. Some cleanup was already done for
the backup data on a follow-up call of pg_backup_start(), but using a
memory context makes the whole simpler.
BASE_BACKUP commands are executed in isolation, relying on the memory
context created for replication commands, hence these do not need such
an extra logic.
Author: Bharath Rupireddy
Reviewed-by: Robert Haas, Alvaro Herrera, Cary Huang, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACXqvfKF2B0beQ=aJMdWnpNohmBPsRg=EDQj_6y1t2O8mQ@mail.gmail.com
After commit 39969e2a1e, ->forcePageWrites is no longer very
interesting: we can just test whether runningBackups is different from 0.
This simplifies some code, so do away with it.
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/39969e2a1e4d7f5a37f3ef37d53bbfe171e7d77a
We had two copies of almost identical logic to revert shared memory
state when a running backup aborts; we can remove
pg_backup_start_callback if we adapt do_pg_abort_backup so that it can
be used for this purpose too.
However, in order for this to work, we have to repurpose the flag passed
to do_pg_abort_backup. It used to indicate whether to throw a warning
(and the only caller always passed true). It now indicates whether the
callback is being called at start time (in which case the session backup
state is known not to have been set to RUNNING yet, so action is always
taken) or shmem time (in which case action is only taken if the session
backup state is RUNNING). Thus the meaning of the flag is no longer
superfluous, but it's actually quite critical to get right. I (Álvaro)
chose to change the polarity and the code flow re. the flag from what
Bharath submitted, for coding clarity.
Co-authored-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/20221013111330.564fk5tkwe3ha77l%40alvherre.pgsql
Per discussion, the existing routine name able to initialize a SRF
function with materialize mode is unpopular, so rename it. Equally, the
flags of this function are renamed, as of:
- SRF_SINGLE_USE_EXPECTED -> MAT_SRF_USE_EXPECTED_DESC
- SRF_SINGLE_BLESS -> MAT_SRF_BLESS
The previous function and flags introduced in 9e98583 are kept around
for compatibility purposes, so as any extension code already compiled
with v15 continues to work as-is. The declarations introduced here for
compatibility will be removed from HEAD in a follow-up commit.
The new names have been suggested by Andres Freund and Melanie
Plageman.
Discussion: https://postgr.es/m/20221013194820.ciktb2sbbpw7cljm@awork3.anarazel.de
Backpatch-through: 15
Previously GetCurrentTransactionStopTimestamp() computed a new timestamp
whenever xactStopTimestamp was unset and xactStopTimestamp was only set when a
commit or abort record was written.
An upcoming patch will add additional calls to
GetCurrentTransactionStopTimestamp() from pgstats. To avoid computing
timestamps multiple times, set xactStopTimestamp in
GetCurrentTransactionStopTimestamp() if not already set.
Author: Dave Page <dpage@pgadmin.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Vik Fearing <vik@postgresfriends.org>
Discussion: https://postgr.es/m/20220906155325.an3xesq5o3fq36gt%40awork3.anarazel.de
Contrary to what is documented in src/backend/access/transam/README,
ginHeapTupleFastInsert() had a few ordering issues with the way it does
its WAL operations when inserting items in its fast path.
First, when using a separate list, XLogBeginInsert() was being always
called before START_CRIT_SECTION(), and in this case a second thing was
wrong when merging lists, as an exclusive lock was taken on the tail
page *before* calling XLogBeginInsert(). Finally, when inserting items
into a tail page, the order of XLogBeginInsert() and
START_CRIT_SECTION() was reversed. This commit addresses all these
issues by moving the calls of XLogBeginInsert() after all the pages
logged are locked and pinned, within a critical section.
Author: Matthias van de Meent, Zhang Mingli
Discussion: https://postgr.es/m/CAEze2WhL8uLMqynnnCu1LAPwxD5RKEo0nHV+eXGg_N6ELU88HQ@mail.gmail.com
An LSN was calculated from a segment number, a segment size and a
position offset, matching exactly the LSN given by the caller of
XLogReaderValidatePageHeader(). This change removes the extra LSN
calculation, relying only on the LSN given by the function caller
instead.
Author: Bharath Rupireddy
Reviewed-by: Richard Guo, Álvaro Herrera, Kyotaro Horiguchi
Discussion: https://postgr.es/m/CALj2ACXuh4Ms9j9sxMYdtHEe=5sFcyrs-GAHyADu_A_G71kZTg@mail.gmail.com
In a similar effort to f01592f91, here we mostly rename shadowed local
variables to remove the warnings produced when compiling with
-Wshadow=compatible-local.
This fixes 63 warnings and leaves just 5.
Author: Justin Pryzby, David Rowley
Reviewed-by: Justin Pryzby
Discussion https://postgr.es/m/20220817145434.GC26426%40telsasoft.com
This cleans up a couple of areas:
- Remove XLogSegNo calculation for the last WAL segment in backup in
xlog.c (7d70809 has moved this logic entirely to xlogbackup.c when
building the contents of the backup history file).
- Remove check on log_min_duration in analyze.c, as it is already true
where this code path is reached.
- Simplify call to find_option() in guc.c.
Author: Ranier Vilela
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/CAEudQArCDQQiPiFR16=yu9k5s2tp4tgEe1U1ZbkW4ofx81AWWQ@mail.gmail.com
Commit 34f581c39 intended to ensure that RelationGetBufferForTuple
would acquire a visibility-map page pin in case the otherBuffer's
all-visible bit had become set since we last had lock on that page.
But I missed a case: when we're extending the relation, VM concerns
were dealt with only in the relatively-less-likely case that we
fail to conditionally lock the otherBuffer. I think I'd believed
that we couldn't need to worry about it if the conditional lock
succeeds, which is true for the target buffer; but the otherBuffer
was unlocked for awhile so its bit might be set anyway. So we need
to do the GetVisibilityMapPins dance, and then also recheck the
page's free space, in both cases.
Per report from Jaime Casanova. Back-patch to v12 as the previous
patch was (although there's still no evidence that the bug is
reachable pre-v14).
Discussion: https://postgr.es/m/E1lWLjP-00006Y-Ml@gemulon.postgresql.org
SYSTEM_USER is a reserved keyword of the SQL specification that,
roughly described, is aimed at reporting some information about the
system user who has connected to the database server. It may include
implementation-specific information about the means by the user
connected, like an authentication method.
This commit implements SYSTEM_USER as of auth_method:identity, where
"auth_method" is a keyword about the authentication method used to log
into the server (like peer, md5, scram-sha-256, gss, etc.) and
"identity" is the authentication identity as introduced by 9afffcb (peer
sets authn to the OS user name, gss to the user principal, etc.). This
format has been suggested by Tom Lane.
Note that thanks to d951052, SYSTEM_USER is available to parallel
workers.
Bump catalog version.
Author: Bertrand Drouvot
Reviewed-by: Jacob Champion, Joe Conway, Álvaro Herrera, Michael Paquier
Discussion: https://postgr.es/m/7e692b8c-0b11-45db-1cad-3afc5b57409f@amazon.com
Commits cf112c12 and a0dc8271 were a little too hasty in getting rid of
the pg_ prefixes where we use pread(), pwrite() and vectored variants.
We dropped support for ancient Unixes where we needed to use lseek() to
implement replacements for those, but it turns out that Windows also
changes the current position even when you pass in an offset to
ReadFile() and WriteFile() if the file handle is synchronous, despite
its documentation saying otherwise.
Switching to asynchronous file handles would fix that, but have other
complications. For now let's just put back the pg_ prefix and add some
comments to highlight the non-standard side-effect, which we can now
describe as Windows-only.
Reported-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20220923202439.GA1156054%40nathanxps13
Fetch the next-item pointer before the call not after, so that
we aren't dereferencing a dangling pointer if the callback
deregistered itself during the call. The risky coding pattern
appears in CallXactCallbacks, CallSubXactCallbacks, and
ResourceOwnerReleaseInternal. (There are some other places that
might be at hazard if they offered deregistration functionality,
but they don't.)
I (tgl) considered back-patching this, but desisted because it
wouldn't be very safe for extensions to rely on this working in
pre-v16 branches.
Hao Wu
Discussion: https://postgr.es/m/CAH+9SWXTiERkmhRke+QCcc+jRH8d5fFHTxh8ZK0-Yn4BSpyaAg@mail.gmail.com
There are still some alignment-related failures in the buildfarm,
which might or might not be able to be fixed quickly, but I've also
just realized that it increased the size of many WAL records by 4 bytes
because a block reference contains a RelFileLocator. The effect of that
hasn't been studied or discussed, so revert for now.
The previous macro implementations just cast the argument to a target
type but did not check whether the input type was appropriate. The
function implementation can do better type checking of the input type.
For the *GetDatumFast() macros, converting to an inline function
doesn't work in the !USE_FLOAT8_BYVAL case, but we can use
AssertVariableIsOfTypeMacro() to get a similar level of type checking.
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
RelFileNumbers are now assigned using a separate counter, instead of
being assigned from the OID counter. This counter never wraps around:
if all 2^56 possible RelFileNumbers are used, an internal error
occurs. As the cluster is limited to 2^64 total bytes of WAL, this
limitation should not cause a problem in practice.
If the counter were 64 bits wide rather than 56 bits wide, we would
need to increase the width of the BufferTag, which might adversely
impact buffer lookup performance. Also, this lets us use bigint for
pg_class.relfilenode and other places where these values are exposed
at the SQL level without worrying about overflow.
This should remove the need to keep "tombstone" files around until
the next checkpoint when relations are removed. We do that to keep
RelFileNumbers from being recycled, but now that won't happen
anyway. However, this patch doesn't actually change anything in
this area; it just makes it possible for a future patch to do so.
Dilip Kumar, based on an idea from Andres Freund, who also reviewed
some earlier versions of the patch. Further review and some
wordsmithing by me. Also reviewed at various points by Ashutosh
Sharma, Vignesh C, Amul Sul, Álvaro Herrera, and Tom Lane.
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
This was used as the returned result type of the generated contents for
the backup_label and backup history files. This is replaced by a simple
string, reducing the cleanup burden of all the callers of
build_backup_content().
Reviewed-by: Bharath Rupireddy
Discussion: https://postgr.es/m/YzERvNPaZivHEKZJ@paquier.xyz
This change simplifies some of the logic related to the generation and
creation of the backup_label and backup history files, which has become
unnecessarily complicated since the removal of the exclusive backup mode
in commit 39969e2. The code was previously generating the contents of
these files as a string (start phase for the backup_label and stop phase
for the backup history file), one problem being that the contents of the
backup_label string were scanned to grab some of its internal contents
at the stop phase.
This commit changes the logic so as we store the data required to build
these files in an intermediate structure named BackupState. The
backup_label file and backup history file strings are generated when
they are ready to be sent back to the client. Both files are now
generated with the same code path. While on it, this commit renames
some variables for clarity.
Two new files named xlogbackup.{c,h} are introduced in this commit, to
remove from xlog.c some of the logic around base backups. Note that
more could be moved to this new set of files.
Author: Bharath Rupireddy, Michael Paquier
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/CALj2ACXWwTDgJqCjdaPyfR7djwm6SrybGcrZyrvojzcsmt4FFw@mail.gmail.com
Autoconf is showing its age, fewer and fewer contributors know how to wrangle
it. Recursive make has a lot of hard to resolve dependency issues and slow
incremental rebuilds. Our home-grown MSVC build system is hard to maintain for
developers not using Windows and runs tests serially. While these and other
issues could individually be addressed with incremental improvements, together
they seem best addressed by moving to a more modern build system.
After evaluating different build system choices, we chose to use meson, to a
good degree based on the adoption by other open source projects.
We decided that it's more realistic to commit a relatively early version of
the new build system and mature it in tree.
This commit adds an initial version of a meson based build system. It supports
building postgres on at least AIX, FreeBSD, Linux, macOS, NetBSD, OpenBSD,
Solaris and Windows (however only gcc is supported on aix, solaris). For
Windows/MSVC postgres can now be built with ninja (faster, particularly for
incremental builds) and msbuild (supporting the visual studio GUI, but
building slower).
Several aspects (e.g. Windows rc file generation, PGXS compatibility, LLVM
bitcode generation, documentation adjustments) are done in subsequent commits
requiring further review. Other aspects (e.g. not installing test-only
extensions) are not yet addressed.
When building on Windows with msbuild, builds are slower when using a visual
studio version older than 2019, because those versions do not support
MultiToolTask, required by meson for intra-target parallelism.
The plan is to remove the MSVC specific build system in src/tools/msvc soon
after reaching feature parity. However, we're not planning to remove the
autoconf/make build system in the near future. Likely we're going to keep at
least the parts required for PGXS to keep working around until all supported
versions build with meson.
Some initial help for postgres developers is at
https://wiki.postgresql.org/wiki/Meson
With contributions from Thomas Munro, John Naylor, Stone Tickle and others.
Author: Andres Freund <andres@anarazel.de>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-By: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Discussion: https://postgr.es/m/20211012083721.hvixq4pnh2pixr3j@alap3.anarazel.de
If the ps display is not cleared at this point, the process could
continue displaying "recovering NNN" even if handling end-of-recovery
steps. df9274a has tackled that by providing some information with the
end-of-recovery checkpoint but 7ff23c6 has nullified the effect of the
first commit.
Per a suggestion from Justin, just clear the ps display when we are done
with recovery, so as no incorrect information is displayed. This may
get extended in the future, but for now restore the pre-7ff23c6
behavior.
Author: Justin Prysby
Discussion: https://postgr.es/m/20220913223954.GU31833@telsasoft.com
Backpatch-through: 15
clang 15+ will issue a set-but-not-used warning when the only
use of a variable is in autoincrements (e.g., "foo++;").
That's perfectly sensible, but it detects a few more cases that
we'd not noticed before. Silence the warnings with our usual
methods, such as PG_USED_FOR_ASSERTS_ONLY, or in one case by
actually removing a useless variable.
One thing that we can't nicely get rid of is that with %pure-parser,
Bison emits "yynerrs" as a local variable that falls foul of this
warning. To silence those, I inserted "(void) yynerrs;" in the
top-level productions of affected grammars.
Per recently-established project policy, this is a candidate
for back-patching into out-of-support branches: it suppresses
annoying compiler warnings but changes no behavior. Hence,
back-patch to 9.5, which is as far as these patches go without
issues. (A preliminary check shows that the prior branches
need some other set-but-not-used cleanups too, so I'll leave
them for another day.)
Discussion: https://postgr.es/m/514615.1663615243@sss.pgh.pa.us
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in storage, catalog,
access method, executor, and logical replication code, as well as in
miscellaneous utility/library code.
Like other recent commits that cleaned up function parameter names, this
commit was written with help from clang-tidy. Later commits will do the
same for other parts of the codebase.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com